Re: [Nutch-general] Recrawling question

Stefan Neufeind Tue, 06 Jun 2006 14:36:16 -0700

Oh sorry, I didn't look up the script from your earlier mail again. Hmm,
I guess you can live fine without invertlinks (if I'm right). Are you
sure that your indexing works? I think Nutch complains if an index
already exists, so check the indexing output for errors. You could also
try deleting your current index before indexing again.
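
As a rough sketch of the sequence discussed in this thread (delete the
old index, run invertlinks, then index): the directory names below
(crawl/crawldb, crawl/linkdb, crawl/segments, crawl/indexes) and the
argument order are assumptions modelled on a default "crawl" layout.
They differ between Nutch versions, so check the usage output of
bin/nutch before relying on them. The run and reindex helpers and the
DRY_RUN variable are made up for illustration.

```shell
#!/bin/sh
# Sketch only: paths and argument order are assumptions, not taken from
# the script discussed in this thread.

# Path to the nutch launcher; override via the NUTCH environment variable.
NUTCH="${NUTCH:-bin/nutch}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Rebuild the index for one crawl directory (hypothetical helper).
reindex() {
    crawl_dir="$1"
    # Remove the old index first, in case indexing refuses to overwrite it.
    run rm -rf "$crawl_dir/indexes"
    # Rebuild the link database from the segments, then index them.
    run "$NUTCH" invertlinks "$crawl_dir/linkdb" "$crawl_dir/segments"
    run "$NUTCH" index "$crawl_dir/indexes" "$crawl_dir/crawldb" \
        "$crawl_dir/linkdb" "$crawl_dir"/segments/*
}
```

Setting DRY_RUN=1 and then calling "reindex crawl" only prints the
commands, which is a safe way to check the paths before anything is
deleted.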


Still doesn't work?


Regards,
 Stefan

Matthew Holt wrote:
> Sorry to be asking so many questions.. Below is the current script I'm
> using. It's indexing the segments.. so do I use invertlinks directly
> after the fetch? I'm kind of confused.. thanks.
> matt

[...]

> ---------------------------------------------------------------
> 
> Stefan Neufeind wrote:
> 
>> You are missing the actual indexing of the pages :-) This is done
>> inside the "crawl" command, which does everything in one go. After
>> you have fetched everything, use:
>>
>> nutch invertlinks ...
>> nutch index ...
>>
>> Hope that helps. Otherwise let me know and I'll dig out the complete
>> command lines for you.
>>
>>
>> Regards,
>> Stefan
>>
>> Matthew Holt wrote:
>>  
>>
>>> Just FYI.. After I do the recrawl, I do stop and start tomcat, and still
>>> the newly created page can not be found.
>>>
>>> Matthew Holt wrote:
>>>
>>>   
>>>> The recrawl worked this time, and I recrawled the entire db using the
>>>> -adddays argument (in my case ./recrawl crawl 10 31). However, it
>>>> didn't find a newly created page.
>>>>
>>>> If I delete the database and do the initial crawl over again, the new
>>>> page is found. Any idea what I'm doing wrong or why it isn't finding
>>>> it?
>>>>
>>>> Thanks!
>>>> Matt
>>>>
>>>> Matthew Holt wrote:
>>>>
>>>>     
>>>>> Stefan,
>>>>> Thanks a bunch! I see what you mean..
>>>>> matt
>>>>>
>>>>> Stefan Neufeind wrote:
>>>>>
>>>>>       
>>>>>> Matthew Holt wrote:
>>>>>>
>>>>>>
>>>>>>         
>>>>>>> Hi all,
>>>>>>> I have already successfully indexed all the files on my domain only
>>>>>>> (as specified in the conf/crawl-urlfilter.txt file).
>>>>>>>
>>>>>>> Now when I use the below script (./recrawl crawl 10 31) to recrawl
>>>>>>> the domain, it begins indexing pages off of my domain (such as
>>>>>>> wikipedia, etc). How do I prevent this? Thanks!
>>>>>>>  
>>>>>>>           
>>>>>>
>>>>>> Hi Matt,
>>>>>>
>>>>>> have a look at regex-urlfilter. "crawl" is special in some ways:
>>>>>> it's actually a "shortcut" for several steps, and it has its own
>>>>>> urlfilter file. But if you run the steps separately, that
>>>>>> urlfilter file is no longer used.


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
