Re: [Nutch-general] Recrawling question

Matthew Holt Tue, 06 Jun 2006 15:04:59 -0700

It's writing the segments to a new directory and then, I believe, merging
them and the index... or am I reading the script wrong?


Stefan Neufeind wrote:

>Oh sorry, I didn't look up the script again from your earlier mail. Hmm,
>I guess you can live fine without the invertlinks (if I'm right). Are
>you sure that your indexing works fine? I think Nutch complains if an
>index already exists. See if there is any error with indexing. Also maybe
>try deleting your current index before indexing again.
>
>Still doesn't work?
>
>
>Regards,
> Stefan
>
>Matthew Holt wrote:
>  
>
>>Sorry to be asking so many questions.. Below is the current script I'm
>>using. It's indexing the segments.. so do I use invertlinks directly
>>after the fetch? I'm kind of confused.. thanks.
>>matt
>>    
>>
>
>[...]
>
>  
>
>>---------------------------------------------------------------
>>
>>Stefan Neufeind wrote:
>>
>>    
>>
>>>You're missing the step that actually indexes the pages :-) This is
>>>done inside the "crawl" command, which does everything in one go. After
>>>you've fetched everything, use:
>>>
>>>nutch invertlinks ...
>>>nutch index ...
>>>
>>>Hope that helps. Otherwise let me know and I'll dig out the complete
>>>command lines for you.
>>>
>>>
>>>Regards,
>>>Stefan
>>>
>>>Matthew Holt wrote:
>>> 
>>>
>>>      
>>>
>>>>Just FYI: after the recrawl I stop and start Tomcat, and the newly
>>>>created page still cannot be found.
>>>>
>>>>Matthew Holt wrote:
>>>>
>>>>  
>>>>        
>>>>
>>>>>The recrawl worked this time, and I recrawled the entire db using the
>>>>>-adddays argument (in my case ./recrawl crawl 10 31). However, it
>>>>>didn't find a newly created page.
>>>>>
>>>>>If I delete the database and do the initial crawl over again, the new
>>>>>page is found. Any idea what I'm doing wrong or why it isn't finding
>>>>>it?
>>>>>
>>>>>Thanks!
>>>>>Matt
>>>>>
>>>>>Matthew Holt wrote:
>>>>>
>>>>>    
>>>>>          
>>>>>
>>>>>>Stefan,
>>>>>>Thanks a bunch! I see what you mean..
>>>>>>matt
>>>>>>
>>>>>>Stefan Neufeind wrote:
>>>>>>
>>>>>>      
>>>>>>            
>>>>>>
>>>>>>>Matthew Holt wrote:
>>>>>>>
>>>>>>>
>>>>>>>        
>>>>>>>              
>>>>>>>
>>>>>>>>Hi all,
>>>>>>>>I have already successfully indexed all the files on my domain only (as
>>>>>>>>specified in the conf/crawl-urlfilter.txt file).
>>>>>>>>
>>>>>>>>Now when I use the below script (./recrawl crawl 10 31) to recrawl the
>>>>>>>>domain, it begins indexing pages outside my domain (such as Wikipedia,
>>>>>>>>etc.). How do I prevent this? Thanks!
>>>>>>>> 
>>>>>>>>          
>>>>>>>>                
>>>>>>>>
>>>>>>>Hi Matt,
>>>>>>>
>>>>>>>have a look at regex-urlfilter. "crawl" is special in some ways.
>>>>>>>Actually it's a "shortcut" for several steps, and it has its own
>>>>>>>urlfilter file. But if you run those steps separately, that urlfilter
>>>>>>>file is no longer used.
>>>>>>>              
>>>>>>>
>
>  
>
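
For reference, the two post-fetch steps Stefan mentions (invertlinks, then
index) might look roughly like the sketch below. It is a dry run that only
prints the commands. The directory names (crawldb, linkdb, segments, indexes),
the "crawl" argument and the argument order are assumptions based on the
Nutch 0.8-era tutorial layout, not something stated in this thread, and the
post_fetch_steps function is a hypothetical helper for illustration.

```shell
#!/bin/sh
# Dry-run sketch: prints the post-fetch commands instead of running them.
# ASSUMPTION: the crawldb/linkdb/segments/indexes layout and the argument
# order follow the Nutch 0.8-era tutorial; they are not taken from this
# thread. Check your version's usage output before relying on them.
post_fetch_steps() {
  crawl_dir="$1"   # hypothetical crawl directory, e.g. "crawl"
  # 1. Build the link database from all fetched segments.
  echo "bin/nutch invertlinks $crawl_dir/linkdb $crawl_dir/segments/*"
  # 2. Index the segments, using the crawldb and the linkdb.
  echo "bin/nutch index $crawl_dir/indexes $crawl_dir/crawldb $crawl_dir/linkdb $crawl_dir/segments/*"
}

post_fetch_steps crawl
```

Dropping the echo wrappers would run the steps for real. As noted earlier in
the thread, an already existing index may have to be removed or written to a
different directory first.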


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
