Re: [Nutch-general] Recrawling question

Stefan Neufeind Tue, 06 Jun 2006 14:01:06 -0700

You are missing the actual indexing of the pages :-) The "crawl" command
does this for you as part of its all-in-one run. After you have fetched
everything, use:


nutch invertlinks ...
nutch index ...

Hope that helps. Otherwise let me know and I'll dig out the complete
command lines for you.
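
For reference, here is a rough sketch of what the two steps might look like
when spelled out. It assumes a Nutch 0.8-style layout where the crawl
directory contains crawldb, linkdb, segments and indexes; the directory name
"crawl", the NUTCH and DRY_RUN variables and the run helper are only
illustrative, and the argument order should be checked against the usage
message your bin/nutch prints. By default it only echoes the commands.

```shell
#!/bin/sh
# Sketch only: paths and argument order are assumptions (Nutch 0.8-style usage).
CRAWL_DIR="${CRAWL_DIR:-crawl}"   # hypothetical crawl directory
NUTCH="${NUTCH:-bin/nutch}"       # path to the nutch launcher script
DRY_RUN="${DRY_RUN:-1}"           # 1 = only print the commands, do not run them

# Print a command, and execute it unless DRY_RUN is 1.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

post_fetch_steps() {
    # Rebuild the link database from all fetched segments.
    run "$NUTCH" invertlinks "$CRAWL_DIR/linkdb" -dir "$CRAWL_DIR/segments"
    # Index the segments using the crawl db and the link db.
    run "$NUTCH" index "$CRAWL_DIR/indexes" "$CRAWL_DIR/crawldb" \
        "$CRAWL_DIR/linkdb" "$CRAWL_DIR"/segments/*
}

post_fetch_steps
```

Run it with DRY_RUN=0 once the printed commands look right for your setup.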


Regards,
 Stefan

Matthew Holt wrote:
> Just FYI.. After I do the recrawl, I do stop and start tomcat, and still
> the newly created page can not be found.
> 
> Matthew Holt wrote:
> 
>> The recrawl worked this time, and I recrawled the entire db using the
>> -adddays argument (in my case ./recrawl crawl 10 31). However, it
>> didn't find a newly created page.
>>
>> If I delete the database and do the initial crawl over again, the new
>> page is found. Any idea what I'm doing wrong or why it isn't finding it?
>>
>> Thanks!
>> Matt
>>
>> Matthew Holt wrote:
>>
>>> Stefan,
>>>  Thanks a bunch! I see what you mean..
>>> matt
>>>
>>> Stefan Neufeind wrote:
>>>
>>>> Matthew Holt wrote:
>>>>  
>>>>
>>>>> Hi all,
>>>>>  I have already successfully indexed all the files on my domain only
>>>>> (as specified in the conf/crawl-urlfilter.txt file).
>>>>>
>>>>> Now when I use the below script (./recrawl crawl 10 31) to recrawl the
>>>>> domain, it begins indexing pages off of my domain (such as wikipedia,
>>>>> etc). How do I prevent this? Thanks!
>>>>>   
>>>>
>>>>
>>>>
>>>> Hi Matt,
>>>>
>>>> have a look at regex-urlfilter. "crawl" is special in some ways.
>>>> Actually it's a "shortcut" for several steps, and it has its own
>>>> urlfilter file. But if you run the steps separately, that
>>>> urlfilter file is no longer used.
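
As an illustration of the urlfilter point quoted above, the following is a
minimal sketch that prints rules of the kind that could be put into
conf/regex-urlfilter.txt to keep a step-by-step recrawl on one domain. The
function name domain_filter_rules and the domain example.org are
hypothetical; the rule format (a leading + or - followed by a regular
expression, first match wins) is the one used by the stock urlfilter files.
It only prints the rules and does not touch any config file.

```shell
#!/bin/sh
# Sketch only: prints candidate rules; review before adding them to
# conf/regex-urlfilter.txt. The domain passed in is a placeholder.
domain_filter_rules() {
    # Escape dots so they match literally inside the regular expression.
    domain_re=$(printf '%s' "$1" | sed 's/\./\\./g')
    cat <<EOF
# accept URLs on $1 and its subdomains
+^http://([a-z0-9]*\\.)*${domain_re}/
# skip everything else
-.
EOF
}

domain_filter_rules example.org
```

If the file you are using ends with a catch-all "+." rule, off-domain URLs
would be accepted, which may explain the behaviour described.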


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
