Re: [Nutch-general] Recrawling question

Matthew Holt Tue, 06 Jun 2006 13:39:12 -0700

The recrawl worked this time, and I recrawled the entire db using the 
-adddays argument (in my case ./recrawl crawl 10 31). However, it didn't 
find a newly created page.


If I delete the database and do the initial crawl over again, the new 
page is found. Any idea what I'm doing wrong or why it isn't finding it?

Thanks!
Matt

Matthew Holt wrote:

> Stefan,
>  Thanks a bunch! I see what you mean.
> matt
>
> Stefan Neufeind wrote:
>
>> Matthew Holt wrote:
>>  
>>
>>> Hi all,
>>>  I have already successfully indexed all the files on my domain only (as
>>> specified in the conf/crawl-urlfilter.txt file).
>>>
>>> Now when I use the below script (./recrawl crawl 10 31) to recrawl the
>>> domain, it begins indexing pages off of my domain (such as wikipedia,
>>> etc). How do I prevent this? Thanks!
>>>   
>>
>>
>> Hi Matt,
>>
>> have a look at regex-urlfilter. "crawl" is special in some ways.
>> Actually it's a "shortcut" for several steps, and it has its own special
>> urlfilter file. But if you run the steps separately, that urlfilter file
>> is no longer used.
>>
>>
>> Regards,
>> Stefan
>>
>>  
>>
>
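
A note on Stefan's point, as a minimal sketch: the one-step "crawl" command
reads conf/crawl-urlfilter.txt, while the individual steps that a recrawl
script runs are filtered by conf/regex-urlfilter.txt instead. Assuming that
is the setup here, the domain restriction has to be repeated in
regex-urlfilter.txt. The domain example.com below is a placeholder and is not
taken from this thread. Each rule is a regular expression prefixed with +
(accept) or - (reject), and the first matching rule decides.

```
# conf/regex-urlfilter.txt (sketch; example.com is a placeholder domain)

# accept URLs on the placeholder domain and its subdomains
+^http://([a-z0-9]*\.)*example\.com/

# reject everything else
-.
```

Because the first match decides, these lines need to come before any
catch-all accept rule such as "+." that the file may already contain.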


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
