Re: [Nutch-general] Recrawling question

Matthew Holt Tue, 06 Jun 2006 13:56:58 -0700

Just FYI: after I do the recrawl, I stop and start Tomcat, and the 
newly created page still cannot be found.


Matthew Holt wrote:

> The recrawl worked this time, and I recrawled the entire db using the 
> -adddays argument (in my case ./recrawl crawl 10 31). However, it 
> didn't find a newly created page.
>
> If I delete the database and do the initial crawl over again, the new 
> page is found. Any idea what I'm doing wrong or why it isn't finding it?
>
> Thanks!
> Matt
>
> Matthew Holt wrote:
>
>> Stefan,
>>  Thanks a bunch! I see what you mean..
>> matt
>>
>> Stefan Neufeind wrote:
>>
>>> Matthew Holt wrote:
>>>  
>>>
>>>> Hi all,
>>>>  I have already successfully indexed all the files on my domain only
>>>> (as specified in the conf/crawl-urlfilter.txt file).
>>>>
>>>> Now when I use the below script (./recrawl crawl 10 31) to recrawl the
>>>> domain, it begins indexing pages off of my domain (such as wikipedia,
>>>> etc). How do I prevent this? Thanks!
>>>>   
>>>
>>>
>>>
>>> Hi Matt,
>>>
>>> have a look at regex-urlfilter. "crawl" is special in some ways.
>>> Actually it's a "shortcut" for several steps, and it has a special
>>> urlfilter file. But if you run those steps individually, that
>>> urlfilter file is no longer used.
>>>
>>>
>>> Regards,
>>> Stefan
>>>
>>>  
>>>
>>
>
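
Stefan's point about the two filter files can be illustrated with a small
sketch. It is not Nutch code: it only mimics the "first matching rule
decides" behaviour of the URL filter files, and it assumes the stock
regex-urlfilter.txt ends with a catch-all accept rule while
crawl-urlfilter.txt ends with a catch-all reject. The domain example.com,
the rule lists, and the accepts() helper are all hypothetical.

```python
import re

# Hypothetical rule sets modelled on the two filter files; each entry is
# (sign, regex). example.com stands in for the real domain.
CRAWL_URLFILTER = [
    ("+", r"^http://([a-z0-9]*\.)*example\.com/"),  # accept own domain
    ("-", r"."),                                    # reject everything else
]
REGEX_URLFILTER = [
    ("+", r"."),                                    # assumed default: accept all
]


def accepts(rules, url):
    """Return True if the first rule whose regex matches the URL is a '+' rule.

    A URL that matches no rule at all is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for url in ("http://www.example.com/new.html",
                "http://en.wikipedia.org/wiki/Nutch"):
        print(url,
              "crawl-urlfilter:", accepts(CRAWL_URLFILTER, url),
              "regex-urlfilter:", accepts(REGEX_URLFILTER, url))
```

Under those assumptions, a recrawl script that runs the individual steps
would let off-domain links through until the domain rule is copied into
regex-urlfilter.txt and its catch-all accept is replaced by a reject.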


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
