Just FYI: after I do the recrawl, I stop and start Tomcat, and the newly created page still cannot be found.

Matthew Holt wrote:
> The recrawl worked this time, and I recrawled the entire db using the
> -adddays argument (in my case ./recrawl crawl 10 31). However, it
> didn't find a newly created page.
>
> If I delete the database and do the initial crawl over again, the new
> page is found. Any idea what I'm doing wrong or why it isn't finding it?
>
> Thanks!
> Matt
>
> Matthew Holt wrote:
>
>> Stefan,
>> Thanks a bunch! I see what you mean..
>> matt
>>
>> Stefan Neufeind wrote:
>>
>>> Matthew Holt wrote:
>>>
>>>> Hi all,
>>>> I have already successfully indexed all the files on my domain only
>>>> (as specified in the conf/crawl-urlfilter.txt file).
>>>>
>>>> Now when I use the below script (./recrawl crawl 10 31) to recrawl the
>>>> domain, it begins indexing pages off of my domain (such as wikipedia,
>>>> etc). How do I prevent this? Thanks!
>>>
>>> Hi Matt,
>>>
>>> have a look at regex-urlfilter. "crawl" is special in some ways.
>>> Actually it's a "shortcut" for several steps. And it has a special
>>> urlfilter-file. But if you do it in several steps that
>>> urlfilter-file is no longer used.
>>>
>>> Regards,
>>> Stefan
>
> _______________________________________________
> Nutch-general mailing list
> https://lists.sourceforge.net/lists/listinfo/nutch-general
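
For reference, here is a rough sketch of the point Stefan makes in the quoted thread about the two filter files. It is plain Python, not Nutch code. It assumes the filter files hold one rule per line, each a "+" or "-" followed by a regex, and that the first matching rule decides whether a URL is kept. The domain example.com, the rule sets, and the function names are all made up for illustration.

```python
import re

# Hypothetical rule set restricted to one domain (example.com is a placeholder),
# standing in for the filter file used by the one-shot "crawl" command.
DOMAIN_ONLY = r"""
# accept anything on example.com or its subdomains
+^http://([a-z0-9]*\.)*example\.com/
# reject everything else
-.
"""

# Hypothetical permissive rule set, standing in for a filter file that was
# never restricted to the domain.
PERMISSIVE = r"""
+.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    for name, text in (("domain-only", DOMAIN_ONLY), ("permissive", PERMISSIVE)):
        rules = parse_rules(text)
        for url in ("http://www.example.com/new-page.html",
                    "http://en.wikipedia.org/wiki/Nutch"):
            print(name, url, accepts(rules, url))
```

If the step-by-step tools read a file that behaves like the permissive set, off-domain links such as the wikipedia one pass the filter, which would match what I saw before restricting regex-urlfilter.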
