You're missing the step that actually indexes the pages :-) That happens inside the "crawl" command, which does everything in one go. After you have fetched everything, run:
  nutch invertlinks ...
  nutch index ...

Hope that helps. Otherwise let me know and I'll dig out the complete
command lines for you.

Regards,
 Stefan

Matthew Holt wrote:
> Just FYI.. After I do the recrawl, I do stop and start tomcat, and still
> the newly created page cannot be found.
>
> Matthew Holt wrote:
>
>> The recrawl worked this time, and I recrawled the entire db using the
>> -adddays argument (in my case ./recrawl crawl 10 31). However, it
>> didn't find a newly created page.
>>
>> If I delete the database and do the initial crawl over again, the new
>> page is found. Any idea what I'm doing wrong or why it isn't finding it?
>>
>> Thanks!
>> Matt
>>
>> Matthew Holt wrote:
>>
>>> Stefan,
>>> Thanks a bunch! I see what you mean..
>>> matt
>>>
>>> Stefan Neufeind wrote:
>>>
>>>> Matthew Holt wrote:
>>>>
>>>>> Hi all,
>>>>> I have already successfully indexed all the files on my domain only
>>>>> (as specified in the conf/crawl-urlfilter.txt file).
>>>>>
>>>>> Now when I use the below script (./recrawl crawl 10 31) to recrawl the
>>>>> domain, it begins indexing pages off of my domain (such as wikipedia,
>>>>> etc). How do I prevent this? Thanks!
>>>>>
>>>>
>>>> Hi Matt,
>>>>
>>>> have a look at regex-urlfilter. "crawl" is special in some ways.
>>>> Actually it's a "shortcut" for several steps. And it has a special
>>>> urlfilter-file. But if you do it in several steps that
>>>> urlfilter-file is no longer used.
