Please oh please, don't shoot me for being a newbie. I have set up a site search using Nutch, and I have the crawl-urlfilter.txt file configured so that everything works properly when I call something similar to:
    bin/nutch crawl urls -dir crawl -depth 3 -topN 100

I grabbed the Intranet Recrawl script from the wiki (http://wiki.apache.org/nutch/IntranetRecrawl). While it was running, I noticed that Nutch was actually grabbing files I didn't want it to grab, and it was also going off-site to get others. Obviously I don't want it to do that. On my site, without my making any change to the crawl-urlfilter.txt file, Nutch is trying to fetch some non-existent files, probably because of some JavaScript that I have, so I really need my recrawl to follow my original guidelines.

My question is: how can I modify the IntranetRecrawl script so that it follows crawl-urlfilter.txt, or, barring that, where can I find a documented list of steps to recrawl my site?

Thanks,
Steve

My Nutch is at http://www.stevekallestad.com/search/ in case anybody wants to check it out. I have the directory proxied through Apache, which I thought was pretty cool.
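
P.S. In case it helps with debugging, my crawl-urlfilter.txt is essentially the stock one with my domain filled in, roughly like this (I've trimmed the suffix list for brevity):

    # skip file:, ftp:, and mailto: urls
    -^(file|ftp|mailto):

    # skip image and other suffixes we can't parse
    -\.(gif|jpg|png|css|zip|gz|exe)$

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

    # accept hosts in stevekallestad.com
    +^http://([a-z0-9]*\.)*stevekallestad.com/

    # skip everything else
    -.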

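P.P.S. For the curious, the Apache proxying is nothing fancy, just mod_proxy. My httpd.conf has something along these lines (the Nutch webapp sits in Tomcat; the port and paths here are from memory, so treat them as illustrative rather than exact):

    # forward /search/ to the Nutch webapp, and rewrite redirects back
    ProxyPass        /search/ http://localhost:8080/
    ProxyPassReverse /search/ http://localhost:8080/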