Thanks! You're the man!!! Now I can automate this thing :).
Steve
http://www.stevekallestad.com/

On 2/8/07, chee wu <[EMAIL PROTECTED]> wrote:
> The crawl command uses "crawl-tool.xml" as its default Nutch config, but the
> recrawl script uses "nutch-site.xml". So just copy all of the configuration
> in "crawl-tool.xml" to "nutch-site.xml". As for selecting
> "crawl-urlfilter.txt", refer to the property below in your "crawl-tool.xml":
>
> <property>
>   <name>urlfilter.regex.file</name>
>   <value>crawl-urlfilter.txt</value>
> </property>
>
> ----- Original Message -----
> From: "Steve Kallestad" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Thursday, February 08, 2007 5:17 PM
> Subject: Recrawl not following crawl-urlfilter.txt
>
> > Please oh please, don't shoot me for being a newbie.
> >
> > I have set up a site search using Nutch, and I have the
> > crawl-urlfilter.txt file configured so that everything works properly
> > when I call something similar to:
> >
> > bin/nutch crawl urls -dir crawl -depth 3 -topN 100
> >
> > I grabbed the Intranet Recrawl script from
> > http://wiki.apache.org/nutch/IntranetRecrawl
> >
> > I noticed while it was running that Nutch was actually grabbing files I
> > didn't want it to grab, and it was also going off site to get others.
> > Obviously I don't want it to do that.
> >
> > On my site, without making a change to the crawl-urlfilter.txt file, Nutch
> > tries to fetch some non-existent files, probably because of some
> > JavaScript that I have, so I really need my recrawl to follow my original
> > guidelines.
> >
> > My question is: how can I modify the IntranetRecrawl script so that it
> > follows crawl-urlfilter.txt, or, barring that, where can I find a
> > documented list of steps to recrawl my site?
> >
> > Thanks,
> > Steve
> >
> > My Nutch is at:
> > http://www.stevekallestad.com/search/
> > in case anybody wants to check it out. I have the directory proxied
> > through Apache, which I thought was pretty cool.
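
For anyone who wants to script the fix from the quoted reply (copying the properties in "crawl-tool.xml" into "nutch-site.xml"), here is a minimal sketch in Python. It is not part of Nutch or of the IntranetRecrawl script. The function name merge_properties, the <configuration> root element, and the "regex-urlfilter.txt" value in the sample nutch-site content are assumptions made for illustration; check them against the files in your own install.

```python
import xml.etree.ElementTree as ET


def merge_properties(source_xml: str, target_xml: str) -> str:
    """Copy every <property> from source_xml into target_xml.

    A property already present in the target under the same <name>
    is replaced by the one from the source. (Hypothetical helper,
    not a Nutch API.)
    """
    source_root = ET.fromstring(source_xml)
    target_root = ET.fromstring(target_xml)

    # Index the target's existing top-level properties by name.
    existing = {p.findtext("name"): p for p in target_root.findall("property")}

    for prop in source_root.findall("property"):
        name = prop.findtext("name")
        if name in existing:
            # Drop the old definition so the copied one takes effect.
            target_root.remove(existing[name])
        target_root.append(prop)
        existing[name] = prop

    return ET.tostring(target_root, encoding="unicode")


# Sample contents; the real files live in your Nutch conf directory.
CRAWL_TOOL = """<configuration>
  <property>
    <name>urlfilter.regex.file</name>
    <value>crawl-urlfilter.txt</value>
  </property>
</configuration>"""

# Assumed starting point for nutch-site.xml (illustrative value only).
NUTCH_SITE = """<configuration>
  <property>
    <name>urlfilter.regex.file</name>
    <value>regex-urlfilter.txt</value>
  </property>
</configuration>"""

merged = merge_properties(CRAWL_TOOL, NUTCH_SITE)
print(merged)
```

To use it on real files, read the two XML files into strings, pass them to merge_properties, and write the result back to "nutch-site.xml" after making a backup. The recrawl script should then pick up the same urlfilter.regex.file setting that the crawl command uses.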
