Yeah, the big link on the homepage is <a href=/sitemap.html>browse</a>
which then opens several other pages. All the links on the site start with /, so I tried +^/ in my conf/crawl-urlfilter.txt, with no luck. Against my better judgement, I also tried +^/.*, which didn't work either.

Thanks,
Earl

--- Doug Cutting <[EMAIL PROTECTED]> wrote:

> The only link on http://shopthar.com/ to the domain shopthar.com is a
> link to http://shopthar.com/. So a crawl starting from that page that
> only visits pages in shopthar.com will only find that one page.
>
> % wget -q -O - http://shopthar.com/ | grep shopthar.com
> <tr><td colspan=2>Welcome to shopthar.com</td></td></tr>
> <a href=http://shopthar.com/>shopthar.com</a> |
>
> Doug
>
> Earl Cahill wrote:
> > I am trying to do a crawl on trunk of one of my sites,
> > and it isn't working. I make a file urls, that just
> > contains the site
> >
> > http://shopthar.com/
> >
> > in my conf/crawl-urlfilter.txt I have
> >
> > +^http://shopthar.com/
> >
> > I then do
> >
> > bin/nutch crawl urls -dir crawl.test -depth 100 -threads 20
> >
> > it kicks in and I get repeating chunks like
> >
> > 051019 010450 Updating /home/nutch/nutch/trunk/crawl.test/db
> > 051019 010450 Updating for /home/nutch/nutch/trunk/crawl.test/segments/20051019010449
> > 051019 010450 Finishing update
> > 051019 010450 Update finished
> > 051019 010450 FetchListTool started
> > 051019 010450 Overall processing: Sorted 0 entries in 0.0 seconds.
> > 051019 010450 Overall processing: Sorted NaN entries/second
> > 051019 010450 FetchListTool completed
> > 051019 010450 logging at INFO
> >
> > For ages, but I only see two nutch hits in my access
> > log: one for my robots.txt and one for my front page.
> > Nothing else.
> >
> > The "crawl" finishes, then I do a search and can only
> > get a hits for the front page. When I do the search
> > via lynx, I get a momentary
> >
> > Bad partial reference! Stripping lead dots.
> >
> > I can't imagine this is really the problem, but pretty
> > well all my links are relative. I mean nutch has to
> > be able to follow relative links, right?
> >
> > Ideas?
> >
> > Thanks,
> > Earl
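
A note on why a pattern like +^/ cannot match anything: as far as I understand it, Nutch resolves each relative link against the URL of the page it was found on, and only then tests the resulting absolute URL against the patterns in conf/crawl-urlfilter.txt, with the first matching pattern deciding whether the URL is kept. The Python sketch below only imitates that behaviour under those assumptions. It is not Nutch code, and resolve() and accepts() are hypothetical helper names made up for the illustration.

```python
import re
from urllib.parse import urljoin


def resolve(base, href):
    """Turn a possibly relative href into an absolute URL (hypothetical helper)."""
    return urljoin(base, href)


def accepts(rules, url):
    """Imitate a +/- regex filter list (assumed semantics, not Nutch's code).

    Each rule is '+regex' or '-regex'. The first rule whose regex is found
    in the URL decides; if no rule matches, the URL is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


base = "http://shopthar.com/"
url = resolve(base, "/sitemap.html")  # the relative link from the homepage

print(url)
print(accepts(["+^/"], url))                     # absolute URL never starts with "/"
print(accepts(["+^/.*"], url))                   # same problem
print(accepts(["+^http://shopthar.com/"], url))  # the original filter line
```

If those assumptions hold, the original +^http://shopthar.com/ line already covers /sitemap.html once it has been resolved, so the filter itself is probably not what stops the crawl after the front page.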
