I am trying to crawl one of my sites using trunk, and it
isn't working.  I created a file called urls that
contains just the site

http://shopthar.com/

In my conf/crawl-urlfilter.txt I have

+^http://shopthar.com/
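
The rest of the file is the stock default.  Reconstructing
it from memory of the 0.7-era trunk (so treat the exact
rules as approximate), the relevant part looks something
like:

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't parse (list abbreviated here)
-\.(gif|GIF|jpg|JPG|png|PNG|css|zip|exe)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept my site
+^http://shopthar.com/

# skip everything else
-.

My understanding is that the first matching rule wins, so
the + line for my site has to sit above that final
catch-all -. rule, and anything with a ? or = in the URL
gets dropped by the -[?*!@=] line.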

I then do

bin/nutch crawl urls -dir crawl.test -depth 100 -threads 20
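
After it finishes I can poke at the webdb with something
like this (I'm going from memory on the readdb options,
so the exact flag may differ on trunk):

bin/nutch readdb crawl.test/db -stats

which should report how many pages and links the updater
actually recorded.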

It kicks in, and I get repeating chunks like

051019 010450 Updating /home/nutch/nutch/trunk/crawl.test/db
051019 010450 Updating for /home/nutch/nutch/trunk/crawl.test/segments/20051019010449
051019 010450 Finishing update
051019 010450 Update finished
051019 010450 FetchListTool started
051019 010450 Overall processing: Sorted 0 entries in 0.0 seconds.
051019 010450 Overall processing: Sorted NaN entries/second
051019 010450 FetchListTool completed
051019 010450 logging at INFO

This goes on for ages, but I only ever see two Nutch hits
in my access log: one for robots.txt and one for my front
page.  Nothing else.  (The "Sorted 0 entries" lines make
it look like every fetchlist after the first one is
empty.)
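
I believe readdb can also dump the page URLs themselves
(again, flag name from memory, so it may be spelled
differently):

bin/nutch readdb crawl.test/db -dumppageurl

If that only lists the front page, then the outlinks are
never making it into the db at all.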

The "crawl" finishes, then I do a search and can only
get a hits for the front page.  When I do the search
via lynx, I get a momentary

Bad partial reference!  Stripping lead dots.

I can't imagine this is really the problem, but pretty
much all my links are relative.  I mean, Nutch has to
be able to follow relative links, right?

Ideas?

Thanks,
Earl
