The only link on http://shopthar.com/ to the domain shopthar.com is a
link to http://shopthar.com/ itself. So a crawl starting from that page
and restricted to shopthar.com will find only that one page.
% wget -q -O - http://shopthar.com/ | grep shopthar.com
<tr><td colspan=2>Welcome to shopthar.com</td></td></tr>
<a href=http://shopthar.com/>shopthar.com</a> |
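Note that the grep above only finds absolute links; relative links would also resolve back into the domain and pass the urlfilter. As a quick sanity check (a minimal sketch, not what Nutch's parser actually does), you can extract every href, resolve it against the base URL, and apply the same prefix filter:

```python
# Sketch: extract all hrefs from a page, resolve relative ones against
# the base URL, and keep those matching the +^http://shopthar.com/ filter.
# The sample HTML below is illustrative, not the real page contents.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # urljoin turns relative hrefs into absolute URLs
                    self.links.append(urljoin(self.base, value))

base = "http://shopthar.com/"
html = '<a href=http://shopthar.com/>shopthar.com</a> <a href="about.html">about</a>'
parser = LinkExtractor(base)
parser.feed(html)
in_domain = [u for u in parser.links if u.startswith("http://shopthar.com/")]
print(in_domain)  # relative "about.html" resolves to http://shopthar.com/about.html
```

If a relative link like "about.html" existed on the page, it would resolve to http://shopthar.com/about.html and be crawled; the problem here is that the page simply has no other links.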
Doug
Earl Cahill wrote:
and it isn't working. I made a file, urls, that just
contains the site
http://shopthar.com/
In my conf/crawl-urlfilter.txt I have
+^http://shopthar.com/
I then run
bin/nutch crawl urls -dir crawl.test -depth 100 -threads 20
It kicks in, and I get repeating chunks like
051019 010450 Updating /home/nutch/nutch/trunk/crawl.test/db
051019 010450 Updating for /home/nutch/nutch/trunk/crawl.test/segments/20051019010449
051019 010450 Finishing update
051019 010450 Update finished
051019 010450 FetchListTool started
051019 010450 Overall processing: Sorted 0 entries in 0.0 seconds.
051019 010450 Overall processing: Sorted NaN entries/second
051019 010450 FetchListTool completed
051019 010450 logging at INFO
This goes on for ages, but I see only two Nutch hits in my
access log: one for robots.txt and one for my front page.
Nothing else.
The "crawl" finishes, and then when I do a search I can only
get hits for the front page. When I do the search
via lynx, I get a momentary
Bad partial reference! Stripping lead dots.
I can't imagine this is really the problem, but pretty
much all my links are relative. I mean, Nutch has to
be able to follow relative links, right?
Ideas?
Thanks,
Earl