The only link on http://shopthar.com/ to the domain shopthar.com is a link to http://shopthar.com/. So a crawl starting from that page that only visits pages in shopthar.com will only find that one page.

% wget -q -O - http://shopthar.com/ | grep shopthar.com
  <tr><td colspan=2>Welcome to shopthar.com</td></td></tr>
<a href=http://shopthar.com/>shopthar.com</a> |
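The grep above can be widened to list every outlink the crawler could follow. A minimal sketch of that check, run here against the quoted HTML rather than the live page (a live check would pipe `wget -q -O - http://shopthar.com/` instead), and assuming a grep that supports `-o`:

```shell
# Extract every href from the sample HTML quoted above; on a live site,
# replace the printf with: wget -q -O - http://shopthar.com/
printf '%s\n' '<a href=http://shopthar.com/>shopthar.com</a> |' \
  | grep -o 'href=[^ >]*'
# prints: href=http://shopthar.com/
```

Only the single self-referencing link comes back, which is consistent with the crawl stopping at the front page.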

Doug

Earl Cahill wrote:
I am trying to do a crawl on trunk of one of my
sites, and it isn't working.  I created a file,
urls, that contains just the site

http://shopthar.com/

In my conf/crawl-urlfilter.txt I have

+^http://shopthar.com/
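(An aside, and an assumption rather than something from your config: the example line in the stock crawl-urlfilter.txt that ships with Nutch uses a subdomain-tolerant pattern, so if the site ever links to www.shopthar.com a filter modeled on that example would keep those URLs too:)

```
# Modeled on the MY.DOMAIN.NAME example in Nutch's default
# crawl-urlfilter.txt; also matches www.shopthar.com, etc.
+^http://([a-z0-9]*\.)*shopthar.com/
```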

I then do

bin/nutch crawl urls -dir crawl.test -depth 100
-threads 20

It kicks in and I get repeating chunks like

051019 010450 Updating
/home/nutch/nutch/trunk/crawl.test/db
051019 010450 Updating for
/home/nutch/nutch/trunk/crawl.test/segments/20051019010449
051019 010450 Finishing update
051019 010450 Update finished
051019 010450 FetchListTool started
051019 010450 Overall processing: Sorted 0 entries in
0.0 seconds.
051019 010450 Overall processing: Sorted NaN
entries/second
051019 010450 FetchListTool completed
051019 010450 logging at INFO

This repeats for ages, but I see only two nutch hits
in my access log: one for my robots.txt and one for
my front page.  Nothing else.

The "crawl" finishes, then I do a search and can
only get hits for the front page.  When I do the search
via lynx, I get a momentary

Bad partial reference!  Stripping lead dots.

I can't imagine this is really the problem, but pretty
well all my links are relative.  I mean nutch has to
be able to follow relative links, right?

Ideas?

Thanks,
Earl


