>> When I performed a whole-web crawl test according to the tutorial, I got
>> Number of pages: 36668
>> Number of links: 46721.
>> Then how many have you got?

>I only played around with Nutch some months ago, and I got as many as
>500,000 pages and several million links within a few days over my home DSL
>line. Your crawler might be stuck somewhere ...?

Number of pages is probably the number of Page instances, i.e. the number
of successfully retrieved web pages.
Number of links is probably the total number of Link instances in the WebDB,
including links to pages that were not retrieved and multiple links to the
same Page instance.

Different pages may have different links (with different anchor text and
even different URLs) to the same Page instance; page equality is defined by
the MD5 hash (a checksum of all bytes in the plain HTTP response).
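
As a rough illustration of the MD5 point, here is a minimal sketch in Python.
It is not Nutch's actual code: the URLs, the response bytes and the
content_id helper are made up for the example. It only shows how several
URLs collapse into one page when identity is the hash of the response bytes.

```python
import hashlib


def content_id(body: bytes) -> str:
    """Return the MD5 hex digest of the raw response bytes (hypothetical helper)."""
    return hashlib.md5(body).hexdigest()


# Hypothetical fetch results: URL -> raw response bytes.
fetched = {
    "http://example.com/": b"<html>same</html>",
    "http://www.example.com/index.html": b"<html>same</html>",
    "http://example.com/other": b"<html>different</html>",
}

# Group URLs by content hash; each distinct hash counts as one page.
pages = {}
for url, body in fetched.items():
    pages.setdefault(content_id(body), []).append(url)

print(len(fetched), "URLs,", len(pages), "distinct pages")
```

Under this assumption the first two URLs count as a single page, while every
link pointing at either of them would still be counted as a separate link.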

A single page may have hundreds of links, including links to foreign hosts.

Nutch 0.7.1



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
