>> When I performed a whole-web crawl test according to the tutorial, I got
>> Number of pages: 36668
>> Number of links: 46721.
>> Then how many have you got?
>I only played around with Nutch some month ago, and I got as many as 500.000
>pages and several million links within a few days over my home DSL line. Your
>crawler might be stuck somewhere ...?

Number of pages is probably the number of Page instances, i.e. the number of
successfully retrieved web pages.

Number of links is probably the total number of Link instances in the WebDB,
including links to pages that have not been retrieved and multiple links to
the same Page instance. Different pages may link to the same Page instance
with different anchor text and even different URLs; page equality is defined
by the MD5 hash (a checksum of all bytes in the plain HTTP response). A single
page may have hundreds of links, including links to foreign hosts.

Nutch 0.7.1
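To illustrate why the two counters can differ so much, here is a rough sketch
in Python. It is not Nutch code: the names Link, content_id and
count_pages_and_links are made up for this example, and it only assumes what
is described above, namely that pages are identified by the MD5 of the
response bytes and that every link is counted separately.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical stand-in for a link record; not the Nutch Link class."""
    from_url: str
    to_url: str
    anchor: str


def content_id(body: bytes) -> str:
    """Identify a page by the MD5 checksum of its raw response bytes."""
    return hashlib.md5(body).hexdigest()


def count_pages_and_links(fetched, links):
    """Return (distinct pages by content hash, total link records)."""
    pages = {content_id(body) for body in fetched.values()}
    return len(pages), len(links)


# Two URLs return identical bytes, so they collapse into one page.
fetched = {
    "http://example.com/": b"<html>home</html>",
    "http://www.example.com/": b"<html>home</html>",
    "http://example.com/about": b"<html>about</html>",
}

# Every link is counted, even duplicates with different anchor text
# and links to hosts that were never fetched.
links = [
    Link("http://example.com/", "http://example.com/about", "About"),
    Link("http://example.com/", "http://example.com/about", "About us"),
    Link("http://example.com/about", "http://www.example.com/", "Home"),
    Link("http://example.com/about", "http://other.example.org/", "Partner"),
]

num_pages, num_links = count_pages_and_links(fetched, links)
print("Number of pages:", num_pages)
print("Number of links:", num_links)
```

With this toy data, three fetched URLs give only two pages, while all four
links are counted, one of them pointing to a page that was never retrieved.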
