I now understand the fix for the 'deeply same pages' problem reported in JIRA (for example: http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/1121kisputekep.htm).
I had trouble understanding it at first, because the fix was made against an older version of Nutch and the reporter gave only the changed lines and their line numbers.
In this case the fix is applied to the latest Nutch source (UpdateDatabaseTool.java). Please commit it to SVN (I think it works; please see the main changes at lines 227-230):
The problem with this solution is that it does not scale well. It makes a random access to the web db for each link encountered. Normally all access to the web db is batched in order to avoid such random accesses.
I think this is a reasonable option for Nutch to support: skip links from pages that have the same MD5 as another page that's already been seen. These links are already combined in the linkdb, since it represents links as <source_MD5, dest_url> pairs, but each duplicate MD5 does get a new <url,MD5> entry in the page db, and many of these might be ignored.
Supporting this efficiently would require modifications of the WebDBWriter, which is already very complex. Perhaps this should await the MapReduce-based reimplementation of the pagedb.
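As a rough illustration of the option itself (not of the WebDBWriter change), here is a minimal in-memory sketch. The class SeenContentFilter and the method linksToFollow are made-up names, and the HashSet stands in for whatever batched lookup a real implementation would need; nothing here is existing Nutch API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of Nutch: drops outlinks of already-seen content. */
final class SeenContentFilter {

    // MD5 digests of page content processed so far (assumption: they fit in memory).
    private final Set<String> seenMd5 = new HashSet<>();

    /**
     * Returns the outlinks to follow for a page, or an empty list when a page
     * with the same content MD5 has already been processed.
     */
    List<String> linksToFollow(String contentMd5, List<String> outlinks) {
        // Set.add returns false when the digest was already present.
        if (!seenMd5.add(contentMd5)) {
            return Collections.emptyList();
        }
        return outlinks;
    }

    public static void main(String[] args) {
        SeenContentFilter filter = new SeenContentFilter();
        List<String> links = Arrays.asList("http://example.com/next");
        // First page with this digest: its links are kept.
        System.out.println(filter.linksToFollow("aaa", links));
        // Second page with the same digest: its links are skipped.
        System.out.println(filter.linksToFollow("aaa", links));
    }
}
```

The per-page set lookup here is exactly the kind of random access that does not scale against the web db, which is why a real version would have to be batched.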
Another thing to try would be to write a tool that iterates through the pagedb by md5 and deletes pages that are duplicates. That would be scalable.
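A minimal sketch of such a pass, assuming pages can be visited in MD5 order. PageEntry and DuplicatePageSweep are hypothetical stand-ins rather than Nutch classes, and the in-memory sort only simulates iterating the pagedb by MD5.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch, not Nutch code: finds duplicate pages in one MD5-ordered pass. */
final class DuplicatePageSweep {

    /** Stand-in for a page db entry: a URL plus the MD5 of its content. */
    static final class PageEntry {
        final String url;
        final String md5;

        PageEntry(String url, String md5) {
            this.url = url;
            this.md5 = md5;
        }
    }

    /** Returns the URLs to delete: all but the first page of each run of equal MD5s. */
    static List<String> findDuplicates(List<PageEntry> pages) {
        List<PageEntry> byMd5 = new ArrayList<>(pages);
        // Order by MD5, then URL, so the surviving page of each run is deterministic.
        byMd5.sort(Comparator.comparing((PageEntry p) -> p.md5)
                .thenComparing(p -> p.url));

        List<String> toDelete = new ArrayList<>();
        String previousMd5 = null;
        for (PageEntry page : byMd5) {
            if (page.md5.equals(previousMd5)) {
                toDelete.add(page.url);   // same content as the page kept just before
            } else {
                previousMd5 = page.md5;   // first page with this content: keep it
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<PageEntry> pages = Arrays.asList(
                new PageEntry("http://example.com/a", "aaa"),
                new PageEntry("http://example.com/a/m/a", "aaa"),
                new PageEntry("http://example.com/b", "bbb"));
        System.out.println(findDuplicates(pages));
    }
}
```

Because duplicates are adjacent in MD5 order, the pass only has to remember the previous digest, so it is a single sequential scan with no random lookups.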
Cheers,
Doug
