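It will. MD5 output is effectively random, so URLs from the same domain end up scattered across the key space. You could make a two-part key where the first half is an MD5 of the domain portion of the URL and the second half is the MD5 of the URL path portion. Your keys would be wider, but URLs from the same domain would sort together.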
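A minimal sketch of such a two-part key, in Python for brevity. The function name `two_part_row_key` is hypothetical, and the sketch assumes each URL carries a scheme (for example `http://`) so the host can be parsed out; it is an illustration of the idea, not HBase client code.

```python
import hashlib
from urllib.parse import urlsplit


def two_part_row_key(url: str) -> bytes:
    """Return a 32-byte key: md5(host) followed by md5(path plus query).

    Hypothetical helper. Assumes `url` includes a scheme so that
    urlsplit() can identify the host.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # First 16 bytes: digest of the host, shared by every page on that host.
    host_digest = hashlib.md5(host.encode("utf-8")).digest()
    # Last 16 bytes: digest of the path, which distinguishes pages.
    path_digest = hashlib.md5(path.encode("utf-8")).digest()
    return host_digest + path_digest


if __name__ == "__main__":
    for u in ("http://maps.google.com/index.html",
              "http://maps.google.com/about.html"):
        print(two_part_row_key(u).hex(), u)
```

Both keys printed above share their first 16 bytes, so the rows sort next to each other. Note that this sketch hashes the full host name, so maps.google.com and mail.google.com would still land in different places; to group by registered domain, hash that portion instead.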
St.Ack

On Mon, Apr 28, 2008 at 4:01 AM, Goel, Ankur <[EMAIL PROTECTED]> wrote:
> Hi folks,
> I am using HBase table to store my crawled data and using the
> MD5 signature of the canonicalized URL as a row key in HBase. The
> bigtable paper suggest using keys appropriately so that URLs from the
> same domain are stored close to each other and domain analysis can be
> carried out efficiently.
> So for e.g. storing page maps.google.com/index.html should use row-key
> com.google.maps/index.html.
>
> My question is will using MD5 signature of canonicalized URL hurt data
> locality of URLs from same domains ?
>
> Thanks
> -Ankur
