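Yes, using the MD5 of your URLs as row keys will effectively randomize
the row order, so pages from the same domain will no longer be stored
near each other. Do you need to access pages by the MD5 of their URL?
If so, it's unlikely that you also need to access them by domain.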
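
To illustrate the point, here is a minimal Python sketch. The sample
URLs and the helper names (md5_key, count_domain_runs) are made up for
illustration and are not part of HBase. It sorts the same URLs once by
the plain URL and once by MD5 digest, then counts how many contiguous
runs of same-host rows each ordering produces.

```python
import hashlib


def md5_key(url: str) -> str:
    """Hex MD5 digest of the URL, used here as a stand-in row key."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def count_domain_runs(urls):
    """Count runs of consecutive URLs that share a host, in the given order."""
    runs = 0
    previous = None
    for url in urls:
        host = url.split("/", 1)[0]  # everything before the first slash
        if host != previous:
            runs += 1
            previous = host
    return runs


# Hypothetical sample: 100 pages from each of two hosts.
urls = [
    f"{host}/page{i}.html"
    for host in ("maps.google.com", "hadoop.apache.org")
    for i in range(100)
]

by_url = sorted(urls)               # lexicographic order, like sorted row keys
by_md5 = sorted(urls, key=md5_key)  # order the rows would take under MD5 keys

print(count_domain_runs(by_url))
print(count_domain_runs(by_md5))
```

Sorted by URL, each host forms one contiguous run. Sorted by digest,
the two hosts are interleaved, so a scan over one domain would have to
touch rows spread across the whole key range.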
-Bryan
On Apr 28, 2008, at 4:01 AM, Goel, Ankur wrote:
Hi folks,
I am using an HBase table to store my crawled data, with the MD5
signature of the canonicalized URL as the row key. The Bigtable paper
suggests choosing row keys so that URLs from the same domain are
stored close to each other and domain analysis can be carried out
efficiently.
For example, the page maps.google.com/index.html would be stored under
the row key com.google.maps/index.html.
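
The key transformation described above can be sketched in Python as
follows. This is only an illustration under stated assumptions: the
function name row_key is hypothetical, and the sketch ignores ports
and query strings, which a real canonicalizer would have to handle.

```python
from urllib.parse import urlsplit


def row_key(url: str) -> str:
    """Build a row key from reversed host labels followed by the path."""
    # urlsplit only recognizes the host when a scheme is present,
    # so prepend one for bare inputs like "maps.google.com/index.html".
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    # Assumption: an empty path is treated as "/"; port and query are dropped.
    return reversed_host + (parts.path or "/")


print(row_key("maps.google.com/index.html"))
```

Because rows are kept in lexicographic order of their keys, keys built
this way place all pages under com.google next to each other, with
com.google.maps as a contiguous sub-range.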
My question is: will using the MD5 signature of the canonicalized URL
as the row key hurt data locality for URLs from the same domain?
Thanks
-Ankur