Implement a unique hash function for Task ID
--------------------------------------------

                 Key: DROIDS-53
                 URL: https://issues.apache.org/jira/browse/DROIDS-53
             Project: Droids
          Issue Type: New Feature
          Components: core
            Reporter: Mingfai Ma


For the SimpleTaskQueueWithHistory.previous, CrawlingWorker "// TODO -- make 
the hashvalue for Outlink...", 
after some research, it seems to smaller hash that we could get easy is the 
hashCode() of URL String. It takes 16 bytes only. Java native hash code should 
be unique, otherwise, any HashTable or HashMap will be wrong.

if in case we need to represent it in String, we may compress the hashCode in 
base62, such as with the following function:

{code}
  public static final char[] baseChars = 
"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".toCharArray();
  public static String base62(int hashCode) {
    StringBuilder out = new StringBuilder(hashCode == 0 ? "0" : "");
    while(hashCode!=0){
      out.append(baseChars[hashCode % baseChars.size()])
      hashCode = hashCode / baseChars.size()
    }
    return out.reverse().toString();
  }
{code}

e.g.
{code}
assertEquals("c55Ow", base62("http://incubator.apache.org/droids/".hashCode()) 
);
{code}

for 5 characters, it's 48 bytes.

It is important to get a short, unique hash function for URL. For any crawler 
that do not want duplication, we unavoidably have to store a URL hash in memory 
or data grid. (unless it is stored on disk or in a database at the cost of 
higher look up time) With a hashCode that takes 16bytes, it's around 15M heap 
size for 1M URLs. 

This task is related to DROIDS-52

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to