Implement a unique hash function for Task ID
--------------------------------------------
Key: DROIDS-53
URL: https://issues.apache.org/jira/browse/DROIDS-53
Project: Droids
Issue Type: New Feature
Components: core
Reporter: Mingfai Ma
For the SimpleTaskQueueWithHistory.previous, CrawlingWorker "// TODO -- make
the hashvalue for Outlink...",
after some research, it seems to smaller hash that we could get easy is the
hashCode() of URL String. It takes 16 bytes only. Java native hash code should
be unique, otherwise, any HashTable or HashMap will be wrong.
if in case we need to represent it in String, we may compress the hashCode in
base62, such as with the following function:
{code}
public static final char[] baseChars =
"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".toCharArray();
public static String base62(int hashCode) {
StringBuilder out = new StringBuilder(hashCode == 0 ? "0" : "");
while(hashCode!=0){
out.append(baseChars[hashCode % baseChars.size()])
hashCode = hashCode / baseChars.size()
}
return out.reverse().toString();
}
{code}
e.g.
{code}
assertEquals("c55Ow", base62("http://incubator.apache.org/droids/".hashCode())
);
{code}
for 5 characters, it's 48 bytes.
It is important to get a short, unique hash function for URL. For any crawler
that do not want duplication, we unavoidably have to store a URL hash in memory
or data grid. (unless it is stored on disk or in a database at the cost of
higher look up time) With a hashCode that takes 16bytes, it's around 15M heap
size for 1M URLs.
This task is related to DROIDS-52
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.