[
https://issues.apache.org/jira/browse/MAHOUT-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910299#comment-13910299
]
Johannes Schulte commented on MAHOUT-1385:
------------------------------------------
Hi Manjo.
My point was that these classes are (hopefully) meant to be a performance
improvement, which would be the case if the String's hash code could be reused
(because a Java String object caches its own hash code).
Using the byte[] hash code is evil because it depends on the reference, not on
the array's contents.
Using another library / hashing strategy for the values inside the byte array
is nonsense because that is exactly what we are trying to cache, if I
understand the will of the creator correctly.
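
To illustrate the byte[] point, here is a minimal sketch using only the JDK. The
class and method names are made up for this example and are not part of Mahout.
It shows that a map keyed by byte[] misses for an equal-content array, while a
String key hits:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not Mahout code.
public class ByteArrayKeyDemo {

    // Looks up a cache keyed by byte[] with a second array of equal content.
    static boolean hitWithByteArrayKey(String term) {
        Map<byte[], Integer> cache = new HashMap<>();
        cache.put(term.getBytes(StandardCharsets.UTF_8), 7);
        // getBytes returns a fresh array, so this is a different reference.
        byte[] sameContent = term.getBytes(StandardCharsets.UTF_8);
        // byte[] inherits identity-based equals/hashCode from Object.
        return cache.containsKey(sameContent);
    }

    // Same lookup with String keys, which compare by content.
    static boolean hitWithStringKey(String term) {
        Map<String, Integer> cache = new HashMap<>();
        cache.put(term, 7);
        String sameContent = new String(term.toCharArray());
        return cache.containsKey(sameContent);
    }

    // A content-based hash agrees for equal arrays, but it has to walk
    // every byte on every call, which is the work a cache should avoid.
    static boolean contentHashesMatch(String term) {
        byte[] a = term.getBytes(StandardCharsets.UTF_8);
        byte[] b = term.getBytes(StandardCharsets.UTF_8);
        return Arrays.hashCode(a) == Arrays.hashCode(b);
    }

    public static void main(String[] args) {
        System.out.println("byte[] key hit: " + hitWithByteArrayKey("mahout"));
        System.out.println("String key hit: " + hitWithStringKey("mahout"));
        System.out.println("content hashes match: " + contentHashesMatch("mahout"));
    }
}
```

Arrays.hashCode fixes the lookup but recomputes the hash from the bytes each
time, so it does not recover the benefit of the String's cached hash code.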
The more I think about it: was this ever correct? Using the String hash code
as the lookup key for the murmurHash-based location? There will be different
collisions, leading to results other than those without caching, which should
be avoided.
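
A rough sketch of that concern, under stated assumptions: the cache below is a
made-up class keyed by String.hashCode(), and contentHash is a 32-bit FNV-1a
used only as a stand-in for the real hash (it is not MurmurHash and not
Mahout's code). "Aa" and "BB" share the same String hash code, so the second
term gets the first term's cached value:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not Mahout code.
public class HashKeyedCacheDemo {

    // Stand-in content hash (32-bit FNV-1a); assumed here in place of MurmurHash.
    static int contentHash(String term) {
        int h = 0x811C9DC5;
        for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    // Cache keyed by the String's hash code rather than by the String itself.
    private final Map<Integer, Integer> cache = new HashMap<>();

    int cachedHash(String term) {
        return cache.computeIfAbsent(term.hashCode(), k -> contentHash(term));
    }

    // True if caching returns something different from the uncached hash.
    static boolean cacheChangesResult() {
        HashKeyedCacheDemo demo = new HashKeyedCacheDemo();
        demo.cachedHash("Aa");                       // fills the slot for the shared key
        return demo.cachedHash("BB") != contentHash("BB");
    }

    public static void main(String[] args) {
        System.out.println("same String hash: " + ("Aa".hashCode() == "BB".hashCode()));
        System.out.println("cache changes result: " + cacheChangesResult());
    }
}
```

Keying the map by the String itself would avoid this, since HashMap falls back
to equals() when two keys share a hash code.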
> Caching Encoders don't cache
> ----------------------------
>
> Key: MAHOUT-1385
> URL: https://issues.apache.org/jira/browse/MAHOUT-1385
> Project: Mahout
> Issue Type: Bug
> Affects Versions: 0.8
> Reporter: Johannes Schulte
> Priority: Minor
> Attachments: MAHOUT-1385-test.patch, MAHOUT-1385.patch
>
>
> The Caching... line of encoders contains code for caching the hash codes of
> terms added to the vector. However, the method "hashForProbe" inside these
> classes is never called, as its signature has String for the parameter
> original form (instead of byte[] like other encoders).
> Changing this to byte[], however, would lose the Java String's internal
> caching of its hash code, which is used as a key in the cache map, triggering
> another hash code calculation.
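
The "never called" behaviour described above can be reproduced with a small
sketch. The classes and the one-argument signature are simplified assumptions
for illustration, not the actual Mahout encoders; only the method name
hashForProbe comes from the issue text:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not Mahout code.
public class OverloadDemo {

    static class BaseEncoder {
        // The base class only ever calls the byte[] variant.
        protected int hashForProbe(byte[] originalForm) {
            return 1;
        }

        int encode(String term) {
            return hashForProbe(term.getBytes(StandardCharsets.UTF_8));
        }
    }

    static class CachingEncoder extends BaseEncoder {
        boolean cacheConsulted = false;

        // Different parameter type: this is an overload, not an override,
        // so encode() never dispatches to it.
        protected int hashForProbe(String originalForm) {
            cacheConsulted = true;
            return 2;
        }
    }

    // Runs encode() on the subclass and reports whether the caching method ran.
    static boolean cachingMethodReached() {
        CachingEncoder encoder = new CachingEncoder();
        encoder.encode("term");
        return encoder.cacheConsulted;
    }

    public static void main(String[] args) {
        System.out.println("caching method reached: " + cachingMethodReached());
    }
}
```

Annotating the String variant with @Override would turn this silent mismatch
into a compile error.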