[ 
https://issues.apache.org/jira/browse/MAHOUT-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910299#comment-13910299
 ] 

Johannes Schulte commented on MAHOUT-1385:
------------------------------------------

Hi Manjo.

My point was that these classes are (hopefully) meant to be a performance 
improvement, which would be the case if the string's hash code could be reused 
(because the Java String object caches its own hash code).

Using the byte[] hash code is evil because it depends on the reference, not 
on the contents of the array.
Using another library / hashing strategy for the values inside the byte array 
is nonsense, because computing that hash is exactly what we are trying to 
cache, if I understand the intent of the creator correctly.
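
To make the byte[] problem concrete, here is a minimal sketch in plain Java 
(not Mahout code; the class name and the term "mahout" are made up for 
illustration). It shows that two arrays with equal contents are still 
different map keys, and that a content-based hash such as Arrays.hashCode 
has to walk the whole array on every call:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ByteArrayKeyDemo {
    // Hypothetical helper: encodes a term the way a byte[]-based encoder might receive it.
    static byte[] bytesOf(String term) {
        return term.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] a = bytesOf("mahout");
        byte[] b = bytesOf("mahout"); // same contents, different object

        Map<byte[], Integer> cache = new HashMap<>();
        cache.put(a, 42);

        // Arrays inherit equals/hashCode from Object, so lookups are by reference.
        System.out.println(Arrays.equals(a, b)); // true: contents match
        System.out.println(cache.get(a));        // 42
        System.out.println(cache.get(b));        // null: b is a different reference

        // A content-based hash works as a key, but it is recomputed on every call.
        System.out.println(Arrays.hashCode(a) == Arrays.hashCode(b)); // true
    }
}
```

So a cache keyed on byte[] either misses for every freshly encoded term, or 
pays for a full pass over the bytes just to find the cache entry.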

The more I think about it: was this ever correct? Using the string hash code 
as the lookup key for the murmurHash-based location? Two terms with the same 
String hash code would share one cache entry, so there will be different 
collisions, leading to other results than with no caching, which should be 
avoided?
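
A small sketch of what I mean, again in plain Java and not the actual 
Mahout classes. The location function below is a simple stand-in hash, not 
Mahout's murmur hash, and NUM_FEATURES, location and cachedLocation are 
hypothetical names. The strings "Aa" and "BB" are a well-known pair with 
the same String hash code:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    static final int NUM_FEATURES = 1000; // assumed vector size for the example

    // Cache keyed by String.hashCode(), as questioned above.
    private final Map<Integer, Integer> cache = new HashMap<>();

    // Stand-in for the murmur-based location (NOT Mahout's implementation).
    static int location(String term) {
        int h = 5381;
        for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
            h = h * 33 + b;
        }
        return Math.floorMod(h, NUM_FEATURES);
    }

    // Looks up the location by the term's String hash code, computing it on a miss.
    int cachedLocation(String term) {
        return cache.computeIfAbsent(term.hashCode(), k -> location(term));
    }

    public static void main(String[] args) {
        CollisionDemo demo = new CollisionDemo();
        System.out.println("Aa".hashCode() == "BB".hashCode());          // true
        System.out.println(location("Aa") + " vs " + location("BB"));    // 151 vs 153
        demo.cachedLocation("Aa");                                        // fills the shared entry
        System.out.println(demo.cachedLocation("BB") == location("BB")); // false
    }
}
```

With the cache, "BB" ends up at the location of "Aa", although the 
uncached hash would have kept the two apart. Keying the map on the String 
itself would avoid this, since HashMap falls back to equals() for keys 
with equal hash codes.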

> Caching Encoders don't cache
> ----------------------------
>
>                 Key: MAHOUT-1385
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1385
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Johannes Schulte
>            Priority: Minor
>         Attachments: MAHOUT-1385-test.patch, MAHOUT-1385.patch
>
>
> The Caching... line of encoders contains code for caching the hash codes of 
> terms added to the vector. However, the method "hashForProbe" inside these 
> classes is never called, as the signature has String for the parameter 
> original form (instead of byte[] like other encoders).
> Changing this to byte[], however, would lose the Java String's internal 
> caching of its hash code, which is used as a key in the cache map, 
> triggering another hash code calculation.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
