Simeon Simeonov created SPARK-10574:
---------------------------------------

             Summary: HashingTF should use MurmurHash3
                 Key: SPARK-10574
                 URL: https://issues.apache.org/jira/browse/SPARK-10574
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.5.0
            Reporter: Simeon Simeonov
            Priority: Critical


{{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
two significant problems with this.

First, per the [Scala 
documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
{{hashCode}}, the implementation is platform specific. This means that feature 
vectors created on one platform may differ from vectors created on another 
platform, which can cause significant problems when a model trained offline is 
used in another environment for online prediction. The problem is made harder 
by the fact that, after a hashing transform, features lose human-tractable 
meaning, so a mismatch like this may be extremely difficult to track down.
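
To make the first problem concrete, here is a minimal sketch of how a hashed 
term can become a feature index. It assumes the index is simply the hash taken 
modulo the number of features. The names {{HashingSketch}}, {{nonNegativeMod}}, 
{{nativeIndex}} and {{murmurIndex}} are hypothetical and are not MLlib's actual 
code.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical illustration only; not the MLlib implementation.
object HashingSketch {
  // Map any Int (including negative hashes) into the range [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Index derived from Scala's native ## hash, whose value depends on
  // the platform's hashCode implementation.
  def nativeIndex(term: Any, numFeatures: Int): Int =
    nonNegativeMod(term.##, numFeatures)

  // Index derived from MurmurHash3 over the characters of the string,
  // an algorithm fully specified by the Scala standard library.
  def murmurIndex(term: String, numFeatures: Int): Int =
    nonNegativeMod(MurmurHash3.stringHash(term), numFeatures)
}
```

If {{term.##}} returns a different value in the training and the serving 
environment, {{nativeIndex}} puts the same term into a different vector slot, 
and nothing in the resulting vector reveals that this happened.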

Second, the native Scala hashing function performs badly on longer strings, 
exhibiting [200-500% higher collision 
rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
example, 
[MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$], 
which is also included in the standard Scala library and is the hashing choice 
of fast learners such as Vowpal Wabbit. If Spark users apply {{HashingTF}} only 
to very short, dictionary-like strings, the choice of hashing function will not 
be a big problem. But why keep an implementation in MLlib with this limitation 
when a better one is readily available in the standard Scala library?
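
For anyone who wants to check collision behavior on their own data, a rough 
sketch follows. It is not the benchmark from the gist linked above. The object 
{{CollisionSketch}}, its methods, the random 50-character test strings and the 
bucket count are all assumptions made for illustration, and the numbers it 
prints will depend on the input.

```scala
import scala.util.Random
import scala.util.hashing.MurmurHash3

// Hypothetical illustration only; not the benchmark from the linked gist.
object CollisionSketch {
  // Map any Int (including negative hashes) into the range [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Fraction of distinct terms that share a bucket with an earlier term.
  def collisionRate(terms: Seq[String], numFeatures: Int)(hash: String => Int): Double = {
    val distinctTerms = terms.distinct
    if (distinctTerms.isEmpty) 0.0
    else {
      val buckets = distinctTerms.map(t => nonNegativeMod(hash(t), numFeatures)).toSet
      1.0 - buckets.size.toDouble / distinctTerms.size
    }
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    // Synthetic "long" terms: 50 random lowercase letters each.
    val terms = Seq.fill(100000)(Seq.fill(50)(('a' + rng.nextInt(26)).toChar).mkString)
    val numFeatures = 1 << 20

    val nativeRate = collisionRate(terms, numFeatures)((s: String) => s.##)
    val murmurRate = collisionRate(terms, numFeatures)((s: String) => MurmurHash3.stringHash(s))

    println(s"native ## collision rate:    $nativeRate")
    println(s"MurmurHash3 collision rate:  $murmurRate")
  }
}
```

Random strings are a weak stand-in for real features, so replacing {{terms}} 
with actual tokens from a workload gives a more meaningful comparison.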

Switching to MurmurHash3 solves both problems. If there is agreement that this 
is a good change, I can prepare a PR. 

Note that changing the hash function would mean that models saved with a 
previous version would have to be re-trained. This introduces a problem that's 
orthogonal to breaking changes in APIs: breaking changes related to artifacts, 
e.g., a saved model, produced by a previous version. Is there a policy or best 
practice currently in effect about this? If not, perhaps we should come up with 
a few simple rules about how we communicate these in release notes, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
