[
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949883#comment-14949883
]
Joseph K. Bradley commented on SPARK-10574:
-------------------------------------------
Hi [~simeons] Thanks for your thoughts, and sorry about the slow response!
1. It would be great to support any data type, though I think it would be OK to
restrict ourselves to a subset (e.g., Catalyst types) if necessary.
2. I agree it will be hard to support arbitrary types if we go beyond native
hashing. But I haven't really heard of anyone using HashingTF for something
which MurmurHash3 would not handle. I figure we can support a reasonable set
of types for now and worry about supporting fancy types later through
pluggability (your 3rd comment).
3. I agree we should do this in the long-term. Defining a trait, and providing
a few default implementations sounds good. We could do this right away, or a
little farther in the future (in which case we could still provide MurmurHash3
+ native hashing via a String option). The main reason to delay adding a trait
would be the dev cost of making sure it works with Scala and Java; we would
need to add little unit tests. For either option, we can start by providing
MurmurHash3 (as a default) and the old native hashing method. Note that this
means changing the default behavior, but based on your tests from your gist,
that seems pretty reasonable (and users could choose to use the old hashing
method as an option).
One question about your tests: Do you have a sense of how fast each hashing
method is? Is one significantly slower?
Assuming MurmurHash3 is reasonably fast, I'd vote for an initial PR which
provides setters using Strings (no trait), with a follow-up PR to add a trait:
{code}
def setHash(method: String)
{code}
Does that sound good?
> HashingTF should use MurmurHash3
> --------------------------------
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.5.0
> Reporter: Simeon Simeonov
> Priority: Critical
> Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are
> two significant problems with this.
> First, per the [Scala
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for
> {{hashCode}}, the implementation is platform specific. This means that
> feature vectors created on one platform may be different than vectors created
> on another platform. This can create significant problems when a model
> trained offline is used in another environment for online prediction. The
> problem is made harder by the fact that following a hashing transform
> features lose human-tractable meaning and a problem such as this may be
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings,
> exhibiting [200-500% higher collision
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for
> example,
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
> which is also included in the standard Scala libraries and is the hashing
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings
> the hashing function choice will not be a big problem but why have an
> implementation in MLlib with this limitation when there is a better
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that
> this is a good change, I can prepare a PR.
> Note that changing the hash function would mean that models saved with a
> previous version would have to be re-trained. This introduces a problem
> that's orthogonal to breaking changes in APIs: breaking changes related to
> artifacts, e.g., a saved model, produced by a previous version. Is there a
> policy or best practice currently in effect about this? If not, perhaps we
> should come up with a few simple rules about how we communicate these in
> release notes, etc.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]