[
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Simeon Simeonov updated SPARK-10574:
------------------------------------
Description:
{{HashingTF}} uses Scala's native hashing implementation, {{##}}. There are
two significant problems with this.

First, per the [Scala
documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for
{{hashCode}}, the implementation is platform-specific. This means that feature
vectors created on one platform may differ from vectors created on another
platform. This can create significant problems when a model trained offline is
used in another environment for online prediction. The problem is made worse by
the fact that, after a hashing transform, features lose their human-tractable
meaning, so a problem such as this can be extremely difficult to track down.
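To make the issue concrete, the term-to-index mapping is roughly the following
(a simplified sketch, not the actual MLlib source; the {{numFeatures}} modulo is
the standard hashing-trick step):

{code:scala}
// Simplified sketch of the current behavior: the term is hashed with
// Scala's native ## and folded into the feature space with a
// non-negative modulo. Because ## delegates to hashCode, which the
// Scala docs cited above do not fix across platforms, the resulting
// index can differ between environments.
def indexOf(term: Any, numFeatures: Int): Int = {
  val h = term.##
  ((h % numFeatures) + numFeatures) % numFeatures
}

// e.g. indexOf("some longer feature token", 1 << 20)
{code}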
Second, the native Scala hashing function performs badly on longer strings,
exhibiting [200-500% higher collision
rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for
example,
[MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$],
which is also included in the standard Scala library and is the hashing choice
of fast learners such as Vowpal Wabbit, scikit-learn, and others. If Spark users
apply {{HashingTF}} only to very short, dictionary-like strings, the choice of
hashing function is not a big problem, but why keep an implementation with this
limitation in MLlib when a better one is readily available in the standard
Scala library?
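As a rough, self-contained illustration of the kind of comparison in the gist
(the random tokens and bucket count below are made up for the example and are
not the gist's data), one can hash a batch of longer strings into a fixed
number of buckets with both functions and count collisions:

{code:scala}
import scala.util.hashing.MurmurHash3

// Illustrative collision count only, not the gist's methodology:
// hash distinct random 40-character strings into 2^20 buckets with
// Scala's native ## and with MurmurHash3, then count how many strings
// land in an already-occupied bucket.
object CollisionSketch {
  def nonNegativeMod(x: Int, mod: Int): Int = ((x % mod) + mod) % mod

  def collisions(terms: Seq[String], numBuckets: Int, hash: String => Int): Int =
    terms.groupBy(t => nonNegativeMod(hash(t), numBuckets))
         .values.map(_.size - 1).sum

  def main(args: Array[String]): Unit = {
    val rng = new scala.util.Random(42)
    val terms = (1 to 100000).map(_ => rng.alphanumeric.take(40).mkString).distinct
    val numBuckets = 1 << 20  // the default HashingTF feature-space size

    println("native ##:   " + collisions(terms, numBuckets, _.##))
    println("MurmurHash3: " + collisions(terms, numBuckets,
      s => MurmurHash3.stringHash(s, MurmurHash3.stringSeed)))
  }
}
{code}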
Switching to MurmurHash3 solves both problems. If there is agreement that this
is a good change, I can prepare a PR.
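For concreteness, the switch could look roughly like this (a sketch only; the
signature and the seed choice are illustrative, not a finalized API, and
non-string terms are left on the old path here):

{code:scala}
import scala.util.hashing.MurmurHash3

// Sketch of a MurmurHash3-based mapping: for a fixed seed, MurmurHash3
// produces the same hash on every platform, so indices computed during
// offline training and online prediction agree.
def indexOf(term: Any, numFeatures: Int): Int = {
  val h = term match {
    case s: String => MurmurHash3.stringHash(s, MurmurHash3.stringSeed)
    case other     => other.##  // non-string terms unchanged in this sketch
  }
  ((h % numFeatures) + numFeatures) % numFeatures
}
{code}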
Note that changing the hash function would mean that models saved with a
previous version would have to be re-trained. This introduces a problem that's
orthogonal to breaking changes in APIs: breaking changes related to artifacts
(e.g., a saved model) produced by a previous version. Is there a policy or best
practice currently in effect about this? If not, perhaps we should come up with
a few simple rules about how we communicate such changes in release notes, etc.
> HashingTF should use MurmurHash3
> --------------------------------
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.5.0
> Reporter: Simeon Simeonov
> Priority: Critical
> Labels: HashingTF, hashing, mllib