[ https://issues.apache.org/jira/browse/SPARK-23469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893258#comment-16893258 ]
Huaxin Gao commented on SPARK-23469:
------------------------------------
I will work on this JIRA once PR https://github.com/apache/spark/pull/25250
(migrate the implementation of HashingTF from MLlib to ML) is merged.
> HashingTF should use corrected MurmurHash3 implementation
> ---------------------------------------------------------
>
> Key: SPARK-23469
> URL: https://issues.apache.org/jira/browse/SPARK-23469
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.4.0
> Reporter: Joseph K. Bradley
> Priority: Major
>
> [SPARK-23381] added a corrected MurmurHash3 implementation but left the old
> implementation alone. In Spark 2.3 and earlier, HashingTF will use the old
> implementation. (We should not backport a fix for HashingTF since it would
> be a major change of behavior.) But we should correct HashingTF in Spark
> 2.4; this JIRA is for tracking this fix.
> * Update HashingTF to use new implementation of MurmurHash3
> * Ensure backwards compatibility for ML persistence by having HashingTF use
> the old MurmurHash3 when a model from Spark 2.3 or earlier is loaded. We can
> add a Param to allow this.
> Also, HashingTF still calls into the old spark.mllib.feature.HashingTF, so I
> recommend we first migrate the code to spark.ml: [SPARK-21748]. We can leave
> spark.mllib alone and just fix MurmurHash3 in spark.ml.
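
The plan in the description above can be sketched roughly as follows. This is
an illustrative Python sketch, not Spark code. The function names
(murmur3_standard, murmur3_legacy, term_index), the saved_version parameter,
and the seed value are hypothetical. The legacy variant reflects one reading
of the SPARK-23381 discussion, namely that the old code mixed each trailing
byte as if it were a full 4-byte block; treat that as an assumption rather
than a statement about the actual Spark source.

```python
MASK = 0xFFFFFFFF  # keep all arithmetic within 32 bits


def _rotl(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK


def _mix_k1(k1):
    k1 = (k1 * 0xCC9E2D51) & MASK
    k1 = _rotl(k1, 15)
    return (k1 * 0x1B873593) & MASK


def _mix_h1(h1, k1):
    h1 = _rotl(h1 ^ k1, 13)
    return (h1 * 5 + 0xE6546B64) & MASK


def _fmix(h1, length):
    h1 ^= length
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & MASK
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & MASK
    return h1 ^ (h1 >> 16)


def _hash_blocks(data, seed):
    """Mix all complete little-endian 4-byte blocks; shared by both variants."""
    h1 = seed & MASK
    aligned = len(data) - len(data) % 4
    for i in range(0, aligned, 4):
        h1 = _mix_h1(h1, _mix_k1(int.from_bytes(data[i:i + 4], "little")))
    return h1, aligned


def murmur3_standard(data, seed=42):
    """MurmurHash3 x86_32 with the standard tail handling (seed is assumed)."""
    h1, aligned = _hash_blocks(data, seed)
    # The 1 to 3 trailing bytes are packed into one word and mixed once.
    h1 ^= _mix_k1(int.from_bytes(data[aligned:], "little"))
    return _fmix(h1, len(data))


def murmur3_legacy(data, seed=42):
    """Assumed legacy behavior: every trailing byte is mixed as a full block."""
    h1, aligned = _hash_blocks(data, seed)
    for b in data[aligned:]:
        signed = b - 256 if b >= 128 else b  # assumed sign-extended byte
        h1 = _mix_h1(h1, _mix_k1(signed & MASK))
    return _fmix(h1, len(data))


def term_index(term, num_features, saved_version=None):
    """Map a term to a feature index, using the legacy hash only for models
    whose (hypothetical) saved_version tuple is older than (2, 4)."""
    use_legacy = saved_version is not None and saved_version < (2, 4)
    hash_fn = murmur3_legacy if use_legacy else murmur3_standard
    h = hash_fn(term.encode("utf-8"))
    signed = h - (1 << 32) if h & 0x80000000 else h  # interpret as signed int
    return signed % num_features  # non-negative for a positive modulus
```

For inputs whose byte length is a multiple of 4 the two variants agree, so
only terms with a trailing partial block would map to different indices when
a model saved by an older version is loaded.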