Maciej Szymkiewicz created SPARK-19940:
------------------------------------------
Summary: FPGrowthModel.transform should skip duplicated items
Key: SPARK-19940
URL: https://issues.apache.org/jira/browse/SPARK-19940
Project: Spark
Issue Type: Bug
Components: ML
Affects Versions: 2.2.0
Reporter: Maciej Szymkiewicz
Priority: Minor
Due to a misplaced {{distinct}}, {{FPGrowthModel.transform}} generates duplicated
items in the "prediction" column:
{code}
scala> val data = spark.read.text("data/mllib/sample_fpgrowth.txt").select(split($"value", "\\s+").alias("features"))
data: org.apache.spark.sql.DataFrame = [features: array<string>]
scala> fpm.transform(Seq(Array("t", "s")).toDF("features")).show(1, false)
+--------+---------------------+
|features|prediction           |
+--------+---------------------+
|[t, s]  |[y, x, z, x, y, x, z]|
+--------+---------------------+
{code}
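The expected behaviour can be sketched in plain Scala, without Spark. This is only an illustration and not the actual {{FPGrowthModel}} code: the rule set, the object name and the {{predict}} helper are hypothetical. The point is that {{distinct}} has to be applied to the flattened consequents of all matching rules, so that an item suggested by several rules appears only once.

```scala
object DedupPredictionSketch {
  // Hypothetical association rules (antecedent -> consequent).
  // These are made up for illustration and are not taken from a fitted model.
  val rules: Seq[(Seq[String], Seq[String])] = Seq(
    Seq("t") -> Seq("y"),
    Seq("t") -> Seq("x"),
    Seq("s") -> Seq("x"),
    Seq("t", "s") -> Seq("y"),
    Seq("t", "s") -> Seq("z")
  )

  // Collect the consequents of every rule whose antecedent is contained in
  // `items`, drop items that are already present, then de-duplicate the
  // combined result (not an individual branch).
  def predict(items: Seq[String]): Seq[String] = {
    val itemset = items.toSet
    rules.flatMap { case (antecedent, consequent) =>
      if (antecedent.forall(itemset.contains)) consequent.filterNot(itemset.contains)
      else Seq.empty[String]
    }.distinct
  }

  def main(args: Array[String]): Unit = {
    // Without the final distinct this would be List(y, x, x, y, z).
    println(predict(Seq("t", "s"))) // prints List(y, x, z)
  }
}
```

With the final {{distinct}} removed, the same input yields repeated items, which is the shape of the output reported above.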