[
https://issues.apache.org/jira/browse/SPARK-36458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-36458:
-----------------------------------
Labels: pull-request-available (was: )
> MinHashLSH.approxSimilarityJoin should not require inputCol if output exists
> ----------------------------------------------------------------------------
>
> Key: SPARK-36458
> URL: https://issues.apache.org/jira/browse/SPARK-36458
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 3.1.1
> Reporter: Thai Thien
> Priority: Minor
> Labels: pull-request-available
>
> Refer to the documentation and example code for MinHashLSH:
>
> [https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance]
> The example states:
> We could avoid computing hashes by passing in the already-transformed
> dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
> However, inputCol is still required in transformedA and transformedB even if
> they already have outputCol.
> The following code should work, but it does not:
>
>
> {code:python}
> from pyspark.ml.feature import MinHashLSH
> from pyspark.ml.linalg import Vectors
> from pyspark.sql.functions import col
>
> # `spark` is assumed to be an existing SparkSession (e.g. the pyspark shell).
> dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
>          (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
>          (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
> dfA = spark.createDataFrame(dataA, ["id", "features"])
> dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
>          (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
>          (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
> dfB = spark.createDataFrame(dataB, ["id", "features"])
> key = Vectors.sparse(6, [1, 3], [1.0, 1.0])
>
> mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
> model = mh.fit(dfA)
>
> # Keep only the outputCol ("hashes"); the inputCol ("features") is dropped.
> transformedA = model.transform(dfA).select("id", "hashes")
> transformedB = model.transform(dfB).select("id", "hashes")
>
> # Reported to fail because "features" is missing from both inputs.
> model.approxSimilarityJoin(transformedA, transformedB, 0.6,
>                            distCol="JaccardDistance")\
>     .select(col("datasetA.id").alias("idA"),
>             col("datasetB.id").alias("idB"),
>             col("JaccardDistance")).show()
> {code}
> In the code above, I discard the `features` column but keep the `hashes`
> column, which is the output data.
> approxSimilarityJoin should work on `hashes` (the outputCol) alone when it
> exists, and ignore the absence of `features` (the inputCol).
> Being able to transform the data beforehand and drop inputCol would make the
> input data much smaller and prevent confusion about the tip "_We could avoid
> computing hashes by passing in the already-transformed dataset_".
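
For reference, the pairs that the join in the report could be expected to return can be worked out by hand. The sketch below is plain Python, not Spark: it computes the exact Jaccard distance between the non-zero index sets of the vectors in dataA and dataB and keeps the pairs below the 0.6 threshold. The helper name exact_similarity_join is hypothetical, and the strict "<" comparison is an assumption; the real approxSimilarityJoin is approximate and may return fewer pairs.

```python
from itertools import product

# Non-zero indices of the sparse vectors from the report, keyed by row id.
rows_a = {0: {0, 1, 2}, 1: {2, 3, 4}, 2: {0, 2, 4}}
rows_b = {3: {1, 3, 5}, 4: {2, 3, 5}, 5: {1, 2, 4}}


def jaccard_distance(x, y):
    """Return 1 - |intersection| / |union| for two non-empty index sets."""
    return 1.0 - len(x & y) / len(x | y)


def exact_similarity_join(left, right, threshold):
    """Hypothetical helper: brute-force every pair and keep those below threshold."""
    pairs = []
    for (id_a, x), (id_b, y) in product(left.items(), right.items()):
        dist = jaccard_distance(x, y)
        if dist < threshold:  # assumption: strict comparison
            pairs.append((id_a, id_b, dist))
    return pairs


for id_a, id_b, dist in exact_similarity_join(rows_a, rows_b, 0.6):
    print(id_a, id_b, dist)
```

With this data, four pairs fall below the threshold: (0, 5), (1, 4), (1, 5) and (2, 5), each at distance 0.5.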