[jira] [Updated] (SPARK-36458) MinHashLSH.approxSimilarityJoin should not required inputCol if output exist

ASF GitHub Bot (Jira) Wed, 13 May 2026 05:27:35 -0700


     [ https://issues.apache.org/jira/browse/SPARK-36458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


ASF GitHub Bot updated SPARK-36458:
-----------------------------------
    Labels: pull-request-available  (was: )

> MinHashLSH.approxSimilarityJoin should not required inputCol if output exist
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-36458
>                 URL: https://issues.apache.org/jira/browse/SPARK-36458
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.1.1
>            Reporter: Thai Thien
>            Priority: Minor
>              Labels: pull-request-available
>
> See the documentation and example code for MinHashLSH:
>  
> [https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance]
> The example states:
> "We could avoid computing hashes by passing in the already-transformed 
> dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`"
> However, inputCol is still required in transformedA and transformedB even 
> if they already contain outputCol.
> The following code should work, but it does not:
>  
>  
> {code:python}
> # Assumes an active SparkSession named `spark`, as in the pyspark shell.
> from pyspark.ml.feature import MinHashLSH
> from pyspark.ml.linalg import Vectors
> from pyspark.sql.functions import col
>
> dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
>          (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
>          (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
> dfA = spark.createDataFrame(dataA, ["id", "features"])
>
> dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
>          (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
>          (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
> dfB = spark.createDataFrame(dataB, ["id", "features"])
>
> key = Vectors.sparse(6, [1, 3], [1.0, 1.0])
>
> mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
> model = mh.fit(dfA)
>
> # Keep only the outputCol ("hashes"); drop the inputCol ("features").
> transformedA = model.transform(dfA).select("id", "hashes")
> transformedB = model.transform(dfB).select("id", "hashes")
>
> # This join is the call that does not work once "features" is dropped.
> model.approxSimilarityJoin(transformedA, transformedB, 0.6,
>                            distCol="JaccardDistance")\
>     .select(col("datasetA.id").alias("idA"),
>             col("datasetB.id").alias("idB"),
>             col("JaccardDistance")).show()
> {code}
> In the code above, I discard the `features` column but keep the `hashes` 
> column, which is the output data. 
> approxSimilarityJoin should only need `hashes` (the outputCol), which 
> exists, and should ignore the missing `features` (the inputCol).
> Being able to transform the data beforehand and drop inputCol would make 
> the input data much smaller and would prevent confusion about the tip "_We 
> could avoid computing hashes by passing in the already-transformed dataset_".
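
For reference, the exact Jaccard distances for the sample data in the issue can be computed without Spark. The sketch below is plain Python and only illustrates which pairs fall under the 0.6 threshold; it is not how MinHashLSH works internally, and the names `jaccard_distance`, `sets_a`, `sets_b` and `close_pairs` are hypothetical helpers, not part of any Spark API. Each sparse vector is represented by the set of its non-zero indices.

```python
def jaccard_distance(a, b):
    """Exact Jaccard distance between two non-empty index sets."""
    union = a | b
    if not union:
        raise ValueError("Jaccard distance is undefined for two empty sets")
    return 1.0 - len(a & b) / len(union)


# Non-zero indices of the sparse vectors in dataA and dataB, keyed by id.
sets_a = {0: {0, 1, 2}, 1: {2, 3, 4}, 2: {0, 2, 4}}
sets_b = {3: {1, 3, 5}, 4: {2, 3, 5}, 5: {1, 2, 4}}

threshold = 0.6

# Brute-force comparison of every (idA, idB) pair; keep those under the threshold.
close_pairs = [
    (id_a, id_b, jaccard_distance(set_a, set_b))
    for id_a, set_a in sets_a.items()
    for id_b, set_b in sets_b.items()
    if jaccard_distance(set_a, set_b) < threshold
]

for id_a, id_b, dist in close_pairs:
    print(id_a, id_b, dist)
```

Because approxSimilarityJoin is approximate, it may return only a subset of these pairs.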



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

