shengzhang created SPARK-36714:
----------------------------------

             Summary: bugs in MinHashLSH
                 Key: SPARK-36714
                 URL: https://issues.apache.org/jira/browse/SPARK-36714
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.1.1
            Reporter: shengzhang
             Fix For: 2.1.1


This is about the MinHashLSH algorithm.
To compute the similarity join of DataFrames DFA and DFB, I used the MinHashLSH
approxSimilarityJoin function, but some rows are missing from the result.
The example in the documentation works without problems:
[https://spark.apache.org/docs/latest/ml-features.html#minhash-for-jaccard-distance|http://example.com]

However, when the data comes from a distributed system (Hive, more than one node),
some rows are missing. For example, when vectorA = vectorB, the pair does not
appear in the result of approxSimilarityJoin, even though "threshold" is
greater than 1.

I think the problem may be in this code:
{code:java}
// part1: the random coefficients are drawn once, when the model is created
override protected[ml] def createRawLSHModel(inputDim: Int): MinHashLSHModel1 = {
  require(inputDim <= MinHashLSH.HASH_PRIME,
    s"The input vector dimension $inputDim exceeds the threshold ${MinHashLSH.HASH_PRIME}.")
  val rand = new Random($(seed))
  val randCoefs: Array[(Int, Int)] = Array.fill($(numHashTables)) {
    (1 + rand.nextInt(MinHashLSH.HASH_PRIME - 1), rand.nextInt(MinHashLSH.HASH_PRIME - 1))
  }
  new MinHashLSHModel1(uid, randCoefs)
}

// part2: each input vector is hashed with the stored coefficients
@Since("2.1.0")
override protected[ml] val hashFunction: Vector => Array[Vector] = {
  elems: Vector => {
    require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.")
    val elemsList = elems.toSparse.indices.toList
    // One min-hash value per (a, b) pair, taken over the non-zero indices
    val hashValues = randCoefficients.map { case (a, b) =>
      elemsList.map { elem: Int =>
        ((1 + elem) * a + b) % MinHashLSH.HASH_PRIME
      }.min.toDouble
    }
    // TODO: Output vectors of dimension numHashFunctions in SPARK-18450
    hashValues.map(Vectors.dense(_))
  }
}
{code}
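
To make the hashing step in part2 easier to test in isolation, here is a minimal standalone sketch with no Spark dependency. The object name MinHashSketch, the method minHash, and the coefficient values in the usage note are hypothetical and not part of Spark. The sketch uses Long arithmetic to avoid Int overflow, which is a choice of this sketch and differs from the code quoted above.

```scala
object MinHashSketch {
  // Same value as MinHashLSH.HASH_PRIME in Spark.
  val HashPrime: Int = 2038074743

  // Computes one min-hash value per (a, b) coefficient pair,
  // taken over the non-zero indices of a sparse vector.
  // Long arithmetic is an assumption of this sketch, used to avoid overflow.
  def minHash(indices: Seq[Int], coefs: Array[(Int, Int)]): Array[Long] = {
    require(indices.nonEmpty, "Must have at least 1 non zero entry.")
    coefs.map { case (a, b) =>
      indices.map(i => ((1L + i) * a + b) % HashPrime).min
    }
  }
}
```

For a fixed coefficient array, the function is deterministic: calling MinHashSketch.minHash(Seq(0, 2, 5), Array((3, 7), (11, 5))) twice returns the same values, so two equal vectors only receive different hashes if the coefficients themselves differ.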

val r1 = new scala.util.Random(1)
r1.nextInt(1000)  // -> 985

val r2 = new scala.util.Random(2)
r2.nextInt(1000)  // -> 108

val r3 = new scala.util.Random(1)
r3.nextInt(1000)  // -> 985, because it is seeded just like r1
r3.nextInt(1000)  // -> 588

The reason may be the behavior shown above. If Random is initialized only once,
each call to nextInt() returns a different result, as with r3
(985 on the first call, 588 on the second).

So I think it would be better to move
val rand = new Random($(seed))
from createRawLSHModel into hashFunction. Every worker would then initialize
its own Random instance, and every worker would get the same values.
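
As a rough illustration of the proposal, the sketch below re-creates the generator from the seed on every call, so each caller derives the same coefficients independently. The object SeededCoefs, the method coefsFor, and its bound parameter are hypothetical names for this example only. It is not the Spark implementation.

```scala
object SeededCoefs {
  // Hypothetical helper: builds a fresh Random from the seed on every call,
  // so the returned (a, b) pairs depend only on the arguments.
  def coefsFor(seed: Long, numHashTables: Int, bound: Int): Array[(Int, Int)] = {
    val rand = new scala.util.Random(seed)
    Array.fill(numHashTables) {
      (1 + rand.nextInt(bound - 1), rand.nextInt(bound - 1))
    }
  }
}
```

Two calls such as SeededCoefs.coefsFor(1L, 3, 1000) return element-wise equal arrays, whereas repeated nextInt calls on one shared Random instance keep advancing its sequence.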

 



