Github user merlintang commented on the issue:

    https://github.com/apache/spark/pull/16965
  
    @Yunni Yes, we can use AND-OR amplification to increase the collision 
probability by using more numHashTables and numHashFunctions. As a further 
extension for users, if their hash function has a lower collision probability, 
OR-AND could be used instead.  
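
    To make the two compositions concrete, here is a minimal sketch. It 
assumes p is the collision probability of a single hash function, k plays the 
role of numHashFunctions, and L plays the role of numHashTables; the function 
names are only for illustration and are not part of the Spark API.

```python
def and_or(p: float, k: int, L: int) -> float:
    """AND over k functions inside each table, then OR across L tables."""
    return 1.0 - (1.0 - p ** k) ** L


def or_and(p: float, k: int, L: int) -> float:
    """OR over k functions inside each group, then AND across L groups."""
    return (1.0 - (1.0 - p) ** k) ** L
```

    Evaluating both for a few values of p, k and L shows how each composition 
reshapes the single-function probability, which is what the parameter choice 
has to be tuned against.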
    
    (1) We do not need to change Array[Vector], numHashTables, or 
numHashFunctions. We only need to change the function that computes the hash 
distance (i.e., hashDistance), as well as the sameBucket function in 
approxNearestNeighbors.
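
    A rough sketch of what I mean, in plain Python rather than Spark code. It 
assumes each row's hashes are a list with one entry per hash table, and each 
entry holds numHashFunctions values; the names same_bucket and hash_distance 
are illustrative, and this is one possible definition, not necessarily what 
the PR ends up with.

```python
from typing import Sequence

# One inner sequence per hash table, each holding numHashFunctions values.
HashTables = Sequence[Sequence[int]]


def same_bucket(x: HashTables, y: HashTables) -> bool:
    """OR across tables of (AND across the hash functions in a table)."""
    return any(tuple(a) == tuple(b) for a, b in zip(x, y))


def hash_distance(x: HashTables, y: HashTables) -> int:
    """Smallest number of differing hash values over all tables."""
    return min(
        sum(1 for u, v in zip(a, b) if u != v)
        for a, b in zip(x, y)
    )
```

    With this definition, a distance of 0 means at least one table matches 
completely, which is the same condition as same_bucket.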
    
    (3) For the similarity join, I have one question: if you do a join based 
on the hashed values of the input tuples, the join key would be 
array(vector). Am I right? If so, does this meet OR-amplification? Please 
correct me if I am wrong. 
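
    To illustrate the concern: if the whole array were the join key, two rows 
would match only when every table agrees, which is an AND across tables. One 
way to get OR semantics is to join on (table index, hash vector) pairs, one 
pair per table. The sketch below does that with plain dictionaries; the 
function name and the input layout are assumptions for illustration, and I 
have not checked that the PR joins this way.

```python
from collections import defaultdict


def candidate_pairs(left, right):
    """left/right map a row id to a list of hash vectors, one per table.

    Two rows become a candidate pair if they agree on the full hash
    vector of at least one table (OR across tables).
    """
    buckets = defaultdict(list)
    for rid, tables in right.items():
        for t, vec in enumerate(tables):
            buckets[(t, tuple(vec))].append(rid)

    pairs = set()
    for lid, tables in left.items():
        for t, vec in enumerate(tables):
            for rid in buckets.get((t, tuple(vec)), ()):
                pairs.add((lid, rid))
    return pairs
```

    Keeping the table index in the key matters: the same hash vector 
appearing in two different tables should not count as a match.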
    
    (4) For the index part, I think it would work. It is pretty similar to 
the routing table idea in GraphX. Thus, I think we can create another 
data frame with the same partitioner as the input data frame; the newly 
created data frame would then contain the index for the input tables without 
disturbing the input data frame. 
    
    (5) The other major concern is memory overhead. Can we reduce the 
memory usage of the output hash values, i.e., array(vector)? Users have said 
that the current approach uses a large amount of memory. One way would be to 
use bits to represent the hashed values for min-hash; another would be to use 
sparse vectors. What do you think? 
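
    As a rough illustration of the bit idea: keep only the lowest b bits of 
each min-hash value and pack them into one integer. This is a sketch in the 
spirit of b-bit minwise hashing, not the current Spark representation; the 
function name and the parameter b are hypothetical.

```python
def pack_low_bits(hashes, b=1):
    """Pack the lowest b bits of each hash value into a single int."""
    mask = (1 << b) - 1
    packed = 0
    for h in hashes:
        packed = (packed << b) | (h & mask)
    return packed
```

    The trade-off is that truncated values collide more often by chance, so 
similarity estimates computed from them would need to account for that.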

