Github user merlintang commented on the issue:
https://github.com/apache/spark/pull/16965
@Yunni Yes, we can use the AND-OR construction to increase the collision
probability by using more numHashTables and numHashFunctions. As a further
extension, if users have a hash family with a lower collision probability,
the OR-AND construction could be used instead.
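To make the amplification trade-off concrete, here is a small sketch (plain
Scala, not part of this PR) of how the collision probability of a pair
changes, where p is the collision probability under a single hash function:

```scala
// AND-OR amplification: numHashFunctions are ANDed inside each hash table,
// and the numHashTables tables are ORed together.
def andOr(p: Double, numHashFunctions: Int, numHashTables: Int): Double =
  1.0 - math.pow(1.0 - math.pow(p, numHashFunctions), numHashTables)

// OR-AND amplification: the dual construction, which can help when the base
// hash family has a low collision probability.
def orAnd(p: Double, numHashFunctions: Int, numHashTables: Int): Double =
  math.pow(1.0 - math.pow(1.0 - p, numHashTables), numHashFunctions)

// Example: similar pairs (p = 0.8) get boosted, dissimilar pairs (p = 0.3)
// get suppressed under AND-OR.
println(andOr(0.8, 4, 10)) // ~0.99
println(andOr(0.3, 4, 10)) // ~0.08
```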
(1) We do not need to change Array[Vector], numHashTables, or
numHashFunctions; we only need to change the function that computes the hash
distance (i.e., hashDistance), as well as the sameBucket function in
approxNearestNeighbors.
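For example, a hashDistance over the Array[Vector] signatures that stays
compatible with OR-amplification could take the minimum per-table distance,
so a pair counts as close as soon as one table agrees. This is only my rough
sketch of the idea, not the code in this PR:

```scala
import org.apache.spark.ml.linalg.Vector

// Sketch: x and y each hold one hash Vector per hash table. Taking the
// minimum over tables matches OR-amplification: the pair is close if at
// least one table brings the two points together.
def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double =
  x.zip(y).map { case (hx, hy) =>
    // fraction of hash functions in this table that disagree
    val diffs = hx.toArray.zip(hy.toArray).count { case (a, b) => a != b }
    diffs.toDouble / hx.size
  }.min
```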
(3) For the similarity join, I have one question: if the join is done on the
hashed values of the input tuples, would the join key be Array[Vector]? If
so, does that still give OR-amplification? Please correct me if I am wrong.
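To make the question concrete, this is the kind of exploded join I am
thinking of (hypothetical column names, not this PR's code), where the join
key is the per-table hash rather than the whole Array[Vector]:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, posexplode}

// Sketch: `left` and `right` each have an "id" column and a "hashes" column
// of type Array[Vector] produced by an LSH model.
def explodedJoin(left: DataFrame, right: DataFrame): DataFrame = {
  val l = left.select(col("id").as("idA"),
    posexplode(col("hashes")).as(Seq("table", "hashValue")))
  val r = right.select(col("id").as("idB"),
    posexplode(col("hashes")).as(Seq("table", "hashValue")))
  // A candidate pair is emitted as soon as any single table buckets the two
  // rows together.
  l.join(r, Seq("table", "hashValue")).select("idA", "idB").distinct()
}
```

If the join key looks like this, then matching in any one table is enough,
which is what I would expect from OR-amplification.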
(4) For the index part, I think it would work. It is quite similar to the
routing table idea in GraphX. We can create another data frame with the same
partitioner as the input data frame; the newly created data frame would then
hold the index for the input tables without disturbing the original data
frame.
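To sketch what I mean (the id column, the "hashes" output column, and the
helper are all made up for illustration):

```scala
import org.apache.spark.ml.feature.MinHashLSHModel
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch: build a separate index data frame that lives next to the input
// without modifying it. Deriving both from the same repartitioned parent
// keeps them co-partitioned, similar in spirit to GraphX routing tables.
def buildIndex(input: DataFrame, model: MinHashLSHModel,
               numPartitions: Int): (DataFrame, DataFrame) = {
  val partitioned = input.repartition(numPartitions, col("id")).cache()
  // the index holds only (id, hashes); the original columns stay untouched
  // in `partitioned` (assumes the model's outputCol is set to "hashes")
  val index = model.transform(partitioned).select("id", "hashes")
  (partitioned, index)
}
```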
(5) The other major concern is memory overhead. Can we reduce the memory
usage of the output hash values, i.e., Array[Vector]? Users have reported
that the current representation consumes a lot of memory. One option is to
represent the MinHash values with bits; another is to use sparse vectors.
What do you think?
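As a rough illustration of the two options (the bit width and helper names
are made up, this is not code from the PR):

```scala
import org.apache.spark.ml.linalg.{SparseVector, Vector}

// Option A: pack each MinHash value into `bitsPerValue` bits inside Long
// words instead of spending one Double (8 bytes) per value.
def packBits(hashes: Array[Long], bitsPerValue: Int): Array[Long] = {
  require(bitsPerValue > 0 && bitsPerValue < 64 && 64 % bitsPerValue == 0)
  val perWord = 64 / bitsPerValue
  val mask = (1L << bitsPerValue) - 1
  val packed = new Array[Long]((hashes.length + perWord - 1) / perWord)
  hashes.zipWithIndex.foreach { case (h, i) =>
    packed(i / perWord) |= (h & mask) << ((i % perWord) * bitsPerValue)
  }
  packed
}

// Option B: keep the Vector type but store it sparsely, so only non-zero
// entries are paid for.
def toSparseHash(v: Vector): SparseVector = v.toSparse
```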