AlexanderSaydakov commented on issue #414: URL: https://github.com/apache/datasketches-java/issues/414#issuecomment-1252897851
Yes, I believe that the Theta sketch is state of the art for approximate intersections. The problem is not quite that the "cardinality difference is high". A better way to describe it is that the Jaccard similarity is very low. This can also happen when intersecting large sets that have a very small overlap. I am afraid that this is a fundamental problem with approximate set operations. You could try improving accuracy by increasing the sketch size, which brings you closer to a brute-force "exact" solution. On the other hand, if the overlap of two sets is very small (orders of magnitude smaller than the sets themselves), is it really so important that the relative accuracy is bad? Say the intersection of two sets with a billion items comes out as one hundred items. Even if that answer is 100% off (say, the true answer is 50), ask yourself whether this is a problem in practice. How much would you pay to have better accuracy?
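
To make the point about sketch size concrete, here is a rough toy model in Python. It is not the DataSketches implementation: the function names (`toy_sketch`, `intersection_estimate`, `expected_relative_error`) are made up for this illustration, SHA-256 is just a convenient stable hash, and the error formula assumes each overlapping item is retained independently with probability theta.

```python
import hashlib
import math

MAX_HASH = 2 ** 64  # hashes are integers in [0, MAX_HASH)


def hash64(item) -> int:
    """Stable 64-bit hash of an item (SHA-256 is an arbitrary choice here)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def toy_sketch(items, k):
    """Toy KMV-style sketch: keep the k smallest hashes.

    Returns (theta, retained_hashes). theta is the (k+1)-th smallest hash,
    or MAX_HASH when everything fits, in which case the sketch is exact.
    """
    hashes = sorted(hash64(x) for x in items)
    if len(hashes) <= k:
        return MAX_HASH, set(hashes)
    return hashes[k], set(hashes[:k])


def intersection_estimate(a, b):
    """Count hashes retained by both sketches below the smaller theta, then scale up."""
    theta = min(a[0], b[0])
    common = sum(1 for h in a[1] & b[1] if h < theta)
    return common / (theta / MAX_HASH)


def expected_relative_error(overlap, theta):
    """Std. dev. / mean of the estimate under the independent-sampling assumption.

    overlap: true intersection size; theta: sampling fraction in (0, 1].
    """
    return math.sqrt((1 - theta) / (overlap * theta))


if __name__ == "__main__":
    a_items = range(0, 20_000)        # 20,000 items
    b_items = range(19_900, 39_900)   # 20,000 items, 100 of them shared with a_items
    for k in (256, 4096, 20_000):
        est = intersection_estimate(toy_sketch(a_items, k), toy_sketch(b_items, k))
        print(f"k={k:>6}  estimate={est:8.1f}  (true overlap is 100)")
```

Under this model, a sketch of 4096 entries over a billion-item set samples at roughly theta = 4096 / 1e9, so an overlap of 100 items contributes far less than one retained hash on average and the relative error is very large. Only when k approaches the set size does the estimate become exact, which is the brute-force end of the trade-off.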
