leerho commented on issue #481:
URL:
https://github.com/apache/datasketches-java/issues/481#issuecomment-1842232679
@edmondliuTTD
Thank you for your contribution and you interest in our library. It is
clear you have studied the KLL code somewhat, however, studying your changes
for just a few minutes, I can see that you are having to touch several very
critical areas of the code, which means we will have to study what you are
attempting to do quite carefully. I am not discounting your idea, out-of-hand,
but I do have some more questions for you:
Sketches are probabilistic state machines and in order to fully understand
their behavior over a wide range of possible inputs we have to perform
exhaustive testing that stresses the sketch with millions of trials.
- So my first question is what testing have you done that could begin to
establish that your changes do not violate the statistical properties of the
KLL algorithm?
- Have you tested update speed performance, space performance, and merge
accuracy and speed performance?
- It appears that you are not taking advantage of the logarithmic properties
of the levels array. This means that your performance improvement may not be
much better than externally feeding the sketch from a loop.
- It appears that some of your changes are a matter of style and have
nothing to do with your targeted changes, e.g., moving static imports from the
top of the set of imports to the bottom. Or replacing our multiple specific
static imports with "*" imports, which is not a good practice. This creates
unnecessary review overload for us, as we have to ask, why did you do that?
- We would like to know more about the specific problem you are trying to
solve. And is the lack of being able to update multiple identical items at once
a severe impediment? Can you give us some examples? How big is your total
data set? what fraction of these are duplicates? How critical is the update
speed performance?
I'm sure we will have more comments and questions as we dig deeper into this.
Cheers
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]