tdunning commented on issue #409: URL: https://github.com/apache/datasketches-cpp/issues/409#issuecomment-1868164562
The compression on the inputs is not necessary. It may have a positive effect on accuracy in a few cases, but I strongly doubt the effect will be important. In practice, most of the digests being merged will previously have been persisted and thus will have already been compressed. Just allocate a buffer big enough for all the live centroids in all of the merging digests, copy in the centroids and sort/compress. A final copy to a trimmed buffer is a nice touch, but only necessary if the intermediate buffer is massive. An alternative would be to simply pour the centroids in, one digest at a time. Putting in part of a digest at a time will lead to situations where the invariant may be violated transiently until you get the rest of the data in. I haven't analyzed that scenario in detail so I can't say that it will be completely safe. The case where you add an entire t-digest at a time has been analyzed, however, and is safe. This extends to adding multiple complete digests. I like and sympathize with keeping arguments const and might well have preferred that approach were I to do it over again. On Fri, Dec 22, 2023 at 2:23 PM Alexander Saydakov ***@***.***> wrote: > I am looking at the code to merge t-digests. I don't like that a modifying > operation compress() is called on the input t-digests. I wonder whether > this is avoidable. I would think this is done for simplicity to avoid > dealing with unmerged buffer in the incoming t-digests, however the > compress() method forces compression even if the buffer is empty. > > There is a practice in C++ to pass const reference to the input object. > Having a requirement of a mutable input is quite ugly. And forcing > compression unnecessarily is even worse. I am not sure how to deal with > this. > > — > Reply to this email directly, view it on GitHub > <https://github.com/apache/datasketches-cpp/issues/409#issuecomment-1868105273>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AAB5E6RT7HJBQ7GLML3F6P3YKYB5BAVCNFSM6AAAAABANDLXNOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNRYGEYDKMRXGM> > . > You are receiving this because you commented.Message ID: > ***@***.***> > -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
