Re: [I] Generate snapshots programmatically (datasketches-rust)

via GitHub Tue, 16 Dec 2025 22:04:20 -0800


leerho commented on issue #10:
URL: 
https://github.com/apache/datasketches-rust/issues/10#issuecomment-3663796398


   As I mentioned in the [Discussions 
thread](https://github.com/apache/datasketches-rust/discussions/4#discussioncomment-15226056),
 we don't claim to have the perfect solution to cross-language testing.  
How this evolved is historical.  We first had two languages, C++ and Java, when 
we set up the current scheme. Then we added Python, later Go, and now Rust.  We 
clearly need a better solution!
   
   Having a test repo that holds a common set of snapshots seems quite logical. 
It would only be a dependency for tests, so it wouldn't burden the runtime size 
of any of the language component libraries.  
   
   There are some issues to think about.
   
   - Our serialization format is designed for compactness and speed.  
Compactness is critical because there can be millions of these sketch images in 
the memory of large systems and any structural space overhead in these images 
will quickly add up.  Speed is important because we want to be able to 
serialize and deserialize quickly.  In some cases we can merge two sketch 
images without having to deserialize them -- by reading the images directly.  
    
   - For all the sketches the first 3 bytes of the serialized image have the 
same layout: 
       - **PreLongs (or PreInts)** (byte) gives the size of the metadata region 
of the image.
       - **SerVer** (byte) This is the serialization version of the image 
structure.  This has remained constant for many years, as our goal is to be 
able to _read_ any sketch with a prior serialization version for backwards 
compatibility.
       - **FamID** (byte) This is a simple unique sequence number for each of 
the sketch families.  The enum for this is in the DS-java repo, and so far 
we have 21 sketch families. There can be multiple types of sketches in one 
family, but it is up to the deserializers in each family to figure that out. 
       - The remainder of the first 8 bytes are core values that almost every 
sketch needs (for example, K, flags, and seedHash), even if the sketch is empty. 
    
   - Some of our Sketches (e.g., KLL, Classic Quantiles, REQ) are purely 
probabilistic in nature, which means that feeding them the same data in exactly 
the same order will produce a slightly different result each time -- but it 
will be within the confidence interval with a specified probability defined by 
the configuration of the sketch.   For these sketches it is critical that they 
not be modified to produce a deterministic result!
       - For these sketches, the accuracy of the sketch depends on injecting 
randomness into the algorithm.  If you defeat that, we can no longer guarantee 
what the error properties of the sketch will be -- and it will likely be biased 
in an unknown direction by an unknown amount.
       - Because of this, our tests do not depend on a byte-for-byte comparison 
of the sketch images.  Instead, we test statistically that the sketch's result 
is within bounds of the desired result with the expected probability.
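
As an illustration of the three shared header bytes described above, here is a minimal Rust sketch that reads them from the start of a serialized image. The names `CommonHeader` and `read_common_header` are hypothetical and are not part of any DataSketches library. The byte order follows the order in which the fields are listed above, and each field is assumed to occupy its whole byte, which may not hold for every sketch family.

```rust
/// The three leading bytes that every serialized sketch image is described
/// as sharing. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
struct CommonHeader {
    /// PreLongs (or PreInts): size of the metadata region of the image.
    pre_longs: u8,
    /// SerVer: serialization version of the image structure.
    ser_ver: u8,
    /// FamID: sequence number identifying the sketch family.
    family_id: u8,
}

/// Reads the shared header from the front of `image`.
/// Returns `None` when the image holds fewer than three bytes.
/// Assumption: each field uses a full byte, in the order listed above.
fn read_common_header(image: &[u8]) -> Option<CommonHeader> {
    match image {
        [pre_longs, ser_ver, family_id, ..] => Some(CommonHeader {
            pre_longs: *pre_longs,
            ser_ver: *ser_ver,
            family_id: *family_id,
        }),
        _ => None,
    }
}
```

A cross-language snapshot test could use a check like this to confirm that an image produced by another implementation at least carries the expected family and serialization version before attempting a full deserialization.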
   
   
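
To make the contrast with byte-for-byte comparison concrete, the following Rust sketch shows one simple shape such a statistical check could take. It assumes the per-trial errors (estimate minus true value) have already been collected by running a sketch many times; the function names, `epsilon`, and `confidence` are hypothetical, and the actual DataSketches tests may use a different procedure.

```rust
/// Fraction of trials whose absolute error is at most `epsilon`.
/// Returns 0.0 for an empty slice so that an empty run never passes.
fn fraction_within(errors: &[f64], epsilon: f64) -> f64 {
    if errors.is_empty() {
        return 0.0;
    }
    let hits = errors.iter().filter(|e| e.abs() <= epsilon).count();
    hits as f64 / errors.len() as f64
}

/// True when at least `confidence` (a fraction in 0.0..=1.0) of the trials
/// landed within `epsilon` of the true value.
fn meets_error_bound(errors: &[f64], epsilon: f64, confidence: f64) -> bool {
    fraction_within(errors, epsilon) >= confidence
}
```

With a finite number of trials, a check like this can itself fail occasionally by chance, so the threshold used in a test would normally leave some margin below the sketch's nominal confidence level.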
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

