Re: [D] Quick summary of the key sketch classes in the Java Theta hierarchy (datasketches-rust)

via GitHub Tue, 30 Dec 2025 11:09:49 -0800


GitHub user leerho edited a discussion: Quick summary of the key sketch classes 
in the Java Theta hierarchy


- The **ThetaSketch** is mostly an abstract "interface" with common methods. It 
has no data and is **immutable**.
- The **UpdatableThetaSketch** extends **ThetaSketch** and manages a hash 
table, which is **mutable**. 
- The **CompactThetaSketch** compacts the hash table and optionally sorts it 
and is **immutable**.  It is the optimal form for storage and transport. The 
CompactThetaSketch is the simplest Theta sketch as it only has the cache of 
hash data and the value _theta_.  It does not store the K value, for example, 
as it doesn't need it.  The estimate is simply computed as the size of the 
cache divided by theta.  Depending on how it is derived the size of the cache 
can be smaller than or larger than K.
- The "Direct*" classes allow for off-heap managment of the sketches in both 
mutable and immutable forms.
- The "Empty*" and "Single*" classes are designed for small size (thus speed).  
Because of the natural power-law distribution of stream sizes prevalent in big 
data, expect to observe vastly more single item sketches than big ones, thus 
the importance of optimizing the performance of the small sketches.
- The "WrappedCompactCompressedSketch" takes advantage of the fact that 
sketches created from long streams have lots of leading zeros amongst the 
hashes, which can be compressed out. This will reduce the size of these 
sketches by up to 40% or so.
- The "Concurrent*" classes were actually an experiment by a scientist from 
Haifa, where they were able to achieve an order-of-magnitude increase in 
throughput in processing very long streams with a single sketch (but with lots 
of threads).  This work resulted in a published research paper.   I wouldn't 
bother with it.  In general, all of our sketches are single threaded (not 
thread safe) as concurrency is usually handled at a higher level.  It is far 
simpler to just launch a lot of single threaded sketches in parallel and then 
merge them at the end.  The merge operation is quite fast. 
- The "*QuickSelect*" classes are named after the 
[Quickselect](https://en.wikipedia.org/wiki/Quickselect) algorithm, which is 
used to find a new, smaller value of _theta_, once the cache has reached its 
maximum size.  It is basically half of the QuickSort algorithm

I doubt you need all this complexity in Rust, but I'll leave it to you to 
figure out :)

GitHub link: https://github.com/apache/datasketches-rust/discussions/50

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [D] Quick summary of the key sketch classes in the Java Theta hierarchy (datasketches-rust)

Reply via email to