GitHub user leerho edited a discussion: Quick summary of the key sketch classes in the Java Theta hierarchy
- The **ThetaSketch** is mostly an abstract "interface" with common methods. It has no data and is **immutable**. - The **UpdatableThetaSketch** extends **ThetaSketch** and manages a hash table, which is **mutable**. - The **CompactThetaSketch** compacts the hash table and optionally sorts it and is **immutable**. It is the optimal form for storage and transport. The CompactThetaSketch is the simplest Theta sketch as it only has the cache of hash data and the value _theta_. It does not store the K value, for example, as it doesn't need it. The estimate is simply computed as the size of the cache divided by theta. Depending on how it is derived the size of the cache can be smaller than or larger than K. - The "Direct*" classes allow for off-heap managment of the sketches in both mutable and immutable forms. - The "Empty*" and "Single*" classes are designed for small size (thus speed). Because of the natural power-law distribution of stream sizes prevalent in big data, expect to observe vastly more single item sketches than big ones, thus the importance of optimizing the performance of the small sketches. - The "WrappedCompactCompressedSketch" takes advantage of the fact that sketches created from long streams have lots of leading zeros amongst the hashes, which can be compressed out. This will reduce the size of these sketches by up to 40% or so. - The "Concurrent*" classes were actually an experiment by a scientist from Haifa, where they were able to achieve an order-of-magnitude increase in throughput in processing very long streams with a single sketch (but with lots of threads). This work resulted in a published research paper. I wouldn't bother with it. In general, all of our sketches are single threaded (not thread safe) as concurrency is usually handled at a higher level. It is far simpler to just launch a lot of single threaded sketches in parallel and then merge them at the end. The merge operation is quite fast. - The "*QuickSelect*" classes are named after the [Quickselect](https://en.wikipedia.org/wiki/Quickselect) algorithm, which is used to find a new, smaller value of _theta_, once the cache has reached its maximum size. It is basically half of the QuickSort algorithm I doubt you need all this complexity in Rust, but I'll leave it to you to figure out :) GitHub link: https://github.com/apache/datasketches-rust/discussions/50 ---- This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
