tisonkun commented on code in PR #1:
URL: https://github.com/apache/datasketches-rust/pull/1#discussion_r2616636287

##########
src/hll/sketch.rs:
##########
@@ -0,0 +1,422 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! HyperLogLog sketch implementation
+//!
+//! This module provides the main [`HllSketch`] struct, which is the primary interface
+//! for creating and using HLL sketches for cardinality estimation.
+//!
+//! # Adaptive Mode System
+//!
+//! The sketch automatically transitions between three internal modes based on cardinality:
+//!
+//! - **List mode**: Stores individual coupons in a compact list for small cardinalities.
+//!   Used when fewer than ~32 unique values have been seen.
+//!
+//! - **Set mode**: Uses a hash set with open addressing for medium cardinalities.
+//!   Provides better performance than list mode while still being space-efficient.
+//!   The set grows dynamically until it reaches K/8 entries.
+//!
+//! - **HLL mode**: Uses the full HLL array (Array4, Array6, or Array8) for large cardinalities.
+//!   Provides constant memory usage and accurate estimates for billions of unique values.
+//!
+//! Mode transitions are automatic and transparent to the user. Each promotion preserves
+//! all previously observed values and maintains estimation accuracy.
+//!
+//! # Serialization
+//!
+//! Sketches can be serialized and deserialized while preserving all state, including:
+//! - Current mode and HLL type
+//! - All observed values (coupons or register values)
+//! - HIP accumulator state for accurate estimation
+//! - Out-of-order flag for merged/deserialized sketches
+//!
+//! The serialization format is compatible with Apache DataSketches implementations
+//! in Java and C++, enabling cross-platform sketch exchange.
+
+use std::hash::Hash;
+use std::io;
+
+use crate::hll::array4::Array4;
+use crate::hll::array6::Array6;
+use crate::hll::array8::Array8;
+use crate::hll::container::Container;
+use crate::hll::hash_set::HashSet;
+use crate::hll::list::List;
+use crate::hll::serialization::*;
+use crate::hll::{HllType, RESIZE_DENOM, RESIZE_NUMER, coupon};
+
+/// Current sketch mode
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum CurMode {
+    List = 0,
+    Set = 1,
+    Hll = 2,
+}
+
+#[derive(Debug, Clone)]
+pub struct HllSketch {
+    lg_config_k: u8,
+    mode: Mode,
+}

Review Comment:
   So ditto: add some docs and examples here, perhaps like https://apache.github.io/datasketches-java/8.0.0/org/apache/datasketches/hll/HllSketch.html
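   As a rough illustration of the kind of example the struct docs could carry, here is a toy model of the List/Set/HLL promotion rule described in the module docs above. It is a sketch under stated assumptions, not the code in this PR: `mode_for` and `LIST_LIMIT` are hypothetical names, the list threshold of 32 is read off the "~32 unique values" wording, and the K/8 limit is taken from the set-mode description. The real promotion logic may differ.

```rust
/// Storage mode of the toy model, mirroring the three modes in the module docs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CurMode {
    List,
    Set,
    Hll,
}

/// Assumed list capacity, taken from the "~32 unique values" wording (hypothetical).
const LIST_LIMIT: usize = 32;

/// Returns the mode a sketch with K = 2^lg_config_k would be in after seeing
/// `distinct` unique values, under the assumed thresholds.
/// `mode_for` is an illustrative helper, not part of the crate's API.
fn mode_for(distinct: usize, lg_config_k: u8) -> CurMode {
    let k = 1usize << lg_config_k;
    // Assumed: the set is promoted to the HLL array once it holds K/8 entries.
    let set_limit = k / 8;
    if distinct < LIST_LIMIT {
        CurMode::List
    } else if distinct < set_limit {
        CurMode::Set
    } else {
        CurMode::Hll
    }
}

fn main() {
    // With lg_config_k = 12, K = 4096 and K/8 = 512.
    for distinct in [10, 100, 512] {
        println!("{} distinct values -> {:?}", distinct, mode_for(distinct, 12));
    }
}
```

   Note that for small K (for example lg_config_k = 7, where K/8 = 16) this model skips set mode entirely and goes straight from list to HLL; whether the real sketch does the same is worth stating in the docs.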
