devanbenz commented on code in PR #10153:
URL: https://github.com/apache/arrow-rs/pull/10153#discussion_r3425037947


##########
parquet/src/encodings/fsst.rs:
##########
@@ -0,0 +1,443 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! FSST (Fast Static Symbol Table) string compression.
+//!
+//! FSST compresses short strings by replacing frequently occurring substrings
+//! (up to [`FSST_MAX_SYMBOL_LEN`] bytes) with single-byte codes drawn from a
+//! small static symbol table. Because the table is static for a block of
+//! values, individual compressed strings can be decompressed independently,
+//! which supports random access into the encoded data.
+//!
+//! A compressed stream is a sequence of 1-byte codes. A code in `0..n_symbols`
+//! expands to the symbol with that index; the reserved code [`FSST_ESCAPE`]
+//! (`255`) indicates that the following byte is an uncompressed literal. The
+//! symbol table is therefore limited to [`FSST_MAX_SYMBOLS`] entries.
+//!
+//! The symbol table is built by [`SymbolTable::train`], which implements the
+//! generational construction from the [FSST paper] ยง4.3: starting from an 
empty
+//! table, it repeatedly compresses a sample, counts how often each symbol and
+//! each pair of adjacent symbols occurs, and rebuilds the table from the
+//! highest-gain symbols (and concatenations of adjacent pairs).
+//!
+//! [FSST paper]: https://www.vldb.org/pvldb/vol13/p2649-boncz.pdf
+
+use std::collections::HashMap;
+
+use crate::errors::{ParquetError, Result};
+
+/// Escape code: the byte immediately following it is an uncompressed literal.
+pub(crate) const FSST_ESCAPE: u8 = 255;
+
+/// Maximum number of symbols in a table. Codes `0..=254` address symbols;
+/// code `255` ([`FSST_ESCAPE`]) is reserved for escaped literals.
+pub(crate) const FSST_MAX_SYMBOLS: usize = 255;
+
+/// Maximum length, in bytes, of a single symbol.
+pub(crate) const FSST_MAX_SYMBOL_LEN: usize = 8;
+
+/// FSST symbol-table format version, stored in the upper 32 bits of the 
leading
+/// `u64` of a serialized symbol table.
+const FSST_VERSION: u64 = 20190218;
+
+/// Fixed-size prelude of a serialized symbol table: version (`u64`) +
+/// zero-terminated flag (`u8`) + length histogram (`[u8; 8]`).
+const FSST_SYMBOL_TABLE_PRELUDE_LEN: usize = 8 + 1 + 8;
+
+/// Identifier written in the page header for the encoding used by the length
+/// array. Matches `Encoding::DELTA_BINARY_PACKED`'s discriminant.
+pub(crate) const FSST_LENGTH_ENCODING_DELTA: u8 = 5;
+
+/// Number of passes used to grow the symbol table during training.
+const TRAINING_GENERATIONS: usize = 5;
+
+/// Cap on the number of sample bytes inspected per training generation, to 
keep
+/// training time bounded on large pages.
+const TRAINING_SAMPLE_BUDGET: usize = 1 << 20;
+
+/// Pack up to 8 bytes into a `u64` key (little-endian, zero-padded).
+fn pack(bytes: &[u8]) -> u64 {
+    debug_assert!(bytes.len() <= 8);
+    let mut buf = [0u8; 8];
+    buf[..bytes.len()].copy_from_slice(bytes);
+    u64::from_le_bytes(buf)
+}
+
+/// Append `value` to `out` as unsigned LEB128.
+pub(crate) fn write_uleb128(out: &mut Vec<u8>, mut value: u64) {

Review Comment:
   leb128 is likely used else-where in arrow... maybe this is already or can be 
a shared utility. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to