asuresh8 opened a new pull request, #9520: URL: https://github.com/apache/arrow-rs/pull/9520
## Which issue does this PR close?

Related to the optimization opportunity noted in https://github.com/apache/arrow-rs/pull/2322 (specifically [this comment](https://github.com/apache/arrow-rs/pull/2322#discussion_r920798604) by @tustvold, suggesting that `DictionaryArray` inputs could avoid re-hashing dictionary values).

## Rationale for this change

When writing a `DictionaryArray` column with dictionary encoding enabled, the current `ByteArrayEncoder` interns every row's string value individually via `self.interner.intern(value.as_ref())`. This performs **O(N) hash operations**, where N is the number of rows, even though the input `DictionaryArray` already has all unique values pre-indexed, with only D unique dictionary entries (D << N for low-cardinality data). This is wasteful for the common case of categorical columns (status codes, country codes, enum-like strings, etc.), where D might be 5-50 while N is thousands or millions.

**Why fewer hashes matter**: each `intern()` call performs a full ahash computation over the byte slice plus a hashbrown table probe. For a column with 15 unique values across 4,096 rows, the existing code performs 4,096 hash+lookup operations. With this optimization, it performs only 15 hash+lookup operations plus 4,096 simple `Vec` index lookups, which are effectively free compared to hashing. The cost scales linearly with N: at millions of rows (common in analytics workloads), the redundant hashing becomes the dominant cost of dictionary encoding for low-cardinality string columns.

## What changes are included in this PR?

A new `encode_with_remap` method on `DictEncoder` that builds a **lazy remap table** of size O(D):

1. Allocate a `Vec<Option<u64>>` of length D (the dictionary size).
2. For each row, extract the dictionary key (a simple integer index lookup).
3. Check `remap[key]`: if `Some(interned_id)`, use the cached value (no hash). If `None`, intern the value once (one hash) and cache the result in `remap[key]`.
4. Each unique dictionary value is interned exactly once.

The optimization activates in `write_gather` when all of these hold:

- The input is a `DictionaryArray` (checked via `data_type()`)
- Dictionary encoding is enabled (`self.dict_encoder.is_some()`)
- The dictionary is low-cardinality: `D <= N/2` (the remap table overhead is not worthwhile when most keys are unique)
- No geo statistics accumulator is active (geo stats need per-value processing)

When any condition is not met, the code falls through to the existing `encode` path with zero overhead.

**Output is byte-identical** to the existing path. The remap table produces the same interned IDs in the same order -- it is purely a caching optimization that avoids redundant hash operations.

### Benchmark results

New benchmark `string_dictionary_low_cardinality` -- 4,096 rows, 15 unique string values (simulating categorical columns):

| Configuration | Before | After | Change |
|---|---|---|---|
| default | 53.1 us | 32.3 us | **-39%** time, +67% throughput |
| bloom_filter | 85.3 us | 50.9 us | **-40%** time, +70% throughput |
| parquet_2 | 54.8 us | 34.1 us | **-38%** time, +64% throughput |
| zstd | 58.4 us | 37.6 us | **-36%** time, +56% throughput |
| zstd_parquet_2 | 59.3 us | 38.9 us | **-35%** time, +53% throughput |

The existing `string_dictionary` benchmark (high-cardinality, random data) shows **no detectable change in performance** (p > 0.05 for all configurations), confirming zero regression on inputs where the optimization does not activate.
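The remap logic described in steps 1-4 can be sketched as follows. This is a self-contained, std-only sketch: `ToyInterner` is a hypothetical `HashMap`-based stand-in for the crate's real interner, and all names here are illustrative, not the PR's actual code.

```rust
use std::collections::HashMap;

/// Toy stand-in for the real interner: maps a byte value to a stable u64 id,
/// performing one hash+probe per call. `hash_count` instruments how many
/// hash operations actually ran.
#[derive(Default)]
struct ToyInterner {
    ids: HashMap<Vec<u8>, u64>,
    hash_count: usize,
}

impl ToyInterner {
    fn intern(&mut self, value: &[u8]) -> u64 {
        self.hash_count += 1;
        let next = self.ids.len() as u64;
        *self.ids.entry(value.to_vec()).or_insert(next)
    }
}

/// Lazy remap table: intern each unique dictionary value at most once,
/// then resolve every row key through an O(D) cache instead of re-hashing.
fn encode_with_remap(
    interner: &mut ToyInterner,
    dict_values: &[&[u8]], // D unique values
    keys: &[usize],        // N row keys indexing into dict_values
) -> Vec<u64> {
    let mut remap: Vec<Option<u64>> = vec![None; dict_values.len()];
    keys.iter()
        .map(|&k| *remap[k].get_or_insert_with(|| interner.intern(dict_values[k])))
        .collect()
}

fn main() {
    let dict: Vec<&[u8]> = vec![b"ok".as_slice(), b"error".as_slice(), b"timeout".as_slice()];
    // 9 rows, 3 unique values: only 3 intern() calls should occur.
    let keys = [0, 1, 0, 2, 0, 1, 2, 0, 0];
    let mut interner = ToyInterner::default();
    let ids = encode_with_remap(&mut interner, &dict, &keys);
    // prints: ids = [0, 1, 0, 2, 0, 1, 2, 0, 0], hashes = 3
    println!("ids = {:?}, hashes = {}", ids, interner.hash_count);
}
```

With 9 rows and 3 unique values, the interner is hit 3 times rather than 9; the other 6 rows resolve through the `remap` cache with a plain index lookup.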
## Are these changes tested?

**Unit tests** (in `byte_array.rs`):

- `test_dict_passthrough_roundtrip` -- basic low-cardinality `DictionaryArray` write+read
- `test_dict_passthrough_roundtrip_to_plain` -- `DictionaryArray` input read back as a plain `StringArray`
- `test_dict_passthrough_data_equivalence` -- byte-identical output between the dictionary and plain paths
- `test_dict_passthrough_null_keys` -- `DictionaryArray` with null keys
- `test_dict_passthrough_mixed_batches` -- `DictionaryArray` then `StringArray` for the same column writer
- `test_dict_passthrough_multiple_row_groups` -- multiple row groups with separate dictionaries
- `test_dict_passthrough_statistics_correctness` -- min/max statistics match between the dictionary and plain paths
- `test_dict_passthrough_high_cardinality` -- high-cardinality dictionary with a small page size limit (fallback path)

**Integration tests** (new file `parquet/tests/arrow_writer_dictionary.rs`):

- `dictionary_roundtrip_low_cardinality` -- 4,096-row write+read roundtrip through the public API
- `dictionary_and_plain_columns_roundtrip` -- mixed `DictionaryArray` + `StringArray` columns in the same batch
- `dictionary_statistics_match_plain` -- statistics from the dictionary path match the plain `StringArray` path
- `dictionary_multi_row_group_roundtrip` -- multi-row-group write+read with `DictionaryArray`
- `dictionary_with_nulls_roundtrip` -- `DictionaryArray` with null values through the public API

## Are there any user-facing changes?

No API changes. This is a transparent performance improvement for any user writing `DictionaryArray` columns with dictionary encoding enabled (the default). The optimization activates automatically for low-cardinality dictionaries and produces byte-identical output.
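The "byte-identical output" property asserted above can be illustrated with a std-only sketch: interning every row (the existing path) and interning via a remap cache (the new path) yield the same id sequence. The `ToyInterner` here is a hypothetical `HashMap`-based stand-in for the real interner; the actual tests compare full Parquet bytes, not just interned ids.

```rust
use std::collections::HashMap;

/// Toy interner: stable ids assigned in first-seen order.
#[derive(Default)]
struct ToyInterner {
    ids: HashMap<Vec<u8>, u64>,
}

impl ToyInterner {
    fn intern(&mut self, v: &[u8]) -> u64 {
        let next = self.ids.len() as u64;
        *self.ids.entry(v.to_vec()).or_insert(next)
    }
}

fn main() {
    let dict: Vec<&[u8]> = vec![b"a".as_slice(), b"bb".as_slice(), b"ccc".as_slice()];
    let keys = [2, 0, 1, 2, 2, 0];

    // Existing path: intern every row value (one hash per row).
    let mut plain = ToyInterner::default();
    let naive: Vec<u64> = keys.iter().map(|&k| plain.intern(dict[k])).collect();

    // Remap path: intern each unique value once, cache by dictionary key.
    let mut cached = ToyInterner::default();
    let mut remap = vec![None; dict.len()];
    let remapped: Vec<u64> = keys
        .iter()
        .map(|&k| *remap[k].get_or_insert_with(|| cached.intern(dict[k])))
        .collect();

    // Same ids in the same order: the remap table is a pure caching layer.
    assert_eq!(naive, remapped);
    println!("{:?}", remapped);
}
```

Both paths assign ids in first-seen order, which is why the cache changes only the number of hash operations, not the encoded output.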
