asuresh8 opened a new pull request, #9520:
URL: https://github.com/apache/arrow-rs/pull/9520

   ## Which issue does this PR close?
   
   Related to the optimization opportunity noted in 
https://github.com/apache/arrow-rs/pull/2322 (specifically [this 
comment](https://github.com/apache/arrow-rs/pull/2322#discussion_r920798604) by 
@tustvold suggesting that DictionaryArray inputs could avoid re-hashing 
dictionary values).
   
   ## Rationale for this change
   
   When writing a `DictionaryArray` column with dictionary encoding enabled, 
the current `ByteArrayEncoder` interns every row's string value individually 
via `self.interner.intern(value.as_ref())`. This performs **O(N) hash 
operations** where N is the number of rows -- even though the input 
`DictionaryArray` already has all unique values pre-indexed with only D unique 
dictionary entries (D << N for low-cardinality data).
   
   This is wasteful for the common case of categorical columns (status codes, 
country codes, enum-like strings, etc.) where D might be 5-50 while N is 
thousands or millions.
   
   **Why fewer hashes matter**: Each `intern()` call performs a full ahash 
computation of the byte slice plus a hashbrown table probe. For a column with 
15 unique values across 4,096 rows, the existing code performs 4,096 hash+lookup 
operations. With this optimization, it performs only 15 hash+lookup operations 
plus 4,096 simple `Vec` index lookups -- effectively free compared to hashing.
   
   The cost scales linearly with N. At millions of rows (common in analytics 
workloads), the redundant hashing becomes the dominant cost of dictionary 
encoding for low-cardinality string columns.
   
   ## What changes are included in this PR?
   
   A new `encode_with_remap` method on `DictEncoder` that builds a **lazy remap 
table** of size O(D):
   
   1. Allocate a `Vec<Option<u64>>` of length D (the dictionary size)
   2. For each row, extract the dictionary key (a simple integer index lookup)
   3. Check `remap[key]`: if `Some(interned_id)`, use the cached value (no 
hash). If `None`, intern the value once (one hash), cache the result in 
`remap[key]`
   4. Each unique dictionary value is interned exactly once
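   The four steps can be sketched as follows. This is a simplified, 
self-contained model of the idea -- `Interner` and `encode_with_remap` here are 
toy stand-ins, not the actual arrow-rs types or signatures:

   ```rust
   use std::collections::HashMap;

   /// Toy interner standing in for the Parquet dictionary interner:
   /// maps byte strings to stable ids, hashing on every `intern` call.
   #[derive(Default)]
   struct Interner {
       map: HashMap<Vec<u8>, u64>,
       hash_ops: usize, // counts hash+probe operations, for illustration
   }

   impl Interner {
       fn intern(&mut self, value: &[u8]) -> u64 {
           self.hash_ops += 1; // every call pays one hash + table probe
           let next = self.map.len() as u64;
           *self.map.entry(value.to_vec()).or_insert(next)
       }
   }

   /// Steps 1-4: `values[keys[i]]` is row i's string. Each distinct
   /// dictionary key is interned exactly once; repeats hit the O(D) table.
   fn encode_with_remap(interner: &mut Interner, values: &[&[u8]], keys: &[usize]) -> Vec<u64> {
       let mut remap: Vec<Option<u64>> = vec![None; values.len()]; // step 1
       keys.iter()
           .map(|&key| match remap[key] {
               Some(id) => id, // step 3: cached, no hash
               None => {
                   let id = interner.intern(values[key]); // one hash per unique value
                   remap[key] = Some(id);
                   id
               }
           })
           .collect()
   }
   ```

   Because repeats are resolved by a plain `Vec` index, the hash count drops 
from N to D while the emitted ids are unchanged.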
   
   The optimization activates in `write_gather` when all of these hold:
   - The input is a `DictionaryArray` (checked via `data_type()`)
   - Dictionary encoding is enabled (`self.dict_encoder.is_some()`)
   - The dictionary is low-cardinality: `D <= N/2` (the remap table overhead is 
not worthwhile when most keys are unique)
   - No geo statistics accumulator is active (geo stats need per-value 
processing)
   
   When any condition is not met, the code falls through to the existing 
`encode` path with zero overhead.
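   The gate can be pictured like this. Field and function names below are 
placeholders for illustration, not the actual arrow-rs internals:

   ```rust
   /// Illustrative gate mirroring the four conditions above.
   struct EncoderState {
       dict_encoding_enabled: bool, // `self.dict_encoder.is_some()` in the PR
       geo_stats_active: bool,      // geo stats need per-value processing
   }

   fn use_remap_fast_path(
       state: &EncoderState,
       input_is_dictionary: bool, // checked via `data_type()` in the PR
       dict_len: usize,           // D, unique dictionary entries
       num_rows: usize,           // N
   ) -> bool {
       input_is_dictionary
           && state.dict_encoding_enabled
           && dict_len <= num_rows / 2 // low cardinality: D <= N/2
           && !state.geo_stats_active
       // false => fall through to the existing `encode` path
   }
   ```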
   
   **Output is byte-identical** to the existing path. The remap table produces 
the same interned IDs in the same order -- it is purely a caching optimization 
that avoids redundant hash operations.
   
   ### Benchmark results
   
   New benchmark `string_dictionary_low_cardinality` -- 4,096 rows, 15 unique 
string values (simulating categorical columns):
   
   | Configuration | Before | After | Change |
   |---|---|---|---|
   | default | 53.1 us | 32.3 us | **-39%** time, +67% throughput |
   | bloom_filter | 85.3 us | 50.9 us | **-40%** time, +70% throughput |
   | parquet_2 | 54.8 us | 34.1 us | **-38%** time, +64% throughput |
   | zstd | 58.4 us | 37.6 us | **-36%** time, +56% throughput |
   | zstd_parquet_2 | 59.3 us | 38.9 us | **-35%** time, +53% throughput |
   
   Existing `string_dictionary` benchmark (high-cardinality, random data): **no 
change in performance detected** (p > 0.05 for all configurations), confirming 
zero regression on inputs where the optimization does not activate.
   
   ## Are these changes tested?
   
   **Unit tests** (in `byte_array.rs`):
   - `test_dict_passthrough_roundtrip` -- basic low-cardinality DictionaryArray 
write+read
   - `test_dict_passthrough_roundtrip_to_plain` -- DictionaryArray input read 
back as plain StringArray
   - `test_dict_passthrough_data_equivalence` -- byte-identical output between 
Dict and plain paths
   - `test_dict_passthrough_null_keys` -- DictionaryArray with null keys
   - `test_dict_passthrough_mixed_batches` -- DictionaryArray then StringArray 
for same column writer
   - `test_dict_passthrough_multiple_row_groups` -- multiple row groups with 
separate dictionaries
   - `test_dict_passthrough_statistics_correctness` -- min/max statistics match 
between Dict and plain paths
   - `test_dict_passthrough_high_cardinality` -- high-cardinality dict with 
small page size limit (fallback path)
   
   **Integration tests** (new file `parquet/tests/arrow_writer_dictionary.rs`):
   - `dictionary_roundtrip_low_cardinality` -- 4,096-row write+read roundtrip 
through public API
   - `dictionary_and_plain_columns_roundtrip` -- mixed DictionaryArray + 
StringArray columns in same batch
   - `dictionary_statistics_match_plain` -- statistics from Dict path match 
plain StringArray path
   - `dictionary_multi_row_group_roundtrip` -- multi-row-group write+read with 
DictionaryArray
   - `dictionary_with_nulls_roundtrip` -- DictionaryArray with null values 
through public API
   
   ## Are there any user-facing changes?
   
   No API changes. This is a transparent performance improvement for any user 
writing `DictionaryArray` columns with dictionary encoding enabled (the 
default). The optimization activates automatically for low-cardinality 
dictionaries and produces byte-identical output.
   

