pchintar opened a new pull request, #9776:
URL: https://github.com/apache/arrow-rs/pull/9776
# Which issue does this PR close?
- Closes #9775 .
# Rationale for this change
Delta dictionaries are currently handled in the reader as:
```rust
let combined = concat::concat(&[existing, &dict_values])?;
dictionaries_by_id.insert(dict_id, combined);
```
This executes for every `DictionaryBatch` with `isDelta = true`, even before
any `RecordBatch` consumes the dictionary. So, for a sequence of delta updates:
```text
delta 1 -> copy 1
delta 2 -> copy 2
delta 3 -> copy 3
...
```
Total work:
```text
O(1 + 2 + 3 + ... + N) = O(N²)
```
where `N` is the number of accumulated dictionary elements.
# What changes are included in this PR?
Defer dictionary materialization until it is required for record batch
decoding.
Maintain:
```rust
HashMap<i64, ArrayRef> // materialized dictionaries
HashMap<i64, Vec<ArrayRef>> // pending delta chunks
```
Behavior:
* Non-delta dictionary
* replace the materialized dictionary
* clear pending deltas for that id
* Delta dictionary
* append the delta chunk to the pending list
* do not concatenate immediately
* Before `RecordBatch` decode
* concatenate:
```text
existing + pending deltas
```
* update the materialized dictionary map
* clear the pending list
---
## Complexity Improvement
### Before
```text
Per delta:
concat(existing, delta)
Total:
O(1 + 2 + 3 + ... + N) = O(N²)
```
### After
```text
Per delta:
append reference
Before decode:
single concat
Total:
O(N)
```
# Are these changes tested?
### Modified components:
* `arrow-ipc/src/reader.rs`
* `StreamReader`
* `FileDecoder`
* `arrow-ipc/src/reader/stream.rs`
* `StreamDecoder`
### Tests:
* Added:
* `arrow-ipc/src/tests/delta_dictionary.rs`
* `test_multiple_deltas_before_record_batch`
* Ensures correct behavior when multiple delta updates occur before record
batch decoding
* Benchmarks and examples updated to handle the `&mut self` requirement
introduced by `FileDecoder::read_record_batch`
-->
# Are there any user-facing changes?
Internal only. No public API or IPC format changes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]