pchintar opened a new pull request, #9776:
URL: https://github.com/apache/arrow-rs/pull/9776

   # Which issue does this PR close?
   
   - Closes #9775 .
   
   # Rationale for this change
   
   Delta dictionaries are currently handled in the reader as:
   
   ```rust
   let combined = concat::concat(&[existing, &dict_values])?;
   dictionaries_by_id.insert(dict_id, combined);
   ```
   
   This executes for every `DictionaryBatch` with `isDelta = true`, even before 
any `RecordBatch` consumes the dictionary. So, for a sequence of delta updates:
   
   ```text
   delta 1 -> copy 1
   delta 2 -> copy 2
   delta 3 -> copy 3
   ...
   ```
   
   Total work:
   
   ```text
   O(1 + 2 + 3 + ... + N) = O(N²)
   ```
   
   where `N` is the number of accumulated dictionary elements.
   
   # What changes are included in this PR?
   
   Defer dictionary materialization until it is required for record batch 
decoding.
   Maintain:
   
   ```rust
   HashMap<i64, ArrayRef>          // materialized dictionaries
   HashMap<i64, Vec<ArrayRef>>     // pending delta chunks
   ```
   
   Behavior:
   
   * Non-delta dictionary
   
     * replace the materialized dictionary
     * clear pending deltas for that id
   
   * Delta dictionary
   
     * append the delta chunk to the pending list
     * do not concatenate immediately
   
   * Before `RecordBatch` decode
   
     * concatenate:
   
       ```text
       existing + pending deltas
       ```
     * update the materialized dictionary map
     * clear the pending list
   
   ---
   
   ## Complexity Improvement
   
   ### Before
   
   ```text
   Per delta:
       concat(existing, delta)
   
   Total:
       O(1 + 2 + 3 + ... + N) = O(N²)
   ```
   
   ### After
   
   ```text
   Per delta:
       append reference
   
   Before decode:
       single concat
   
   Total:
       O(N)
   ```
   
   # Are these changes tested?
   
   ### Modified components:
   
   * `arrow-ipc/src/reader.rs`
   
     * `StreamReader`
     * `FileDecoder`
   * `arrow-ipc/src/reader/stream.rs`
   
     * `StreamDecoder`
   
   ### Tests:
   
   * Added:
   
     * `arrow-ipc/src/tests/delta_dictionary.rs`
   
       * `test_multiple_deltas_before_record_batch`
   * Ensures correct behavior when multiple delta updates occur before record 
batch decoding
   * Benchmarks and examples updated to handle the `&mut self` requirement 
introduced by `FileDecoder::read_record_batch`
   -->
   
   # Are there any user-facing changes?
   
   Internal only. No public API or IPC format changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to