pchintar opened a new issue, #9775:
URL: https://github.com/apache/arrow-rs/issues/9775

   ## Description
   
   The current IPC reader eagerly materializes dictionary deltas, which can 
lead to quadratic copying behavior when dictionaries are updated incrementally.
   
   In `reader.rs`, delta dictionaries are currently handled as:
   
   ```rust
   let combined = concat::concat(&[existing, &dict_values])?;
   dictionaries_by_id.insert(dict_id, combined);
   ```
   
   This is invoked for every `DictionaryBatch` with `isDelta = true`, before 
any record batch consumes the dictionary.
   
   ---
   
   ## Problem
   
   For a sequence of delta updates, the reader repeatedly concatenates the full 
existing dictionary with each new delta. This results in cumulative copying 
proportional to:
   
   ```
   O(1 + 2 + 3 + ... + N) = O(N²)
   ```
   
   where N is the number of accumulated dictionary elements.
   
   This behavior is not theoretical. The writer supports delta dictionaries 
(`DictionaryHandling::Delta`), and the reader processes them eagerly on every 
update. As a result, large or frequently updated dictionaries can incur 
significant and unnecessary memory and CPU overhead.
   
   ---
   
   ## Root Cause
   
   The reader performs eager materialization of the full dictionary on every 
delta update, even though only the final accumulated dictionary is required at 
record batch decode time.
   
   ---
   
   ## Proposed Solution
   
   Defer dictionary materialization until it is required for record batch 
decoding.
   
   Specifically:
   
   * Maintain the existing `HashMap<i64, ArrayRef>` for materialized 
dictionaries.
   * Introduce a separate internal structure to store pending delta chunks:
   
     ```
     HashMap<i64, Vec<ArrayRef>>
     ```
   * On delta dictionary updates:
   
     * append the delta chunk to the pending list
     * do not perform concatenation
   * Before decoding a `RecordBatch`:
   
     * materialize the dictionary by concatenating the existing dictionary with 
all pending chunks
     * update the materialized dictionary map
     * clear the pending chunks
   
   ---
   
   ## Expected Impact
   
   This changes the accumulation cost from:
   
   * repeated full dictionary concatenation per delta (quadratic behavior)
   
   to:
   
   * append-only delta accumulation with a single concatenation per decode 
(linear behavior)
   
   This reduces:
   
   * memory copying
   * allocation overhead
   * CPU time for dictionary-heavy workloads


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to