alamb commented on issue #7200: URL: https://github.com/apache/arrow-datafusion/issues/7200#issuecomment-1692201194
@JayjeetAtGithub I was thinking about this issue after some analysis I did on https://github.com/influxdata/influxdb_iox/issues/8568. My observation is that the `RowConverter` memory consumption explodes for high cardinality dictionaries *wherever* it is used (not just in merge). Now that I type it out, it seems obvious 😆

Thus it seems like it might be a good pattern to encapsulate / reuse the logic with some sort of wrapper around the row converter. Maybe something like:

```rust
use arrow_array::ArrayRef;
use arrow_row::{RowConverter, Rows};
use arrow_schema::ArrowError;

/// Wrapper around a `RowConverter` that automatically
/// picks an appropriate dictionary encoding
struct DataFusionRowConverter {
    inner: Option<RowConverter>,
}

impl DataFusionRowConverter {
    pub fn convert_columns(
        &mut self,
        columns: &[ArrayRef],
    ) -> Result<Rows, ArrowError> {
        if self.inner.is_none() {
            // Check the arrays, detect high cardinality dictionaries,
            // and fall back to normal encoding for that case
        }
        // After the first batch, use the pre-configured row converter
        self.inner.as_mut().unwrap().convert_columns(columns)
    }
}
```
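For illustration, here is one untested way that first-batch check could be fleshed out. The `HIGH_CARDINALITY_THRESHOLD` cutoff is a made-up placeholder, and the "fall back to normal encoding" step is modeled as casting high cardinality dictionary columns to their value type before handing them to the `RowConverter`, so the encoder never materializes a giant per-dictionary mapping:

```rust
use std::sync::Arc;

use arrow_array::{downcast_dictionary_array, Array, ArrayRef};
use arrow_cast::cast;
use arrow_row::{RowConverter, Rows, SortField};
use arrow_schema::{ArrowError, DataType};

/// Placeholder cutoff (arbitrary): dictionaries with more distinct values
/// than this are assumed not to benefit from dictionary-preserving encoding
const HIGH_CARDINALITY_THRESHOLD: usize = 10 * 1024;

#[derive(Default)]
struct DataFusionRowConverter {
    inner: Option<RowConverter>,
    /// Per column: the type to cast to before conversion
    /// (`None` means pass the column through unchanged)
    cast_to: Vec<Option<DataType>>,
}

impl DataFusionRowConverter {
    pub fn convert_columns(
        &mut self,
        columns: &[ArrayRef],
    ) -> Result<Rows, ArrowError> {
        if self.inner.is_none() {
            // First batch: pick an encoding per column. High cardinality
            // dictionary columns are flagged to be cast to their value type.
            let mut fields = Vec::with_capacity(columns.len());
            self.cast_to = vec![None; columns.len()];
            for (i, col) in columns.iter().enumerate() {
                let col = col.as_ref();
                let field_type = downcast_dictionary_array!(
                    col => {
                        if col.values().len() > HIGH_CARDINALITY_THRESHOLD {
                            let value_type = col.values().data_type().clone();
                            self.cast_to[i] = Some(value_type.clone());
                            value_type
                        } else {
                            col.data_type().clone()
                        }
                    },
                    _ => col.data_type().clone()
                );
                fields.push(SortField::new(field_type));
            }
            self.inner = Some(RowConverter::new(fields)?);
        }

        // Cast any columns flagged as high cardinality, then delegate
        // to the pre-configured converter
        let columns = columns
            .iter()
            .zip(&self.cast_to)
            .map(|(col, target)| match target {
                Some(t) => cast(col.as_ref(), t),
                None => Ok(Arc::clone(col)),
            })
            .collect::<Result<Vec<_>, ArrowError>>()?;

        self.inner.as_mut().unwrap().convert_columns(&columns)
    }
}
```

With something shaped like this, merge, grouping, and window code could all share the same heuristic instead of each reimplementing it.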
