alamb commented on issue #7200: URL: https://github.com/apache/arrow-datafusion/issues/7200#issuecomment-1692201194
@JayjeetAtGithub I was thinking about this issue after some analysis I did on https://github.com/influxdata/influxdb_iox/issues/8568. My observation is that the `RowConverter` memory consumption explodes for high cardinality dictionaries *wherever* it is used (not just in merge). Now that I type it out, it seems obvious 😆

Thus it seems like it might be a good pattern to encapsulate / reuse the logic with some sort of wrapper around the row converter. Maybe something like:

```rust
use arrow_array::ArrayRef;
use arrow_row::{RowConverter, Rows};
use arrow_schema::ArrowError;

/// Wrapper around a `RowConverter` that automatically
/// picks an appropriate dictionary encoding
struct DataFusionRowConverter {
    inner: Option<RowConverter>,
}

impl DataFusionRowConverter {
    pub fn convert_columns(
        &mut self,
        columns: &[ArrayRef],
    ) -> Result<Rows, ArrowError> {
        if self.inner.is_none() {
            // Check the arrays, detect high cardinality dictionaries,
            // and fall back to normal encoding for that case
        }
        // After the first batch, use the pre-configured row converter
        self.inner.as_mut().unwrap().convert_columns(columns)
    }
}
```
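For illustration, here is one untested way that first-batch check could be fleshed out. The `HIGH_CARDINALITY_THRESHOLD` cutoff is a made-up placeholder, and the "fall back to normal encoding" step is modeled as casting high cardinality dictionary columns to their value type before handing them to the `RowConverter`, so the encoder never materializes a giant per-dictionary mapping:

```rust
use std::sync::Arc;

use arrow_array::{downcast_dictionary_array, Array, ArrayRef};
use arrow_cast::cast;
use arrow_row::{RowConverter, Rows, SortField};
use arrow_schema::{ArrowError, DataType};

/// Placeholder cutoff (arbitrary): dictionaries with more distinct values
/// than this are assumed not to benefit from dictionary-preserving encoding
const HIGH_CARDINALITY_THRESHOLD: usize = 10 * 1024;

#[derive(Default)]
struct DataFusionRowConverter {
    inner: Option<RowConverter>,
    /// Per column: the type to cast to before conversion
    /// (`None` means pass the column through unchanged)
    cast_to: Vec<Option<DataType>>,
}

impl DataFusionRowConverter {
    pub fn convert_columns(
        &mut self,
        columns: &[ArrayRef],
    ) -> Result<Rows, ArrowError> {
        if self.inner.is_none() {
            // First batch: pick an encoding per column. High cardinality
            // dictionary columns are flagged to be cast to their value type.
            let mut fields = Vec::with_capacity(columns.len());
            self.cast_to = vec![None; columns.len()];
            for (i, col) in columns.iter().enumerate() {
                let col = col.as_ref();
                let field_type = downcast_dictionary_array!(
                    col => {
                        if col.values().len() > HIGH_CARDINALITY_THRESHOLD {
                            let value_type = col.values().data_type().clone();
                            self.cast_to[i] = Some(value_type.clone());
                            value_type
                        } else {
                            col.data_type().clone()
                        }
                    },
                    _ => col.data_type().clone()
                );
                fields.push(SortField::new(field_type));
            }
            self.inner = Some(RowConverter::new(fields)?);
        }

        // Cast any columns flagged as high cardinality, then delegate
        // to the pre-configured converter
        let columns = columns
            .iter()
            .zip(&self.cast_to)
            .map(|(col, target)| match target {
                Some(t) => cast(col.as_ref(), t),
                None => Ok(Arc::clone(col)),
            })
            .collect::<Result<Vec<_>, ArrowError>>()?;

        self.inner.as_mut().unwrap().convert_columns(&columns)
    }
}
```

With something shaped like this, merge, grouping, and window code could all share the same heuristic instead of each reimplementing it.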
