albertlockett opened a new issue, #8339:
URL: https://github.com/apache/arrow-rs/issues/8339

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   We're writing some code that combines multiple `RecordBatch`s using the 
`BatchCoalescer`.
   
   Before we concatenate, we compute the smallest dictionary key type that can 
hold each dictionary-encoded column in the `RecordBatch`. For example, if two 
batches both use a `u8` key but the combined cardinality of the column would 
overflow it, we switch the column to a dictionary keyed by `u16` before 
concatenating.
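
As a rough illustration of the key-sizing step described above, here is a minimal, std-only sketch. `KeyWidth` and `smallest_key_width` are hypothetical names made up for this example; they are not arrow-rs APIs, and the real code would map the result onto an Arrow `DataType`.

```rust
/// Hypothetical enum naming the unsigned dictionary key widths considered here.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum KeyWidth {
    U8,
    U16,
    U32,
    U64,
}

/// Returns the narrowest unsigned key width that can index `cardinality`
/// distinct dictionary values.
pub fn smallest_key_width(cardinality: u64) -> KeyWidth {
    // An N-bit unsigned key can address 2^N slots (indices 0..=2^N - 1).
    if cardinality <= 1u64 << 8 {
        KeyWidth::U8
    } else if cardinality <= 1u64 << 16 {
        KeyWidth::U16
    } else if cardinality <= 1u64 << 32 {
        KeyWidth::U32
    } else {
        KeyWidth::U64
    }
}
```

With this model, 256 distinct values still fit a `u8` key, while 257 (as in the reproduction further down) need `u16`.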
   
   To do this, we're inspecting the columns to compute the total cardinality. 
However, we're thinking it might be faster in a lot of cases to just 
optimistically try to concat the `RecordBatch`s, and if it fails, then compute 
the right-sized dict key for the column and retry.
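
The optimistic strategy can be sketched with a toy model that does not depend on arrow-rs at all. Everything here is an assumption for illustration: `KeyOverflow`, `concat_values`, and `concat_optimistic` are hypothetical, and dictionaries are modelled as plain `Vec<String>` values rather than Arrow arrays.

```rust
/// Hypothetical stand-in for a dictionary key overflow error.
#[derive(Debug, PartialEq)]
pub struct KeyOverflow;

/// Toy concatenation: merges dictionary values (deduplicated) under a key
/// type that can address at most `max_slots` entries.
pub fn concat_values(
    batches: &[Vec<String>],
    max_slots: usize,
) -> Result<Vec<String>, KeyOverflow> {
    let mut merged: Vec<String> = Vec::new();
    for values in batches {
        for v in values {
            if !merged.contains(v) {
                if merged.len() == max_slots {
                    // No free key left for a new distinct value.
                    return Err(KeyOverflow);
                }
                merged.push(v.clone());
            }
        }
    }
    Ok(merged)
}

/// Try the narrow key first and widen only if the attempt overflows.
/// Returns the key width in bits that succeeded, plus the merged values.
pub fn concat_optimistic(
    batches: &[Vec<String>],
) -> Result<(u32, Vec<String>), KeyOverflow> {
    for key_bits in [8u32, 16] {
        if let Ok(merged) = concat_values(batches, 1usize << key_bits) {
            return Ok((key_bits, merged));
        }
    }
    Err(KeyOverflow)
}
```

In the common case the first attempt succeeds and no cardinality scan is needed; the extra work is only paid when the narrow key actually overflows.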
   
   Unfortunately, the `BatchCoalescer` returns an opaque error, so we can't 
tell which column's dictionary overflowed.
   
   ```rs
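           use std::sync::Arc;
           use arrow::array::{DictionaryArray, StringArray, UInt8Array};
           use arrow::datatypes::{DataType, Field, Schema};
           use arrow::record_batch::RecordBatch;

           let schema = Arc::new(Schema::new(vec![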
               Field::new("a", DataType::Dictionary(Box::new(DataType::UInt8), 
Box::new(DataType::Utf8)), true),
               Field::new("b", DataType::Dictionary(Box::new(DataType::UInt8), 
Box::new(DataType::Utf8)), true),
           ]));
           
           let mut dict1_avals = vec![];
           let mut dict1_akeys = vec![];
           for i in 0..256 {
               dict1_akeys.push(i as u8);
               dict1_avals.push(format!("{i}"))
           }
           let rb1 = RecordBatch::try_new(schema.clone(), vec![
               Arc::new(DictionaryArray::new(
                   UInt8Array::from_iter_values(dict1_akeys), 
                   Arc::new(StringArray::from_iter_values(dict1_avals))
               )),
               Arc::new(DictionaryArray::new(
                   UInt8Array::from_iter_values(vec![0;256]), 
                   Arc::new(StringArray::from_iter_values(vec!["b"]))
               )),
           ]).unwrap();
           
           let rb2 = RecordBatch::try_new(schema.clone(), vec![
               Arc::new(DictionaryArray::new(
                   UInt8Array::from_iter_values(vec![0]), 
                   Arc::new(StringArray::from_iter_values(vec!["a"])),
               )),
               Arc::new(DictionaryArray::new(
                   UInt8Array::from_iter_values(vec![0]), 
                   Arc::new(StringArray::from_iter_values(vec!["b"]))
               )),
           ]).unwrap();
           let mut batcher = arrow::compute::BatchCoalescer::new(
               schema.clone(), rb1.num_rows() + rb2.num_rows()
           );
           batcher.push_batch(rb1).unwrap();
           // panics with `DictionaryKeyOverflowError`
           batcher.push_batch(rb2).unwrap();
           batcher.finish_buffered_batch().unwrap();
           let result = batcher.next_completed_batch().unwrap();
   ```
   
   **Describe the solution you'd like**
   It would be nice if the error contained context about which column failed to 
concatenate.
   
   I think we could maybe augment the error returned from 
`InProgressArray::finish` here to include the column name:
   
https://github.com/apache/arrow-rs/blob/aa626e12de8bc0d0f56b5349239cae1be8d1a195/arrow-select/src/coalesce.rs#L488-L496
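
For illustration only, one possible shape for an error that carries the column name, sketched with std-only stand-in types. This does not reproduce the linked arrow-rs code; `ConcatError`, `ColumnError`, and `finish_columns` are hypothetical names, and the closure stands in for whatever finishes a single column.

```rust
use std::fmt;

/// Hypothetical stand-in for the underlying concatenation error.
#[derive(Debug, PartialEq)]
pub enum ConcatError {
    DictionaryKeyOverflow,
}

/// Hypothetical wrapper that records which column failed.
#[derive(Debug, PartialEq)]
pub struct ColumnError {
    pub column: String,
    pub source: ConcatError,
}

impl fmt::Display for ColumnError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self.source {
            ConcatError::DictionaryKeyOverflow => {
                write!(f, "column '{}': dictionary key overflow", self.column)
            }
        }
    }
}

impl std::error::Error for ColumnError {}

/// Finishes each column in order, attaching the column name to the first
/// error encountered. `finish` is called with the column index.
pub fn finish_columns<F>(names: &[&str], mut finish: F) -> Result<(), ColumnError>
where
    F: FnMut(usize) -> Result<(), ConcatError>,
{
    for (idx, name) in names.iter().enumerate() {
        finish(idx).map_err(|source| ColumnError {
            column: name.to_string(),
            source,
        })?;
    }
    Ok(())
}
```

A caller could then match on `column` to decide which dictionary to widen before retrying.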
   

