JakeDern opened a new issue, #9366: URL: https://github.com/apache/arrow-rs/issues/9366
**Describe the bug**

Concatenating two dictionary arrays whose total cardinality equals the maximum possible for the key size panics for some value types. For example, concatenating two `Dictionary<u8, FixedSizeBinary(8)>` arrays that each have 128 distinct values incorrectly panics with `DictionaryKeyOverflowError`, whereas doing the same with `Dictionary<u8, u32>` does not.

I believe the error happens on this line, which is only hit for certain value types: https://github.com/apache/arrow-rs/blob/fb775011f9e98f7eb84c8df006f8bd9e040ec505/arrow-data/src/transform/mod.rs#L197C1-L198C1

It looks like it tracks the possible _next_ offset and tries to cast it into the current key type. Here that is 128 (current offset) + 128 (current length) = 256, which doesn't fit into a u8, even though the largest key actually needed is 255.

**To Reproduce**

```rust
#[test]
fn test_dict_overflow() {
    use arrow::array::{DictionaryArray, FixedSizeBinaryArray, UInt8Array};
    use arrow::buffer::Buffer;
    use arrow::compute::kernels::concat;
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;
    use std::sync::Arc;

    let schema = Arc::new(Schema::new(vec![Field::new(
        "a",
        DataType::Dictionary(
            Box::new(DataType::UInt8),
            Box::new(DataType::FixedSizeBinary(8)),
        ),
        false,
    )]));

    // First batch: keys 0..128 pointing at 128 distinct 8-byte values.
    let keys1 = UInt8Array::from_iter_values(0..128);
    let values1: Vec<_> = (0u64..128u64).flat_map(|i| i.to_le_bytes()).collect();
    let buffer = Buffer::from_vec(values1);
    let array = FixedSizeBinaryArray::try_new(8, buffer, None).unwrap();
    let array = DictionaryArray::try_new(keys1, Arc::new(array)).unwrap();
    let batch1 = RecordBatch::try_new(schema.clone(), vec![Arc::new(array)]).unwrap();

    // Second batch: another 128 distinct values, for 256 in total.
    let keys2 = UInt8Array::from_iter_values(0..128);
    let values2: Vec<_> = (128u64..256u64).flat_map(|i| i.to_le_bytes()).collect();
    let buffer = Buffer::from_vec(values2);
    let array = FixedSizeBinaryArray::try_new(8, buffer, None).unwrap();
    let array = DictionaryArray::try_new(keys2, Arc::new(array)).unwrap();
    let batch2 = RecordBatch::try_new(schema.clone(), vec![Arc::new(array)]).unwrap();

    // Panics with DictionaryKeyOverflowError.
    _ = concat::concat_batches(&schema, &[batch1, batch2]).unwrap();
}
```

**Expected behavior**

The dictionary should be constructed without error, since a u8 key can address 256 unique values. Running the same test with a different value type, such as u32, works.

**Additional context**

I also observed this with u16 keys, and it would presumably happen with other integer key types as well.

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
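
The suspected off-by-one in the report (128 + 128 = 256 not fitting in a u8) can be illustrated with a minimal, standalone sketch. This is not the arrow-rs code: both helper names are hypothetical and only contrast checking the _next_ offset against checking the largest key that is actually written, assuming u8 keys.

```rust
use std::convert::TryFrom;

/// Hypothetical check: does the *next* offset (offset + len) fit in a u8 key?
/// For offset = 128 and len = 128 this asks whether 256 fits, which it does not.
fn next_offset_fits_u8(offset: usize, len: usize) -> bool {
    u8::try_from(offset + len).is_ok()
}

/// Hypothetical check: does the largest key actually used (offset + len - 1) fit?
/// For offset = 128 and len = 128 this asks whether 255 fits, which it does.
fn max_key_fits_u8(offset: usize, len: usize) -> bool {
    len == 0 || u8::try_from(offset + len - 1).is_ok()
}
```

With `offset = 128` and `len = 128`, the first check rejects the input while the second accepts it, which matches the behavior described in the report if the overflow check is performed on the next offset rather than on the last key.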
