brancz opened a new issue, #6710:
URL: https://github.com/apache/arrow-rs/issues/6710

   **Describe the bug**
   
   When the writer is configured not to preserve dict IDs, regular IPC files 
are written incorrectly (streaming works fine). The file writer serializes the 
schema twice, but the current implementation uses the same dictionary tracker 
for both passes. This is not an issue when dictionary IDs are preserved, 
because in that case the dictionary tracker just passes through whatever is in 
the `dict_id` field of the `Field`. When dict IDs are not preserved, however, 
the tracker keeps assigning new dict IDs on the second pass, so the footer is 
serialized with dict IDs that are incorrect.
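
To illustrate the failure mode, here is a toy model in plain Rust. It is only a 
sketch: `ToyTracker`, `serialize_schema` and `write_schema_twice_shared` are 
hypothetical names, not the arrow-rs types, and the assumption that fresh IDs 
are handed out sequentially from 0 is mine.

```rust
/// Toy stand-in for a dictionary tracker (hypothetical, not the arrow-rs type).
struct ToyTracker {
    preserve_dict_id: bool,
    next_id: i64,
}

impl ToyTracker {
    fn new(preserve_dict_id: bool) -> Self {
        Self { preserve_dict_id, next_id: 0 }
    }

    /// Returns the ID that would be written for one dictionary field.
    fn assign_id(&mut self, field_dict_id: i64) -> i64 {
        if self.preserve_dict_id {
            // Pass through whatever the field already carries.
            field_dict_id
        } else {
            // Assumption: fresh IDs are assigned sequentially, starting at 0.
            let id = self.next_id;
            self.next_id += 1;
            id
        }
    }
}

/// "Serializes" a schema by returning the ID written for each dictionary field.
fn serialize_schema(field_dict_ids: &[i64], tracker: &mut ToyTracker) -> Vec<i64> {
    field_dict_ids.iter().map(|&id| tracker.assign_id(id)).collect()
}

/// Serializes the same schema twice (first message, then footer) while
/// sharing one tracker, which is the pattern described above.
fn write_schema_twice_shared(field_dict_ids: &[i64], preserve: bool) -> (Vec<i64>, Vec<i64>) {
    let mut tracker = ToyTracker::new(preserve);
    let message_ids = serialize_schema(field_dict_ids, &mut tracker);
    let footer_ids = serialize_schema(field_dict_ids, &mut tracker);
    (message_ids, footer_ids)
}
```

In this model, `write_schema_twice_shared(&[7, 9], true)` yields the same IDs 
for both passes, while with `false` the footer pass continues counting where 
the first pass stopped, so the two passes disagree.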
   
   **To Reproduce**
   
   Write any record that contains at least one dictionary to a file writer that 
is configured to not preserve dict IDs.
   
   ```
   // Build a struct array whose only child is a dictionary array.
   let inner: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
   let array = Arc::new(inner) as ArrayRef;

   let dctfield = Arc::new(Field::new("dict", array.data_type().clone(), false));

   let s = StructArray::from(vec![(dctfield, array)]);
   let struct_array = Arc::new(s) as ArrayRef;

   let schema = Arc::new(Schema::new(vec![Field::new(
       "struct",
       struct_array.data_type().clone(),
       false,
   )]));

   let batch = RecordBatch::try_new(schema, vec![struct_array]).unwrap();

   // Write the batch with a file writer that does not preserve dict IDs.
   let mut buf = Vec::new();
   let mut writer = crate::writer::FileWriter::try_new_with_options(
       &mut buf,
       batch.schema_ref(),
       IpcWriteOptions::default().with_preserve_dict_id(false),
   )
   .unwrap();
   writer.write(&batch).unwrap();
   writer.finish().unwrap();
   drop(writer);

   // Read the file back; the batch should round-trip unchanged.
   let mut reader = FileReader::try_new(std::io::Cursor::new(buf), None).unwrap();

   assert_eq!(batch, reader.next().unwrap().unwrap());
   ```
   
   **Expected behavior**
   
   Writing a record batch that contains a dictionary to an IPC file works when 
dict IDs are not preserved, and the batch reads back unchanged.
   
   **Additional context**
   
   I haven't studied the spec in detail, but it does seem odd to me that the 
schema is written twice to the IPC file (once as the first message, and once in 
the footer). As it stands, though, this can't be changed, because the dict IDs 
need to be assigned before writing the first record batch. It can only be 
changed once the preserve dict ID setting is removed and dict IDs are never 
preserved.
   
   The fix is simple: create a new dictionary tracker, with the same 
configuration as the first one, when the schema is written the second time. 
It's a 3-line fix that I already have, but I wanted to open this issue for 
tracking purposes.
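
As a rough sketch of that idea (a toy model, not the actual patch): 
`ToyIdTracker`, `assign_ids` and `write_schema_twice_fresh` are hypothetical 
names, and sequential IDs starting at 0 are an assumption. The point is only 
that a fresh tracker for the footer pass reproduces the IDs of the first pass.

```rust
/// Hypothetical counter that hands out dictionary IDs in order (toy model).
struct ToyIdTracker {
    next_id: i64,
}

impl ToyIdTracker {
    fn new() -> Self {
        Self { next_id: 0 }
    }

    /// Assumption: IDs are assigned sequentially, starting at 0.
    fn next_dict_id(&mut self) -> i64 {
        let id = self.next_id;
        self.next_id += 1;
        id
    }
}

/// Assigns one ID per dictionary field encountered in a schema pass.
fn assign_ids(num_dict_fields: usize, tracker: &mut ToyIdTracker) -> Vec<i64> {
    (0..num_dict_fields).map(|_| tracker.next_dict_id()).collect()
}

/// Serializes the schema twice, creating a new tracker for the footer pass
/// instead of reusing the one from the first message.
fn write_schema_twice_fresh(num_dict_fields: usize) -> (Vec<i64>, Vec<i64>) {
    let mut message_tracker = ToyIdTracker::new();
    let message_ids = assign_ids(num_dict_fields, &mut message_tracker);

    // Fresh tracker with the same configuration for the second pass.
    let mut footer_tracker = ToyIdTracker::new();
    let footer_ids = assign_ids(num_dict_fields, &mut footer_tracker);

    (message_ids, footer_ids)
}
```

With two dictionary fields, both passes in this model produce the same IDs, so 
the footer agrees with the first schema message.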
   
   @tustvold @alamb 

