brancz opened a new issue, #6710:
URL: https://github.com/apache/arrow-rs/issues/6710
**Describe the bug**
Not preserving the dict ID does not work with regular IPC files (streaming
works fine). This is because in the file writer, the schema is serialized
twice, however, in the current implementation the same dictionary tracker is
used for both iterations. This is not an issue when dictionary IDs are
preserved, because in that case the dictionary tracker just passes through
whatever is the in dict_id field in the `Field`, however, when not preserving
the dict ID, it continues to assign new dict IDs that don't actually make any
sense, and serializes the footer with incorrect dict IDs because of that.
**To Reproduce**
Write any record that contains at least one dictionary to a file writer that
is configured to not preserve dict IDs.
```
let inner: DictionaryArray<Int32Type> = vec!["a", "b",
"a"].into_iter().collect();
let array = Arc::new(inner) as ArrayRef;
let dctfield = Arc::new(Field::new("dict",
array.data_type().clone(), false));
let s = StructArray::from(vec![(dctfield, array)]);
let struct_array = Arc::new(s) as ArrayRef;
let schema = Arc::new(Schema::new(vec![Field::new(
"struct",
struct_array.data_type().clone(),
false,
)]));
let batch = RecordBatch::try_new(schema,
vec![struct_array]).unwrap();
let mut buf = Vec::new();
let mut writer = crate::writer::FileWriter::try_new_with_options(
&mut buf,
batch.schema_ref(),
IpcWriteOptions::default().with_preserve_dict_id(false),
)
.unwrap();
writer.write(&batch).unwrap();
writer.finish().unwrap();
drop(writer);
let mut reader = FileReader::try_new(std::io::Cursor::new(buf),
None).unwrap();
assert_eq!(batch, reader.next().unwrap().unwrap());
```
**Expected behavior**
Writing a record batch to an IPC file that contains a dict and not
preserving dict IDs works.
**Additional context**
I haven't studied the spec in detail, but it does seem odd to me that the
schema is written twice to the IPC file (once as the first message, and once in
the footer), however, at least the way it stands, this can't be changed,
because the dict IDs need to be assigned before writing the first record batch,
so this can only be changed once the preserve dict ID setting is removed
because dict IDs are never preserved.
The fix is very simple, simply create a new dictionary tracker with the same
configuration as the first time when the schema is written for the second time.
It's a 3 line fix that I already have, but I wanted to make sure to open this
issue for tracking purposes.
@tustvold @alamb
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]