emcake opened a new issue, #4409:
URL: https://github.com/apache/arrow-rs/issues/4409
**Describe the bug**
When a record batch is sliced down to a subset, the slice reports the correct
number of rows. However, when the slice is serialized with the IPC `FileWriter`,
the resulting 'file' is very large relative to the amount of content. I
wouldn't expect the size to be linear in the number of rows (given overhead and
potentially compression), but the output is still very large, even for a single
record.
**To Reproduce**
This test serializes a one-row slice and prints how large the serialized slice is:
```rust
use std::sync::Arc;

use arrow::array::{ListBuilder, UInt32Builder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::writer::FileWriter;
use arrow::record_batch::RecordBatch;

#[test]
fn encode_list_length() {
    // Schema: a single non-nullable List<UInt32> column named "val".
    let val_inner = Field::new("item", DataType::UInt32, true);
    let val_list_field = Field::new("val", DataType::List(Arc::new(val_inner)), false);
    let schema = Arc::new(Schema::new(vec![val_list_field]));

    // Build 100000 rows, each a list of three identical values.
    let values = {
        let u32 = UInt32Builder::new();
        let mut ls = ListBuilder::new(u32);
        for i in 0..100000 {
            for value in [i, i, i] {
                ls.values().append_value(value);
            }
            ls.append(true);
        }
        ls.finish()
    };

    let batch = RecordBatch::try_new(Arc::clone(&schema), vec![Arc::new(values)]).unwrap();

    // Write one batch with the IPC file writer and return the encoded bytes.
    fn serialize_batch(rb: &RecordBatch) -> Vec<u8> {
        let mut writer = FileWriter::try_new(Vec::<u8>::new(), &rb.schema()).unwrap();
        writer.write(rb).unwrap();
        writer.finish().unwrap();
        writer.into_inner().unwrap()
    }

    let full_batch = serialize_batch(&batch);
    println!(
        "full batch = {} rows, {} bytes",
        batch.num_rows(),
        full_batch.len()
    );

    let sliced = batch.slice(999, 1); // slice out 1 row
    assert_eq!(sliced.num_rows(), 1); // confirm only 1 row
    let sliced_batch = serialize_batch(&sliced);
    println!(
        "sliced batch = {} rows, {} bytes",
        sliced.num_rows(),
        sliced_batch.len()
    );

    // Serializing 1 row should be significantly smaller than serializing 100000.
    assert!(sliced_batch.len() < (full_batch.len() / 10));
}
```
Produces:
```
full batch = 100000 rows, 1650646 bytes
sliced batch = 1 rows, 1238150 bytes
```
The assertion then fails, because the sliced batch still serializes to about 75% of the size of the full batch.
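One possible reading of these numbers, offered as an assumption rather than a confirmed diagnosis: the sizes look consistent with the list offsets being truncated to the slice while the child `UInt32` values buffer is written in full. The sketch below is plain arithmetic and does not call into arrow. The helper names (`offsets_bytes`, `values_bytes`, `estimate`) are hypothetical, and the estimate ignores validity bitmaps, padding, and IPC metadata.

```rust
/// Bytes in the i32 offsets buffer of a List array with `rows` rows
/// (a list array stores rows + 1 offsets).
fn offsets_bytes(rows: usize) -> usize {
    (rows + 1) * std::mem::size_of::<i32>()
}

/// Bytes in a UInt32 values buffer holding `values` elements.
fn values_bytes(values: usize) -> usize {
    values * std::mem::size_of::<u32>()
}

/// Rough buffer-only size estimate for a List<UInt32> column.
/// ASSUMPTION: ignores validity bitmaps, padding, and IPC metadata.
fn estimate(rows_written: usize, child_values_written: usize) -> usize {
    offsets_bytes(rows_written) + values_bytes(child_values_written)
}
```

Under these assumptions, `estimate(100_000, 300_000)` gives 1,600,004 bytes, close to the reported 1,650,646 for the full batch. `estimate(1, 300_000)` gives 1,200,008 bytes, close to the reported 1,238,150 for the slice. If the child values were also truncated to the slice, `estimate(1, 3)` would be only 20 bytes of buffer data.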
**Expected behavior**
The serialized size of a 1-row batch should be much smaller than that of a
100k-row batch.