matthewmturner commented on issue #208:
URL: https://github.com/apache/arrow-rs/issues/208#issuecomment-917069278
@alamb I think I'm restating the obvious and what has already been said, but
I want to make sure I understand what's happening, so I made a small sample.
```
use arrow::array::{Array, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use std::sync::Arc;

pub fn test_record_batch_size() {
    let arr_data = vec![1, 2, 3, 4, 5];
    let val_data = vec![5, 6, 7, 8, 9];
    let id_arr = Int32Array::from(arr_data);
    let val_arr = Int32Array::from(val_data);

    // Take 3-element slices starting at offset 1; these share the
    // parent arrays' buffers rather than copying.
    let id_arr_slice = id_arr.slice(1, 3);
    let val_arr_slice = val_arr.slice(1, 3);

    let schema = Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("val", DataType::Int32, false),
    ]);
    let batch =
        RecordBatch::try_new(Arc::new(schema), vec![id_arr_slice, val_arr_slice]).unwrap();

    println!("{:?}", batch);
    for column in batch.columns() {
        println!("{:?}", column.data());
    }
}
```
This produces the following output:
```
RecordBatch { schema: Schema { fields: [Field { name: "id", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "val", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }], metadata: {} }, columns: [PrimitiveArray<Int32>
[
  2,
  3,
  4,
], PrimitiveArray<Int32>
[
  6,
  7,
  8,
]] }
ArrayData { data_type: Int32, len: 3, null_count: 0, offset: 1, buffers: [Buffer { data: Bytes { ptr: 0x149e06c40, len: 20, data: [1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0] }, offset: 0 }], child_data: [], null_bitmap: None }
ArrayData { data_type: Int32, len: 3, null_count: 0, offset: 1, buffers: [Buffer { data: Bytes { ptr: 0x149e06d00, len: 20, data: [5, 0, 0, 0, 6, 0, 0, 0, 7, 0, 0, 0, 8, 0, 0, 0, 9, 0, 0, 0] }, offset: 0 }], child_data: [], null_bitmap: None }
```
And the issue is that each sliced array's data buffer still points to the original, larger allocation (note `len: 20` with `offset: 1` above). That full buffer is then what ultimately gets turned into the `FlightData`, which is a waste.
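For what it's worth, the same thing can be asserted programmatically at the end of the sample above, assuming I'm reading the `ArrayData` accessors right (20 bytes = 5 values * 4 bytes for `Int32`):
```
// Appended to test_record_batch_size() above: the slice is 3 values long
// and starts at offset 1, but its buffer is still the parent's full 20 bytes.
let data = batch.column(0).data();
assert_eq!(data.len(), 3);
assert_eq!(data.offset(), 1);
assert_eq!(data.buffers()[0].len(), 20);
```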
Assuming that's all correct, is there a preference as to where a fix should be applied, i.e. at `flight_data_from_arrow_batch`, `encoded_batch`, or `record_batch_to_bytes`?
Naively I was thinking at the `record_batch_to_bytes` level, but I think that might impact IPC in general. I'm still figuring out the separation between IPC and Flight functionality, though, and whether this issue is focused only on updating how array data is handled for Flight or for IPC in general. If we wanted the fix to be closer to the Flight level, then I think copying the `RecordBatch` in `flight_data_from_arrow_batch` before passing it to `encoded_batch` would be the way.
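To make that concrete, here is a rough sketch of the kind of copy I have in mind: a hypothetical `compact_slice` helper built on `MutableArrayData` (assuming `extend` takes logical row indices and accounts for the source offset; I haven't verified this against the Flight path yet):
```
use arrow::array::{make_array, Array, ArrayRef, MutableArrayData};

/// Hypothetical helper: copy just the logical range of `array` into a
/// fresh zero-offset array whose buffer no longer references the parent
/// allocation.
fn compact_slice(array: &ArrayRef) -> ArrayRef {
    let mut mutable = MutableArrayData::new(vec![array.data()], false, array.len());
    // Copy rows [0, len) of the slice, i.e. only the 3 values, not all 5.
    mutable.extend(0, 0, array.len());
    make_array(mutable.freeze())
}
```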
What do you think?
Separately, I've been looking for any methods / helpers for recreating a `RecordBatch` out of the data / offsets / len of another `RecordBatch`. I don't think I've found anything, though. If that's the case, would the idea be to just remake the batch from scratch with the data from the original?
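If so, this is roughly what I imagine the rebuild would look like, reusing the hypothetical `compact_slice` from the sketch above:
```
use arrow::error::Result;
use arrow::record_batch::RecordBatch;

/// Hypothetical helper: rebuild a batch so every column owns a buffer
/// trimmed to its own length (uses compact_slice from the sketch above).
fn compact_batch(batch: &RecordBatch) -> Result<RecordBatch> {
    let columns = batch.columns().iter().map(compact_slice).collect();
    RecordBatch::try_new(batch.schema(), columns)
}
```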
Hope that's all clear.