REASY opened a new issue, #1528:
URL: https://github.com/apache/arrow-rs/issues/1528

   When you slice a `RecordBatch` and serialize it with `StreamWriter`, it produces an incorrect result: the serialized output is nearly as large as the unsliced batch. I'm using `arrow = "11.1.0"`
   
   To reproduce, one can use the following test:
   ```rust
   #[cfg(test)]
   mod tests {
       use std::sync::Arc;
       use arrow::array::{Int32Array, StringArray};
       use arrow::datatypes::{DataType, Field, Schema};
       use arrow::ipc::writer::StreamWriter;
       use arrow::record_batch::RecordBatch;

       #[test]
       fn it_works() {
           // Serialize a batch to the Arrow IPC stream format and return the bytes.
           pub fn serialize(record: &RecordBatch) -> Vec<u8> {
               let buffer: Vec<u8> = Vec::new();
               let mut stream_writer = StreamWriter::try_new(buffer, &record.schema()).unwrap();
               stream_writer.write(record).unwrap();
               stream_writer.finish().unwrap();
               stream_writer.into_inner().unwrap()
           }

           // Build a two-column (Int32, Utf8) batch with `rows` identical rows.
           fn create_batch(rows: usize) -> RecordBatch {
               let schema = Schema::new(vec![
                   Field::new("a", DataType::Int32, false),
                   Field::new("b", DataType::Utf8, false),
               ]);

               let a = Int32Array::from(vec![1; rows]);
               let b = StringArray::from(vec!["a"; rows]);

               RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a), Arc::new(b)]).unwrap()
           }

           let big_record_batch = create_batch(65536);
           println!(
               "big_record_batch with dimension ({}, {}) (rows x cols) serialized as Apache Arrow IPC in {} bytes",
               big_record_batch.num_rows(),
               big_record_batch.num_columns(),
               serialize(&big_record_batch).len()
           );

           let length = 5;
           let small_record_batch = create_batch(length);
           println!(
               "small_record_batch with dimension ({}, {}) (rows x cols) serialized as Apache Arrow IPC in {} bytes",
               small_record_batch.num_rows(),
               small_record_batch.num_columns(),
               serialize(&small_record_batch).len()
           );

           let offset = 2;
           let record_batch_slice = big_record_batch.slice(offset, length);
           println!(
               "(Sliced): record_batch_slice with dimension ({}, {}) (rows x cols) serialized as Apache Arrow IPC in {} bytes",
               record_batch_slice.num_rows(),
               record_batch_slice.num_columns(),
               serialize(&record_batch_slice).len()
           );
       }
   }
   ```
   As you can see, the sliced batch serializes to almost the same size as `big_record_batch`, but I would expect it to be the same size as `small_record_batch`:
   ```
   big_record_batch with dimension (65536, 2) (rows x cols) serialized as Apache Arrow IPC in 606608 bytes
   small_record_batch with dimension (5, 2) (rows x cols) serialized as Apache Arrow IPC in 464 bytes
   (Sliced): record_batch_slice with dimension (5, 2) (rows x cols) serialized as Apache Arrow IPC in 590240 bytes
   ```
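   
   Until the writer accounts for a slice's offset, a possible workaround is to copy the sliced rows into fresh, compacted buffers before serializing. Below is a minimal sketch using `arrow::compute::take`; the `compact` helper is hypothetical (not part of the arrow API) and assumes every column type supports `take`:
   ```rust
   use arrow::array::UInt32Array;
   use arrow::compute::take;
   use arrow::record_batch::RecordBatch;

   /// Hypothetical helper: rebuild `batch` so its buffers hold only the
   /// visible rows, dropping data hidden by a prior `slice`.
   fn compact(batch: &RecordBatch) -> RecordBatch {
       // Indices 0..num_rows select every visible row; `take` copies them
       // into newly allocated arrays with zero offset.
       let indices = UInt32Array::from((0..batch.num_rows() as u32).collect::<Vec<u32>>());
       let columns = batch
           .columns()
           .iter()
           .map(|col| take(col.as_ref(), &indices, None).unwrap())
           .collect::<Vec<_>>();
       RecordBatch::try_new(batch.schema(), columns).unwrap()
   }
   ```
   With that helper, `serialize(&compact(&record_batch_slice))` should come out close to the `small_record_batch` size, since only the five selected rows are copied.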

