[GitHub] [arrow-rs] emcake opened a new issue, #4409: Sliced record batches containing lists produce infeasibly large IPC file

via GitHub Tue, 13 Jun 2023 09:20:20 -0700


emcake opened a new issue, #4409:
URL: https://github.com/apache/arrow-rs/issues/4409


   **Describe the bug**
   When a record batch is sliced down to a subset of its rows, the slice 
reports the correct number of rows. When that slice is serialized with the IPC 
`FileWriter`, however, the resulting 'file' is far larger than its content 
warrants. I wouldn't expect the size to be exactly linear in the number of rows 
(given fixed overhead and potentially compression), but the output is still 
very large, even for a single record.
   
   **To Reproduce**
   This test serializes a full batch and a 1-row slice of it, and prints the 
size of each:
   
   ```rust
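       use std::sync::Arc;
   
       use arrow::array::{ListBuilder, UInt32Builder};
       use arrow::datatypes::{DataType, Field, Schema};
       use arrow::ipc::writer::FileWriter;
       use arrow::record_batch::RecordBatch;
   
       #[test]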
       fn encode_list_length() {
           let val_inner = Field::new("item", DataType::UInt32, true);
           let val_list_field =
               Field::new("val", DataType::List(Arc::new(val_inner)), false);
   
           let schema = Arc::new(Schema::new(vec![val_list_field]));
   
           let values = {
               let u32 = UInt32Builder::new();
               let mut ls = ListBuilder::new(u32);
   
               for i in 0..100000 {
                   for value in vec![i, i, i] {
                       ls.values().append_value(value);
                   }
                   ls.append(true)
               }
   
               ls.finish()
           };
   
           let batch =
               RecordBatch::try_new(Arc::clone(&schema), vec![Arc::new(values)]).unwrap();
   
           fn serialize_batch(rb: &RecordBatch) -> Vec<u8> {
               let mut writer =
                   FileWriter::try_new(Vec::<u8>::new(), &rb.schema()).unwrap();
               writer.write(&rb).unwrap();
               writer.finish().unwrap();
               let data = writer.into_inner().unwrap();
   
               data
           }
   
           let full_batch = serialize_batch(&batch);
   
           println!(
               "full batch = {} rows, {} bytes",
               batch.num_rows(),
               full_batch.len()
           );
   
           let sliced = batch.slice(999, 1); // slice out 1 row
   
           assert_eq!(sliced.num_rows(), 1); // confirm only 1 row
   
           let sliced_batch = serialize_batch(&sliced);
   
           println!(
               "sliced batch = {} rows, {} bytes",
               sliced.num_rows(),
               sliced_batch.len()
           );
   
           // Serializing 1 row should be significantly smaller than
           // serializing 100000.
           assert!(sliced_batch.len() < (full_batch.len() / 10));
       }
   ```
   
   Produces:
   ```
   full batch = 100000 rows, 1650646 bytes
   sliced batch = 1 rows, 1238150 bytes
   ```
   The final assertion then fails: the 1-row slice serializes to roughly 75% 
of the size of the full 100000-row batch.
   
   **Expected behavior**
   Serializing a batch of 1 row should produce far fewer bytes than 
serializing the full batch of 100k rows.
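
The sliced size of 1238150 bytes is close to the 1200000 bytes that the 300000 `u32` child values occupy, which would be consistent with the slice trimming the list offsets while the whole child values buffer is still written. That reading is an assumption, not something confirmed in this report. The sketch below is a plain-Rust model of a list column, not arrow-rs code: `ListColumn`, `slice_keep_values`, `slice_compact`, and `payload_bytes` are hypothetical names used only to contrast the two behaviours.

```rust
// Plain-Rust model of a list column. This is NOT the arrow-rs API.
// Row i owns the child values in `values[offsets[i]..offsets[i + 1]]`.
struct ListColumn {
    offsets: Vec<usize>,
    values: Vec<u32>,
}

impl ListColumn {
    /// Build the column from the report: row i holds [i, i, i].
    fn build(rows: u32) -> Self {
        let mut offsets = vec![0];
        let mut values = Vec::new();
        for i in 0..rows {
            values.extend([i, i, i]);
            offsets.push(values.len());
        }
        ListColumn { offsets, values }
    }

    /// Slice that trims the offsets but keeps every child value
    /// (the behaviour assumed to explain the large output).
    fn slice_keep_values(&self, start: usize, len: usize) -> Self {
        ListColumn {
            offsets: self.offsets[start..=start + len].to_vec(),
            values: self.values.clone(),
        }
    }

    /// Slice that keeps only the referenced child values and rebases
    /// the offsets so that they start at zero.
    fn slice_compact(&self, start: usize, len: usize) -> Self {
        let window = &self.offsets[start..=start + len];
        let (first, last) = (window[0], window[len]);
        ListColumn {
            offsets: window.iter().map(|o| o - first).collect(),
            values: self.values[first..last].to_vec(),
        }
    }

    /// Bytes a naive writer would emit, counting 4 bytes per offset
    /// (as for 32-bit list offsets) and 4 bytes per u32 value.
    fn payload_bytes(&self) -> usize {
        4 * (self.offsets.len() + self.values.len())
    }
}

fn main() {
    let full = ListColumn::build(100_000);
    let kept = full.slice_keep_values(999, 1);
    let compact = full.slice_compact(999, 1);
    println!("full    = {} bytes", full.payload_bytes());
    println!("kept    = {} bytes", kept.payload_bytes());
    println!("compact = {} bytes", compact.payload_bytes());
}
```

In this model the offsets-only slice stays within the same order of magnitude as the full column, while the compacting slice shrinks to a few bytes, which is the behaviour the expectation above asks for.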
   

