jdroenner opened a new issue, #4724:
URL: https://github.com/apache/arrow-rs/issues/4724
(Don't really know if this is a bug or a feature)
**Describe the bug**
We have a use-case where we want to store Arrow data (`StructArray`)
compressed in memory. Currently the best way to do this appears to be writing
with the IPC `FileWriter` with compression enabled and then storing the
compressed bytes. For our use-case it is important that the data, once
decompressed, occupies the same amount of memory as the original input.

Using the `FileReader`, we noticed that this assumption does not hold: in
some cases the memory footprint of the decompressed data becomes roughly
"decompressed bytes + compressed bytes".
If `CompressionCodec::decompress_to_buffer` detects an uncompressed slice of
data (`decompressed_length == LENGTH_NO_COMPRESSED_DATA`), it returns a new
`Buffer` that points into the `Buffer` holding the raw bytes produced by the
reader. This keeps the entire raw `Buffer` alive in memory. The compressed
data slices, on the other hand, are decompressed into new `Buffer`s backed by
freshly allocated memory.
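The retention effect can be illustrated without Arrow at all: a zero-copy slice that shares its parent allocation keeps the whole parent alive. Below is a minimal sketch using `Arc<Vec<u8>>` as a stand-in for Arrow's `Buffer` (the type and method names here are illustrative, not the arrow-rs API):

```rust
use std::sync::Arc;

/// Minimal stand-in for a zero-copy slice of a shared buffer,
/// analogous in spirit to slicing a `Buffer` in arrow-rs.
struct SharedSlice {
    parent: Arc<Vec<u8>>, // keeps the *entire* parent allocation alive
    offset: usize,
    len: usize,
}

impl SharedSlice {
    fn as_bytes(&self) -> &[u8] {
        &self.parent[self.offset..self.offset + self.len]
    }
    /// Memory actually retained: the whole parent, not just the slice.
    fn retained_bytes(&self) -> usize {
        self.parent.len()
    }
}

fn main() {
    // A "raw" IPC body: 1000 bytes, of which only 10 belong to the
    // uncompressed block we want to expose.
    let raw = Arc::new(vec![0u8; 1000]);
    let slice = SharedSlice { parent: Arc::clone(&raw), offset: 8, len: 10 };
    drop(raw); // the reader's own handle goes away...
    // ...but the slice still pins all 1000 bytes.
    assert_eq!(slice.as_bytes().len(), 10);
    assert_eq!(slice.retained_bytes(), 1000);
}
```

This is why the observed footprint is roughly "decompressed bytes + compressed bytes": each zero-copy view into the raw IPC body pins the full raw allocation for as long as any of the views is alive.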
**To Reproduce**
The easiest way is to create a `StructArray` with one column that compresses
well and one that does not:
```rust
use std::{io::Cursor, sync::Arc};

use arrow::{
    array::{Array, ArrayRef, Int32Array, StringArray, StructArray},
    datatypes::{DataType, Field, Fields},
    ipc::{
        writer::{FileWriter, IpcWriteOptions},
        CompressionType,
    },
    record_batch::RecordBatch,
};

fn main() {
    // Set up a StructArray with two columns
    let primes = Arc::new(Int32Array::from(vec![
        31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79,
    ]));
    let strings = Arc::new(StringArray::from(vec![
        "Hello world one",
        "Hello world two",
        "Hello world three",
        "Hello world four",
        "Hello world five",
        "Hello world six",
        "Hello world seven",
        "Hello world eight",
        "Hello world nine",
        "Hello world ten",
        "Hello world eleven",
        "Hello world twelve",
    ]));
    let table = StructArray::new(
        Fields::from(vec![
            Field::new("primes", DataType::Int32, false),
            Field::new("strings", DataType::Utf8, false),
        ]),
        vec![primes as ArrayRef, strings as ArrayRef],
        None,
    );

    // Print the size of the original data
    println!("Original size: {:?}", table.get_array_memory_size());

    // Write to an in-memory IPC file with compression
    let record_batch = RecordBatch::from(&table);
    let mut file_writer = FileWriter::try_new_with_options(
        Vec::new(),
        record_batch.schema().as_ref(),
        IpcWriteOptions::default()
            .try_with_compression(Some(CompressionType::LZ4_FRAME))
            .unwrap(),
    )
    .unwrap();
    file_writer.write(&record_batch).unwrap();
    file_writer.finish().unwrap();
    let bytes = file_writer.into_inner().unwrap();

    // Print the size of the compressed data
    println!("Compressed size: {:?}", bytes.len());

    // Read back from the IPC file and decompress
    let mut reader =
        arrow::ipc::reader::FileReader::try_new(Cursor::new(bytes), None).unwrap();
    let record_batch = reader.next().unwrap().unwrap();
    let sr = StructArray::from(record_batch);

    // Print the size of the decompressed data
    println!("Decompressed size: {:?}", sr.get_array_memory_size());
}
```
The output will be something like this:
```bash
Original size: 688
Compressed size: 950
Decompressed size: 1104
```
**Expected behavior**
We expected the decompressed data to use the same amount of memory as the
original data.
An easy way to fix this would be to change the "uncompressed inside
compressed" case in `decompress_to_buffer` to return a real copy of the
uncompressed data slice.
However, this easy fix might cause many memcpys in cases where compression
was requested but much of the data could not be compressed. To make the
behavior a user choice, we could add a "copy_uncompressed" parameter to the
`decompress_to_buffer` method. Since users only have access to the readers,
this should probably be exposed as a reader option that is passed down to the
decompression step...
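The proposed user choice could look roughly like the sketch below. This is a simplified model, not the real arrow-rs `decompress_to_buffer` signature; `Decompressed`, the `copy_uncompressed` flag, and the `Arc<Vec<u8>>` stand-in for `Buffer` are all illustrative assumptions:

```rust
use std::sync::Arc;

/// Hypothetical result of decompressing one block: either a zero-copy
/// view into the raw IPC bytes, or an owned copy (illustrative only).
enum Decompressed {
    /// Pins the whole parent allocation, but avoids a memcpy.
    Shared { parent: Arc<Vec<u8>>, offset: usize, len: usize },
    /// Frees the parent once all views are dropped, at the cost of a copy.
    Owned(Vec<u8>),
}

/// Sketch of the "uncompressed inside compressed" branch with the
/// proposed `copy_uncompressed` user choice.
fn decompress_uncompressed_block(
    raw: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
    copy_uncompressed: bool,
) -> Decompressed {
    if copy_uncompressed {
        // Real copy: retains only `len` bytes.
        Decompressed::Owned(raw[offset..offset + len].to_vec())
    } else {
        // Current zero-copy behavior: retains the whole raw buffer.
        Decompressed::Shared { parent: raw, offset, len }
    }
}

fn main() {
    let raw = Arc::new(vec![7u8; 1000]);
    let retained = |d: &Decompressed| match d {
        Decompressed::Shared { parent, .. } => parent.len(),
        Decompressed::Owned(v) => v.len(),
    };
    let copied = decompress_uncompressed_block(Arc::clone(&raw), 8, 10, true);
    let shared = decompress_uncompressed_block(Arc::clone(&raw), 8, 10, false);
    assert_eq!(retained(&copied), 10);   // only the slice is retained
    assert_eq!(retained(&shared), 1000); // whole raw buffer stays pinned
}
```

A per-reader option would let users who care about memory footprint pay the memcpy cost, while keeping the current zero-copy default for everyone else.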
**Additional context**