stevenliebregt opened a new issue, #1627:
URL: https://github.com/apache/arrow-rs/issues/1627

   **Which part is this question about**
   The parquet file writer usage
   
   **Describe your question**
   Hi, I'm looking into whether parquet and arrow could fit a use case of mine, but I've run into a strange issue, for which I can find no answer in the documentation. I have two input files in txt format, where each record spans 4 lines. I have a parser that reads them just fine, and I want to convert that format to a parquet file. The two input files are around 600 MB combined, but when I write them to a parquet file, the resulting file is nearly 5 GB, and writing consumes around 6 to 7 GB of memory. I have turned on compression.
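   
   For context, the parser side is essentially this shape (a minimal sketch, not my actual parser; `Record` and `read_records` are illustrative names whose fields mirror the schema below):
   
   ```rust
   use std::fs::File;
   use std::io::{BufRead, BufReader};
   
   // Illustrative record type; the fields mirror the parquet schema.
   struct Record {
       id: String,
       header: String,
       sequence: String,
       quality: String,
   }
   
   // Stream records four lines at a time, so the whole input never
   // has to sit in memory at once.
   fn read_records(path: &str) -> std::io::Result<impl Iterator<Item = Record>> {
       let mut lines = BufReader::new(File::open(path)?).lines();
       Ok(std::iter::from_fn(move || {
           Some(Record {
               id: lines.next()?.ok()?,
               header: lines.next()?.ok()?,
               sequence: lines.next()?.ok()?,
               quality: lines.next()?.ok()?,
           })
       }))
   }
   ```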
   
   ```rust
   use std::sync::Arc;
   use parquet::basic::Compression;
   use parquet::file::properties::WriterProperties;
   use parquet::schema::parser::parse_message_type;
   
   let message_type = "
       message Schema {
           REQUIRED BINARY id (UTF8);
           REQUIRED BINARY header (UTF8);
           REQUIRED BINARY sequence (UTF8);
           REQUIRED BINARY quality (UTF8);
       }
   ";
   
   let schema = Arc::new(parse_message_type(message_type).unwrap());
   let props = Arc::new(
       WriterProperties::builder()
           .set_compression(Compression::SNAPPY)
           .build(),
   );
   ```
   
   My Rust configuration for the writer.
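   
   The writing loop itself is driven roughly like this (a sketch against the parquet crate's low-level `SerializedFileWriter`; `write_chunk` and the chunk-per-row-group framing are illustrative, not my exact code):
   
   ```rust
   use std::fs::File;
   use parquet::data_type::{ByteArray, ByteArrayType};
   use parquet::errors::Result;
   use parquet::file::writer::SerializedFileWriter;
   
   // Write one chunk of records as a single row group. Columns must be
   // written in schema order: id, header, sequence, quality.
   fn write_chunk(
       writer: &mut SerializedFileWriter<File>,
       columns: &[Vec<ByteArray>; 4],
   ) -> Result<()> {
       let mut row_group = writer.next_row_group()?;
       for values in columns {
           let mut col = row_group.next_column()?.expect("four BINARY columns");
           col.typed::<ByteArrayType>().write_batch(values, None, None)?;
           col.close()?;
       }
       // Closing the row group flushes its pages, so memory should stay
       // proportional to the chunk, not to the whole output file.
       row_group.close()?;
       Ok(())
   }
   ```
   
   The idea is to call that in a loop over fixed-size chunks of records and finish with `writer.close()`.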
   

