Re: [I] Report / blog on parquet metadata sizes for "large" (1000+) numbers of columns [arrow-rs]

via GitHub Thu, 16 May 2024 16:28:17 -0700


tustvold commented on issue #5770:
URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2116370344


   So I created a toy benchmark of a 10,000 column parquet file, with f32 
columns, no statistics and 10 row groups.
   
   **Default Config**
   
   The file metadata for this comes to a chunky 11MB, and it takes the #5777 
thrift decoder about 45ms to parse this thrift payload.
   
   If we drop the column count to 1000, the metadata drops to 1MB and parses in 
~3.7ms.
   
   If we also drop the row group count to 1, the metadata drops to 100KB and 
parses in 460 us.
   
   So at a very rough level the cost is ~100 bytes and ~450ns per column chunk 
(11MB and 45ms spread over 10,000 x 10 column chunks).
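
The per-column-chunk figures can be re-derived from the three measurements above. The following is only a sketch of that arithmetic: the `Measurement` type and its method names are made up for illustration, and it assumes decimal units (1MB = 1,000,000 bytes) and one column chunk per column per row group.

```rust
/// One benchmark measurement of the file metadata, as quoted above.
/// Hypothetical helper type, not part of arrow-rs.
struct Measurement {
    columns: u64,
    row_groups: u64,
    /// Assumes decimal units: 1MB = 1_000_000 bytes.
    metadata_bytes: u64,
    parse_nanos: u64,
}

impl Measurement {
    /// One column chunk per column per row group.
    fn column_chunks(&self) -> u64 {
        self.columns * self.row_groups
    }

    fn bytes_per_chunk(&self) -> f64 {
        self.metadata_bytes as f64 / self.column_chunks() as f64
    }

    fn nanos_per_chunk(&self) -> f64 {
        self.parse_nanos as f64 / self.column_chunks() as f64
    }
}

/// The three default-config measurements from the text.
fn measurements() -> [Measurement; 3] {
    [
        // 10,000 columns, 10 row groups: 11MB, 45ms
        Measurement { columns: 10_000, row_groups: 10, metadata_bytes: 11_000_000, parse_nanos: 45_000_000 },
        // 1,000 columns, 10 row groups: 1MB, ~3.7ms
        Measurement { columns: 1_000, row_groups: 10, metadata_bytes: 1_000_000, parse_nanos: 3_700_000 },
        // 1,000 columns, 1 row group: 100KB, 460us
        Measurement { columns: 1_000, row_groups: 1, metadata_bytes: 100_000, parse_nanos: 460_000 },
    ]
}

fn main() {
    let all = measurements();
    for m in all.iter() {
        println!(
            "{} column chunks: {:.0} bytes and {:.0} ns per chunk",
            m.column_chunks(),
            m.bytes_per_chunk(),
            m.nanos_per_chunk()
        );
    }
}
```

All three configurations land at roughly 100 bytes and a few hundred nanoseconds per column chunk, which is what suggests the cost is linear in the number of column chunks.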
   
   **Drop ColumnMetadata**
   
   Dropping the ColumnMetadata from ColumnChunk drops the size down to 3.6MB 
and the parse time down to 23ms.
   
   **Drop ColumnChunk**
   
   Dropping the columns drops the size down to 109KB and the parse time down 
to 718 us.
   
   At this point we have effectively dropped everything apart from the schema.
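
Taken together, the three sizes give a rough breakdown of where the 11MB goes. The sketch below assumes all three numbers refer to the same 10,000 column, 10 row group file and uses decimal units; the constant and field names are illustrative only, and the split is approximate because whatever per-row-group overhead remains is lumped in with the ColumnChunk part.

```rust
// Metadata sizes for the 10,000 column, 10 row group file, in bytes.
// Assumes decimal units (1MB = 1_000_000 bytes, 1KB = 1_000 bytes).
const FULL_BYTES: u64 = 11_000_000; // default config
const NO_COLUMN_METADATA_BYTES: u64 = 3_600_000; // ColumnMetadata dropped
const SCHEMA_ONLY_BYTES: u64 = 109_000; // columns dropped entirely
const COLUMN_CHUNKS: u64 = 10_000 * 10;

/// Approximate attribution of the full metadata size (illustrative type).
struct Breakdown {
    column_metadata: u64,
    /// ColumnChunk without its ColumnMetadata, plus per-row-group overhead.
    column_chunk_rest: u64,
    /// Schema and everything else that is left.
    schema: u64,
}

fn breakdown() -> Breakdown {
    Breakdown {
        column_metadata: FULL_BYTES - NO_COLUMN_METADATA_BYTES,
        column_chunk_rest: NO_COLUMN_METADATA_BYTES - SCHEMA_ONLY_BYTES,
        schema: SCHEMA_ONLY_BYTES,
    }
}

/// Share of the full metadata size, in percent.
fn percent(part: u64) -> f64 {
    100.0 * part as f64 / FULL_BYTES as f64
}

fn main() {
    let b = breakdown();
    println!(
        "ColumnMetadata:      {} bytes ({:.1}%), {:.1} bytes per chunk",
        b.column_metadata,
        percent(b.column_metadata),
        b.column_metadata as f64 / COLUMN_CHUNKS as f64
    );
    println!(
        "rest of ColumnChunk: {} bytes ({:.1}%), {:.1} bytes per chunk",
        b.column_chunk_rest,
        percent(b.column_chunk_rest),
        b.column_chunk_rest as f64 / COLUMN_CHUNKS as f64
    );
    println!("schema:              {} bytes ({:.1}%)", b.schema, percent(b.schema));
}
```

Under these assumptions ColumnMetadata accounts for about two thirds of the payload, the rest of ColumnChunk for a little under a third, and the schema for about 1%.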
   
   **Arrow Schema IPC**
   
   Now for comparison, I encoded a similar schema using arrow IPC to a 
flatbuffer. This came to a still pretty chunky 500KB. Validating the offsets in 
this flatbuffer takes ~1ms.
   
   **Thoughts**
   
   * There isn't anything wrong with using thrift, it is competitive with, if 
not faster than, flatbuffers (when doing full offset validation)
   * The cost of encoding `ColumnChunk` dominates and is `O(num_row_groups * 
num_columns)`
   * Having a single row group drops the latency down to 2ms even with 10,000 
columns. If each column has only a single 1MB page, such a file will already 
be around 10GB with just that one row group
   
   Phrasing the above differently, assuming single page ColumnChunks of 1MB, we 
can parse the metadata for a column chunk in 450ns. This means for a very large 
10GB file we can still parse the metadata within single-digit milliseconds.
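
As a back-of-the-envelope check of that claim, here is a minimal sketch. It assumes the ~450ns per column chunk figure from above, exactly one 1MB page per column chunk, and decimal units; the constant and function names are hypothetical, and it only covers the column chunk part of the metadata, not the schema.

```rust
/// Rough parse cost per column chunk, taken from the benchmark above.
const NANOS_PER_COLUMN_CHUNK: u64 = 450;
/// Assumption: each column chunk holds a single 1MB page (decimal units).
const BYTES_PER_COLUMN_CHUNK: u64 = 1_000_000;

/// Estimate the time to parse the column chunk metadata of a file of the
/// given size. Rounds down to whole column chunks.
fn estimated_parse_nanos(file_bytes: u64) -> u64 {
    let column_chunks = file_bytes / BYTES_PER_COLUMN_CHUNK;
    column_chunks * NANOS_PER_COLUMN_CHUNK
}

fn main() {
    let ten_gb: u64 = 10_000_000_000;
    let nanos = estimated_parse_nanos(ten_gb);
    println!(
        "{} column chunks, ~{:.1} ms to parse",
        ten_gb / BYTES_PER_COLUMN_CHUNK,
        nanos as f64 / 1e6
    );
}
```

A 10GB file under these assumptions has 10,000 column chunks, so the estimate comes to 4.5ms regardless of how those chunks are split between columns and row groups.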
   
   
    
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

