tustvold commented on issue #5770:
URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2116370344
So I created a toy benchmark of a 10,000 column parquet file, with f32
columns, no statistics and 10 row groups.
**Default Config**
The file metadata for this comes to a chunky 11MB, and it takes the #5777
thrift decoder ~45ms to parse this thrift payload.
If we drop the column count to 1000, the metadata drops to 1MB and parses in
~3.7ms.
If we also drop the row group count to 1, the metadata drops to 100KB and
parses in 460 us.
So at a very rough level the cost is ~100 bytes and ~450ns per column chunk
(11MB and 45ms spread over 10,000 x 10 = 100,000 column chunks).
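
The per-column-chunk cost can be checked by dividing each measurement by
columns x row groups. This is a minimal sketch using only the sizes and
timings quoted above. The `Sample` struct and `per_chunk` helper are
illustrative names, and 1MB is assumed to mean 1,000,000 bytes.

```rust
/// One measurement quoted above. Sizes assume 1MB = 1_000_000 bytes.
struct Sample {
    columns: u64,
    row_groups: u64,
    metadata_bytes: u64,
    parse_ns: u64,
}

/// Returns (bytes per column chunk, nanoseconds per column chunk).
fn per_chunk(s: &Sample) -> (f64, f64) {
    // Every row group holds one column chunk per column.
    let chunks = (s.columns * s.row_groups) as f64;
    (s.metadata_bytes as f64 / chunks, s.parse_ns as f64 / chunks)
}

fn main() {
    let samples = [
        Sample { columns: 10_000, row_groups: 10, metadata_bytes: 11_000_000, parse_ns: 45_000_000 },
        Sample { columns: 1_000, row_groups: 10, metadata_bytes: 1_000_000, parse_ns: 3_700_000 },
        Sample { columns: 1_000, row_groups: 1, metadata_bytes: 100_000, parse_ns: 460_000 },
    ];
    for s in &samples {
        let (bytes, ns) = per_chunk(s);
        println!(
            "{} cols x {} row groups: {:.0} bytes, {:.0} ns per column chunk",
            s.columns, s.row_groups, bytes, ns
        );
    }
}
```

All three configurations land in the same ballpark, which is what you would
expect if the cost scales with the number of column chunks.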
**Drop ColumnMetadata**
Dropping the ColumnMetadata from ColumnChunk drops the size down to 3.6MB
and the parsing time down to 23ms.
**Drop ColumnChunk**
Dropping the columns drops the size down to 109KB and the parsing time down
to 718 us.
At this point we have effectively dropped everything apart from the schema.
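
To see where the bytes go, the three sizes above can be turned into rough
shares of the full footer. This is a sketch under the assumption that the
measurements are directly comparable and that 1MB = 1,000,000 bytes. The
constant and function names are made up for illustration.

```rust
/// Metadata sizes quoted above, in bytes (assuming 1MB = 1_000_000, 1KB = 1_000).
const FULL: f64 = 11_000_000.0;
const WITHOUT_COLUMN_METADATA: f64 = 3_600_000.0;
const WITHOUT_COLUMN_CHUNKS: f64 = 109_000.0;

/// Returns the share of the full footer taken by
/// (ColumnMetadata, the rest of ColumnChunk, everything else such as the schema).
fn shares() -> (f64, f64, f64) {
    let column_metadata = FULL - WITHOUT_COLUMN_METADATA;
    let rest_of_chunk = WITHOUT_COLUMN_METADATA - WITHOUT_COLUMN_CHUNKS;
    (
        column_metadata / FULL,
        rest_of_chunk / FULL,
        WITHOUT_COLUMN_CHUNKS / FULL,
    )
}

fn main() {
    let (meta, chunk, rest) = shares();
    println!("ColumnMetadata:      {:.1}%", meta * 100.0);
    println!("rest of ColumnChunk: {:.1}%", chunk * 100.0);
    println!("schema and the rest: {:.1}%", rest * 100.0);
}
```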
**Arrow Schema IPC**
Now for comparison, I encoded a similar schema using arrow IPC to a
flatbuffer. This came to a still pretty chunky 500KB. Validating the offsets in
this flatbuffer takes ~1ms.
**Thoughts**
* There isn't anything wrong with using thrift, it is competitive if not
faster than flatbuffers (when doing full offset validation)
* The cost of encoding `ColumnChunk` dominates and is `O(num_row_groups *
num_columns)`
* Having a single row group drops the latency down to 2ms even with 10,000
columns. If each column holds just a single 1MB page, such a file is already
>10GB with that one row group
Phrasing the above differently, assuming single page ColumnChunks of 1MB, we
can parse the metadata for a column chunk in 450ns. This means for a very large
10GB file we can still parse the metadata within single-digit milliseconds.
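
The back-of-the-envelope estimate above can be written out as follows. This
is only a sketch: `estimate_parse_time` is a hypothetical helper, the 450ns
per column chunk figure is the rough number from this benchmark, and real
files with statistics or other encodings may behave differently.

```rust
use std::time::Duration;

/// Estimates footer parse time, assuming each column chunk holds
/// `chunk_bytes` of data and costs `ns_per_chunk` to decode.
fn estimate_parse_time(file_bytes: u64, chunk_bytes: u64, ns_per_chunk: u64) -> Duration {
    // Round up so a partially filled chunk still counts as one chunk.
    let chunks = (file_bytes + chunk_bytes - 1) / chunk_bytes;
    Duration::from_nanos(chunks * ns_per_chunk)
}

fn main() {
    // 10GB file, 1MB single-page column chunks, ~450ns per chunk.
    let t = estimate_parse_time(10_000_000_000, 1_000_000, 450);
    println!("estimated metadata parse time: {:?}", t);
}
```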