XiangpengHao commented on issue #5855: URL: https://github.com/apache/arrow-rs/issues/5855#issuecomment-2154960257
I thought about an alternative (but similar) approach to [Pinterest's solution](https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4): instead of decoding and building in-memory structs along the way, we can decouple the two and make it two passes.

In the first pass, decode the thrift metadata but do not build the in-memory structures (i.e., no memory allocation). Instead, only track the locations of the important structures. Specifically, rather than building each column chunk struct as we encounter it, we record the column chunk's byte offset within the buffer.

In the second pass, we build the actual in-memory data structures on demand, using the offsets tracked in the first pass.

This approach has the advantage of selective decoding (faster, lower memory consumption, etc.) and does not require changing the decoding API (unlike the Pinterest approach). However, it is suboptimal if we actually need to decode the entire metadata, in which case the first pass is pure overhead. Assuming that machine-learning workloads (and wide tables in general) exhibit high selectivity, we should still save quite a lot.
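Here is a minimal sketch of the two-pass shape I have in mind. None of this is the arrow-rs API: `MetadataIndex`, `ChunkLocation`, and `ColumnChunk` are hypothetical names, and for illustration each "chunk" is a toy u32-length-prefixed blob rather than a real thrift struct (a real first pass would skip thrift compact-protocol fields instead):

```rust
/// Byte range of one serialized chunk inside the metadata buffer (hypothetical).
#[derive(Clone, Copy)]
struct ChunkLocation {
    offset: usize,
    len: usize,
}

/// Decoded stand-in for a column chunk struct (hypothetical).
#[derive(Debug)]
struct ColumnChunk {
    payload: Vec<u8>,
}

/// Index produced by the first pass: only offsets, no decoded structs.
struct MetadataIndex<'a> {
    buf: &'a [u8],
    chunks: Vec<ChunkLocation>,
}

impl<'a> MetadataIndex<'a> {
    /// Pass 1: pure cursor arithmetic over the serialized bytes; nothing is
    /// decoded, and nothing is allocated per chunk beyond the (offset, len) pair.
    fn build(buf: &'a [u8]) -> Self {
        let mut chunks = Vec::new();
        let mut pos = 0;
        while pos + 4 <= buf.len() {
            let len = u32::from_le_bytes(buf[pos..pos + 4].try_into().unwrap()) as usize;
            chunks.push(ChunkLocation { offset: pos + 4, len });
            pos += 4 + len; // skip the payload without looking at it
        }
        MetadataIndex { buf, chunks }
    }

    /// Pass 2, on demand: materialize only the requested chunk.
    fn decode_chunk(&self, i: usize) -> ColumnChunk {
        let loc = self.chunks[i];
        ColumnChunk {
            payload: self.buf[loc.offset..loc.offset + loc.len].to_vec(),
        }
    }
}

fn main() {
    // Serialize three fake chunks, each with a u32 length prefix.
    let mut buf = Vec::new();
    for chunk in [&b"chunk-a"[..], &b"chunk-b"[..], &b"chunk-c"[..]] {
        buf.extend_from_slice(&(chunk.len() as u32).to_le_bytes());
        buf.extend_from_slice(chunk);
    }

    // First pass indexes all chunks; second pass decodes just one.
    let index = MetadataIndex::build(&buf);
    assert_eq!(index.chunks.len(), 3);
    println!("{:?}", index.decode_chunk(1)); // only "chunk-b" is materialized
}
```

The point of the sketch is the cost split: pass 1 stays cheap even for very wide schemas because it is just cursor math over the footer bytes, while pass 2 allocates only for the column chunks a reader actually asks for.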
