XiangpengHao commented on issue #5855: URL: https://github.com/apache/arrow-rs/issues/5855#issuecomment-2154960257
I thought about an alternative (but similar) approach to [Pinterest's solution](https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4): instead of decoding and building in-memory structs along the way, we can decouple the two and make it two passes.

In the first pass, decode the thrift metadata but do not build the in-memory structures (i.e., no memory allocation). Instead, only track the locations of the important structures. Specifically, rather than building each column chunk struct as we encounter it, we record the column chunk's byte offset within the buffer.

In the second pass, we build the actual in-memory data structures on demand, using the offsets tracked in the first pass.

This approach has the advantage of selective decoding (faster, lower memory consumption, etc.) and does not require changing the decoding API (unlike the Pinterest approach). However, it is suboptimal if we actually need to decode the entire metadata, in which case the first pass is pure overhead. Assuming that machine-learning workloads (and wide tables in general) exhibit high selectivity, we should still save quite a lot.
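Here is a minimal sketch of the two-pass shape I have in mind. None of this is the arrow-rs API: `MetadataIndex`, `ChunkLocation`, and `ColumnChunk` are hypothetical names, and for illustration each "chunk" is a toy u32-length-prefixed blob rather than a real thrift struct (a real first pass would skip thrift compact-protocol fields instead):

```rust
/// Byte range of one serialized chunk inside the metadata buffer (hypothetical).
#[derive(Clone, Copy)]
struct ChunkLocation {
    offset: usize,
    len: usize,
}

/// Decoded stand-in for a column chunk struct (hypothetical).
#[derive(Debug)]
struct ColumnChunk {
    payload: Vec<u8>,
}

/// Index produced by the first pass: only offsets, no decoded structs.
struct MetadataIndex<'a> {
    buf: &'a [u8],
    chunks: Vec<ChunkLocation>,
}

impl<'a> MetadataIndex<'a> {
    /// Pass 1: pure cursor arithmetic over the serialized bytes; nothing is
    /// decoded, and nothing is allocated per chunk beyond the (offset, len) pair.
    fn build(buf: &'a [u8]) -> Self {
        let mut chunks = Vec::new();
        let mut pos = 0;
        while pos + 4 <= buf.len() {
            let len = u32::from_le_bytes(buf[pos..pos + 4].try_into().unwrap()) as usize;
            chunks.push(ChunkLocation { offset: pos + 4, len });
            pos += 4 + len; // skip the payload without looking at it
        }
        MetadataIndex { buf, chunks }
    }

    /// Pass 2, on demand: materialize only the requested chunk.
    fn decode_chunk(&self, i: usize) -> ColumnChunk {
        let loc = self.chunks[i];
        ColumnChunk {
            payload: self.buf[loc.offset..loc.offset + loc.len].to_vec(),
        }
    }
}

fn main() {
    // Serialize three fake chunks, each with a u32 length prefix.
    let mut buf = Vec::new();
    for chunk in [&b"chunk-a"[..], &b"chunk-b"[..], &b"chunk-c"[..]] {
        buf.extend_from_slice(&(chunk.len() as u32).to_le_bytes());
        buf.extend_from_slice(chunk);
    }

    // First pass indexes all chunks; second pass decodes just one.
    let index = MetadataIndex::build(&buf);
    assert_eq!(index.chunks.len(), 3);
    println!("{:?}", index.decode_chunk(1)); // only "chunk-b" is materialized
}
```

The point of the sketch is the cost split: pass 1 stays cheap even for very wide schemas because it is just cursor math over the footer bytes, while pass 2 allocates only for the column chunks a reader actually asks for.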
