XiangpengHao commented on issue #5855:
URL: https://github.com/apache/arrow-rs/issues/5855#issuecomment-2154960257

   I thought about an alternative (but similar) approach to [Pinterest's 
solution](https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4)
 -- instead of decoding and building the in-memory structs in a single pass, 
we can decouple the two steps and do them in two passes.
   
   In the first pass, we decode the thrift metadata but do not build the 
in-memory structures (i.e., no memory allocation). Instead, we only track the 
locations of the important structures. Specifically, rather than building each 
column chunk struct as we encounter it, we record the column chunk's offset 
into the buffer.
   
   In the second pass, we build the actual in-memory data structures on 
demand, using the offsets tracked in the first pass.
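   To make the two passes concrete, here is a minimal Rust sketch. The names 
(`LazyFooter`, `LazyColumnChunk`, `skip_struct`, `decode_column_chunk`) and 
the toy length-prefixed encoding are hypothetical stand-ins, not the parquet 
crate's API -- the real footer is thrift compact protocol -- but the shape is 
the same: `index` records byte ranges without building per-column structs, and 
`column_chunk` decodes a single struct only when asked.
   
   ```rust
   // Sketch only: a toy format where each encoded column chunk is a
   // 1-byte length prefix followed by its payload. The real footer is
   // thrift compact protocol, but the skipping logic has the same shape.
   
   /// Pass 1 output: the byte range of one encoded column chunk.
   /// Nothing is decoded yet; this is just an offset table entry.
   struct LazyColumnChunk {
       offset: usize,
       len: usize,
   }
   
   /// Footer indexed by pass 1, decodable on demand in pass 2.
   struct LazyFooter<'a> {
       buf: &'a [u8],
       column_chunks: Vec<LazyColumnChunk>,
   }
   
   /// Hypothetical fully decoded column chunk built in pass 2.
   #[derive(Debug)]
   struct ColumnChunkMeta {
       raw: Vec<u8>, // stand-in for the real decoded fields
   }
   
   impl<'a> LazyFooter<'a> {
       /// First pass: walk the buffer, skipping over each encoded column
       /// chunk while recording where it starts and ends. The only
       /// allocation is the offset table itself.
       fn index(buf: &'a [u8]) -> Self {
           let mut column_chunks = Vec::new();
           let mut pos = 0;
           while pos < buf.len() {
               let end = skip_struct(buf, pos);
               column_chunks.push(LazyColumnChunk { offset: pos, len: end - pos });
               pos = end;
           }
           Self { buf, column_chunks }
       }
   
       /// Second pass: fully decode a single column chunk only when asked.
       fn column_chunk(&self, i: usize) -> ColumnChunkMeta {
           let cc = &self.column_chunks[i];
           decode_column_chunk(&self.buf[cc.offset..cc.offset + cc.len])
       }
   }
   
   /// Advance past one encoded struct without building anything
   /// (toy format: a 1-byte length prefix followed by the payload).
   fn skip_struct(buf: &[u8], pos: usize) -> usize {
       pos + 1 + buf[pos] as usize
   }
   
   /// Decode one encoded struct (toy: drop the length prefix).
   fn decode_column_chunk(bytes: &[u8]) -> ColumnChunkMeta {
       ColumnChunkMeta { raw: bytes[1..].to_vec() }
   }
   
   fn main() {
       // Two toy "column chunks": [len, payload...].
       let buf = [2u8, 10, 11, 3, 20, 21, 22];
       let footer = LazyFooter::index(&buf);
       assert_eq!(footer.column_chunks.len(), 2);
       // Decode only the second chunk; the first is never materialized.
       println!("{:?}", footer.column_chunk(1));
   }
   ```
   
   Because laziness is hidden behind the offset table, callers can keep asking 
for fully decoded structs; only the cost of producing them moves to the point 
of use.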
   
   This approach has the advantage of selective decoding (faster, lower memory 
consumption, etc.) and does not require changing the decoding API (unlike the 
Pinterest approach). However, it is suboptimal if we actually need to decode 
the entire metadata, in which case the first pass is pure overhead. Presuming 
that machine learning workloads (or wide tables in general) read only a small 
fraction of the columns -- i.e., are highly selective -- we should still save 
quite a lot.
   

