alamb commented on issue #7723:
URL: https://github.com/apache/arrow-rs/issues/7723#issuecomment-3025472832
> Sorry, I did not get this sentence. Could you elaborate more? Any details
> about the system architecture could have nice stories, and I don’t want to miss
> them.
I am trying to offer an additional theory about why no one yet seems to have
spent as much time on parquet optimization as you are considering.
Basically, I am trying to write the introduction / justification for an academic
paper on this topic 😆
My theory is that monolithic ("shared nothing") architectures must trade a
fixed set of resources between ingest, query, and data reorganization
(compaction, etc.). Thus the resource budget available for high-intensity
storage optimization is far more constrained.
However, in disaggregated architectures you have the flexibility to scale up
memory/CPU resources for the reorganization process (and turn them off when not
needed), which makes it more reasonable to throw an additional order of
magnitude of CPU/memory at the data reorganization problem (aka optimizing
parquet files).
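The tradeoff above can be sketched as a toy resource model. All numbers and
function names here are hypothetical illustrations, not measurements from any
real system:

```python
# Toy model (hypothetical numbers) contrasting a fixed "shared nothing"
# resource budget with a disaggregated one where reorganization capacity
# scales independently and can be turned off when idle.

MONOLITHIC_BUDGET = 16  # total CPU cores shared by all workloads


def monolithic_compaction_cores(ingest_cores: int, query_cores: int) -> int:
    """Compaction only gets whatever is left after ingest and query."""
    return max(MONOLITHIC_BUDGET - ingest_cores - query_cores, 0)


def disaggregated_compaction_cores(active: bool, scale: int = 10) -> int:
    """A separate reorganization fleet can scale up ~10x, then scale to zero."""
    return scale * MONOLITHIC_BUDGET if active else 0


# Under load, the monolith can spare only a couple of cores for optimization:
print(monolithic_compaction_cores(ingest_cores=8, query_cores=6))  # 2
# A disaggregated system can rent an order of magnitude more capacity:
print(disaggregated_compaction_cores(active=True))  # 160
# and pay nothing while optimization is not running:
print(disaggregated_compaction_cores(active=False))  # 0
```

The point of the sketch is only that the monolith's optimization budget is a
leftover, while the disaggregated system's is an independent dial.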
Not sure if that makes sense.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]