alamb commented on issue #7723:
URL: https://github.com/apache/arrow-rs/issues/7723#issuecomment-3025472832
> Sorry, I did not get this sentence. Could you elaborate more? Any details
> about the system architecture could have nice stories, and I don’t want to miss
> them.
I am trying to offer an additional theory about why no one yet seems to have
spent as much time on parquet optimization as you are considering.
Basically, I am trying to write the introduction / justification for an academic
paper on this topic 😆
My theory is that monolithic ("shared nothing") architectures must trade a
fixed set of resources between ingest, query, and data reorganization
(compaction, etc.). Thus the resource budget available for high-intensity
storage optimization is far more constrained.
However, in disaggregated architectures you have the flexibility to scale up
memory/CPU resources for the reorganization process (and turn them off when not
needed), which makes it more reasonable to throw an additional order of
magnitude of CPU/memory at the data reorganization problem (aka optimizing
parquet files).
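The tradeoff above can be sketched as a toy resource model. All numbers and
function names here are hypothetical illustrations, not measurements from any
real system:

```python
# Toy model (hypothetical numbers) contrasting a fixed "shared nothing"
# resource budget with a disaggregated one where reorganization capacity
# scales independently and can be turned off when idle.

MONOLITHIC_BUDGET = 16  # total CPU cores shared by all workloads


def monolithic_compaction_cores(ingest_cores: int, query_cores: int) -> int:
    """Compaction only gets whatever is left after ingest and query."""
    return max(MONOLITHIC_BUDGET - ingest_cores - query_cores, 0)


def disaggregated_compaction_cores(active: bool, scale: int = 10) -> int:
    """A separate reorganization fleet can scale up ~10x, then scale to zero."""
    return scale * MONOLITHIC_BUDGET if active else 0


# Under load, the monolith can spare only a couple of cores for optimization:
print(monolithic_compaction_cores(ingest_cores=8, query_cores=6))  # 2
# A disaggregated system can rent an order of magnitude more capacity:
print(disaggregated_compaction_cores(active=True))  # 160
# and pay nothing while optimization is not running:
print(disaggregated_compaction_cores(active=False))  # 0
```

The point of the sketch is only that the monolith's optimization budget is a
leftover, while the disaggregated system's is an independent dial.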
Not sure if that makes sense.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]