Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19810
> store_sales is a 1TB-sized table in cloud storage, which is partitioned
> by 'location'. The first query, Q1, wants to output several metrics A, B, C for
> all stores in all locations. After that, a small team of 3 data scientists
> wants to do some causal analysis for the sales in different locations. To avoid
> unnecessary I/O and parquet/orc parsing overhead, they want to cache the whole
> table in memory in Q1.
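For reference, that workflow would look roughly like the sketch below (the table name `store_sales` and the `location` partition column come from the quoted scenario; the metric column names and everything else are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, max, sum}

val spark = SparkSession.builder().appName("store_sales-cache-sketch").getOrCreate()

// Q1: read the partitioned table and cache it so that later queries skip the
// cloud-storage I/O and parquet/orc parsing. `metric_a/b/c` are placeholder columns.
val storeSales = spark.table("store_sales").cache()

val q1 = storeSales
  .groupBy("location")
  .agg(sum("metric_a"), avg("metric_b"), max("metric_c"))
q1.show()  // materializes the cache as a side effect

// Later per-location analysis reads from the in-memory cache instead of storage.
val oneLocation = storeSales.filter(col("location") === "NY")
oneLocation.count()
```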
Reading your use case, it sounds like you are trying to optimize the case
where the data is too large to fit in memory. In that case, if a partition is
on disk, Spark needs to load the entire partition into memory before filtering
blocks.
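Concretely, with the default storage level for `Dataset.cache()`/`persist()` (MEMORY_AND_DISK), the pattern being discussed looks roughly like this sketch (table and column names are assumed from the scenario above, not taken from the PR):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-spill-sketch").getOrCreate()

// Dataset.cache() defaults to MEMORY_AND_DISK, so cached partitions that do not
// fit in memory are spilled to local disk.
val cached = spark.table("store_sales").persist(StorageLevel.MEMORY_AND_DISK)
cached.count()  // materialize the cache

// A later filter on the cached data still has to read each spilled partition
// back into memory in full before its cached batches can be filtered.
val nySales = cached.filter(col("location") === "NY")
nySales.count()
```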
It sounds like something that could be handled better in third-party data
sources; otherwise we would need to change Spark core just for a better table
cache, which seems risky.