Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19810
  
    > store_sales is a 1TB-sized table in cloud storage, which is partitioned 
by 'location'. The first query, Q1, wants to output several metrics A, B, C for 
all stores in all locations. After that, a small team of 3 data scientists 
wants to do some causal analysis for the sales in different locations. To avoid 
unnecessary I/O and parquet/orc parsing overhead, they want to cache the whole 
table in memory in Q1.
    
    Reading your use case, it sounds like you are trying to optimize the case
where the data is too large to fit in memory. In that case, if a partition is
on disk, Spark needs to load the entire partition into memory before it can
filter blocks.
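
    To make the quoted scenario concrete, here is a minimal sketch (only the
`store_sales` table and the `location` partition column come from the quote;
the app name, aggregation, and filter value are placeholders):

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, count}
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("store-sales-cache").getOrCreate()

    // store_sales lives in cloud storage, partitioned by 'location'.
    val storeSales = spark.table("store_sales")

    // Cache the whole table; with MEMORY_AND_DISK, partitions that do not
    // fit in memory spill to disk.
    storeSales.persist(StorageLevel.MEMORY_AND_DISK)

    // Q1: metrics over all locations -- this materializes the cache.
    storeSales.groupBy("location").agg(count("*")).show()

    // Later per-location analysis reads from the cache instead of cloud
    // storage, but a cached partition that was spilled to disk is loaded
    // back in full before its blocks are filtered.
    storeSales.filter(col("location") === "some_location").count()
    ```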
    
    It sounds like this could be handled better by 3rd-party data sources;
otherwise we would need to change Spark core just for a better table cache,
which seems risky.

