Github user CodingCat commented on the issue:

    https://github.com/apache/spark/pull/19810
  
    @cloud-fan for this case, if the data has been spilled to disk or some 
non-local tasks are started, I/O is involved on top of the overhead of 
launching extra tasks. If all the data is in memory, only the task-launch 
overhead remains.
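    
    To make the two cases concrete, here is a minimal, hedged sketch (local 
mode, illustrative names only; this is not code from the PR):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel
    
    val spark = SparkSession.builder()
      .appName("cache-overhead-demo")
      .master("local[4]")
      .getOrCreate()
    
    val df = spark.range(0L, 10000000L).toDF("id")
    
    // MEMORY_AND_DISK: partitions that don't fit in memory are spilled, so
    // a later scan pays disk I/O on top of the task-launch overhead.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count() // first action materializes the cached blocks
    
    // Even when every block stays in memory and local, this second scan
    // still launches one task per cached partition -- the residual
    // task-launch overhead described above.
    df.count()
    
    spark.stop()
    ```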
    
    
    > It sounds like something can be done better in 3rd party data sources, or 
we need to change the Spark core just for a better table cache, which seems 
risky.
    
    Yes, some work can be done in 3rd-party data sources, e.g. avoiding the 
re-parsing overhead in Parquet.
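    
    As a hedged illustration of that Parquet re-parsing cost (self-contained, 
local mode, illustrative path):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder()
      .appName("parquet-reparse-demo")
      .master("local[2]")
      .getOrCreate()
    
    // Write a small Parquet file so the example is self-contained.
    spark.range(0L, 100000L).write.mode("overwrite").parquet("/tmp/demo_parquet")
    
    val df = spark.read.parquet("/tmp/demo_parquet")
    df.count() // decodes footers and pages
    df.count() // without a data-source-level cache, decodes them all again
    
    spark.stop()
    ```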
    
    Regarding the risk: in the current implementation, I directly modify the 
core to add a new block type and make it recognizable by the BlockManager. 
The new RDD and dependency implementations live in the SQL module. An 
alternative is to implement this new block type in SQL as well (but that 
needs some small refactoring to make the BlockManager open to code outside 
of Spark Core).
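    
    For a concrete picture of the current approach, here is a hedged sketch of 
what such a new block type might look like (`CachedRelationBlockId` and its 
fields are illustrative, not the PR's actual identifiers). Since `BlockId` is 
a sealed class in `org.apache.spark.storage`, a subclass has to live 
alongside it in core, which is exactly why the alternative of defining it in 
the SQL module needs the BlockManager side to be opened up:
    
    ```scala
    package org.apache.spark.storage
    
    // Hypothetical new block type, sitting next to RDDBlockId,
    // ShuffleBlockId, etc. in core. The BlockManager keys every stored
    // block by BlockId.name.
    case class CachedRelationBlockId(relationId: String, partitionId: Int)
      extends BlockId {
      // Must be globally unique across all block types.
      override def name: String =
        s"cached_relation_${relationId}_$partitionId"
    }
    
    // Note: BlockId.apply, which parses a name string back into a BlockId,
    // would presumably also need a case for the new name format.
    ```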
    
    I personally think it's a good feature to add, with little risk to the 
existing code.

