Github user CodingCat commented on the issue:
https://github.com/apache/spark/pull/19810
@cloud-fan for this case, if the data has been dumped to disk or some
non-local tasks are started, I/O is involved in addition to the overhead of
starting extra tasks. If all of the data is in memory, only the task-launching
overhead remains.
> It sounds like something can be done better in 3rd party data sources, or
> we need to change the Spark core just for a better table cache, which seems
> risky.
Yes, some work can be done in 3rd-party data sources, e.g. avoiding the
parsing overhead in Parquet.
Regarding the risk: in the current implementation I modify the core directly
to add a new block type and make it recognizable by BlockManager, while the new
RDD and dependency implementations live in the SQL module. An alternative is to
implement the new block type in SQL as well (but that needs some small
refactoring to make BlockManager open to block types defined outside of Spark
Core). A rough sketch of what such a block type might look like is below.
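For illustration only, here is a minimal sketch of what registering a new block
type in core might look like. The class and field names (`CachedRelationBlockId`,
`relationId`, `partitionId`) are hypothetical and not taken from this PR; the
assumption is that, as in current Spark, `BlockId` is a sealed class in
`org.apache.spark.storage`, so the new subclass has to sit next to
`RDDBlockId`, `ShuffleBlockId`, etc. in core.

```scala
// Hypothetical sketch, not the actual code from this PR.
// Because BlockId is a sealed class in Spark core, a new block type has to be
// declared alongside RDDBlockId, ShuffleBlockId, etc. in
// core/src/main/scala/org/apache/spark/storage/BlockId.scala; this is the
// "modify the core directly" option described above.
case class CachedRelationBlockId(relationId: Long, partitionId: Int) extends BlockId {
  // BlockManager keys its memory/disk stores on this name, so it must be
  // unique per block on an executor.
  override def name: String = s"cached_relation_${relationId}_$partitionId"
}
```

On the SQL side, the new RDD's compute() could then fetch or populate these
blocks through the local BlockManager (for example via getOrElseUpdate). The
second option's refactoring question is exactly about whether such a block
type could instead be declared in the SQL module without core having to know
about it.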
I personally think it's a good feature to add, with relatively little risk to
existing code.