pchintar opened a new pull request, #4591:
URL: https://github.com/apache/datafusion-comet/pull/4591
## Which issue does this PR close?
Closes #2391.
## Rationale for this change
Comet currently has limited support for Spark's in-memory cache.
When a table is cached and later read, the cached data cannot be consumed
directly by Comet operators. Instead, the execution plan falls back to Spark's
cache scan path and introduces an additional `CometSparkColumnarToColumnar`
conversion before execution can continue in Comet.
This extra conversion adds overhead to cached table scans and prevents
cached data from remaining on a native Comet execution path.
This PR adds native support for in-memory cached tables so that cached data
written in a Comet-compatible format can be read directly by Comet operators.
The implementation follows the approach discussed in #2391 by introducing a
dedicated `CometInMemoryTableScanExec` together with a custom cache serializer.
This allows cached data to be stored and retrieved in a Comet-compatible format
while preserving Spark's existing cache management and fallback behavior.
## What changes are included in this PR?
This PR introduces a native cache path for in-memory cached tables behind a
new configuration:
`spark.comet.exec.inMemoryCache.enabled`
When enabled:
* Cached data is stored using a Comet-specific cache serializer.
* Cached data is represented as `CometCachedBatch`.
* Cached tables are scanned using `CometInMemoryTableScanExec`.
* Cached data can be consumed directly by Comet operators without
  introducing a `CometSparkColumnarToColumnar` conversion.
When disabled:
* Spark's existing cache serializer continues to be used.
* Existing cache scan behavior is preserved.
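As a rough illustration of how the new flag might be switched on, the sketch below passes it to `spark-shell` at launch alongside the usual Comet plugin settings. Only `spark.comet.exec.inMemoryCache.enabled` comes from this PR; `COMET_JAR` is a placeholder for a locally built Comet jar, and the remaining flags are assumed from a typical Comet setup. The flag is set at startup here because this description does not say whether it can be toggled at runtime.

```bash
# Sketch only: COMET_JAR is a placeholder path, not something this PR defines.
COMET_JAR=/path/to/comet-spark.jar

# Launch a Spark shell with Comet loaded and the new cache path enabled.
# spark.comet.exec.inMemoryCache.enabled is the configuration added by this PR;
# the other settings are assumed from a typical Comet installation.
"$SPARK_HOME"/bin/spark-shell \
  --jars "$COMET_JAR" \
  --conf spark.driver.extraClassPath="$COMET_JAR" \
  --conf spark.executor.extraClassPath="$COMET_JAR" \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.exec.inMemoryCache.enabled=true
```

With the flag left unset or set to `false`, the same session would be expected to use Spark's existing cache serializer and scan path, as described above.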
## How are these changes tested?
Added `CometInMemoryCacheSuite` covering:
* Comet-native cache scan over `CometCachedBatch`
* Fallback behavior when native cache support is disabled
* Multi-partition cached tables
* Empty cached tables
* Projection-only cache reads
* Shuffle execution after cached table scans
Verified with:
```bash
./mvnw -pl spark -DskipTests test-compile
./mvnw test -pl spark \
-DwildcardSuites=org.apache.comet.exec.CometInMemoryCacheSuite
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]