LantaoJin opened a new issue, #74:
URL: https://github.com/apache/datafusion-java/issues/74
### Is your feature request related to a problem or challenge?
The Rust API to enable any of DataFusion's built-in caches (file metadata, list-files, file statistics) is
`RuntimeEnvBuilder::with_cache_manager(CacheManagerConfig)`. The Java binding
exposes none of it — every `SessionContext` ends up with the no-op defaults
(metadata-cache enabled but at 0 size, list-files and stats caches disabled).
For a parquet workload reading the same footer thousands of times across
queries, that means every footer read goes back to the object store. For a
long-lived statistics-driven planner, no stats persist across queries.
There is no Java surface today to plug in or even just turn on the built-in
caches.
### Describe the solution you'd like
Add a `cacheManager(CacheManagerOptions)` setter to `SessionContextBuilder`.
The options object mirrors upstream `CacheManagerConfig` 1:1 — three
independent toggles, each with the same knobs upstream exposes:
```java
SessionContext ctx = SessionContext.builder()
    .cacheManager(CacheManagerOptions.builder()
        .fileMetadataCache(64L << 20)                     // 64 MiB cap
        .listFilesCache(8L << 20, Duration.ofMinutes(5))  // 8 MiB cap, 5 min TTL
        .fileStatisticsCache(true)
        .build())
    .build();
```
Semantics: three independent toggles, all matching upstream DataFusion:
| Field                           | Java unset → Rust behaviour                       |
|---------------------------------|---------------------------------------------------|
| `fileMetadataCache(maxBytes)`   | leave `metadata_cache_limit` at upstream default  |
| `listFilesCache(maxBytes, ttl)` | leave `list_files_cache = None` (disabled)        |
| `fileStatisticsCache(enabled)`  | leave `table_files_statistics_cache = None`       |
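
One way the "unset means leave the upstream default alone" rule could be modelled on the Java side is sketched below. This is only an illustration: `CacheManagerOptions` does not exist yet, and the accessor names, the use of `OptionalLong`/`Optional`, and the non-negative check are all assumptions rather than part of the proposal. The native layer would only call the matching `CacheManagerConfig` setter when a value is present.

```java
import java.time.Duration;
import java.util.Objects;
import java.util.Optional;
import java.util.OptionalLong;

/** Hypothetical options holder; names follow the proposal above, not a released API. */
final class CacheManagerOptions {
    private final OptionalLong fileMetadataCacheBytes;
    private final OptionalLong listFilesCacheBytes;
    private final Optional<Duration> listFilesCacheTtl;
    private final boolean fileStatisticsCache;

    private CacheManagerOptions(Builder b) {
        this.fileMetadataCacheBytes = b.fileMetadataCacheBytes;
        this.listFilesCacheBytes = b.listFilesCacheBytes;
        this.listFilesCacheTtl = b.listFilesCacheTtl;
        this.fileStatisticsCache = b.fileStatisticsCache;
    }

    static Builder builder() {
        return new Builder();
    }

    /** Empty means: do not touch the upstream metadata cache limit. */
    OptionalLong fileMetadataCacheBytes() { return fileMetadataCacheBytes; }

    /** Empty means: leave the list-files cache disabled. */
    OptionalLong listFilesCacheBytes() { return listFilesCacheBytes; }

    Optional<Duration> listFilesCacheTtl() { return listFilesCacheTtl; }

    /** False means: leave the file statistics cache unset. */
    boolean fileStatisticsCache() { return fileStatisticsCache; }

    static final class Builder {
        // Every knob starts out unset, so an empty builder changes nothing upstream.
        private OptionalLong fileMetadataCacheBytes = OptionalLong.empty();
        private OptionalLong listFilesCacheBytes = OptionalLong.empty();
        private Optional<Duration> listFilesCacheTtl = Optional.empty();
        private boolean fileStatisticsCache = false;

        Builder fileMetadataCache(long maxBytes) {
            this.fileMetadataCacheBytes = OptionalLong.of(requireNonNegative(maxBytes));
            return this;
        }

        Builder listFilesCache(long maxBytes, Duration ttl) {
            Objects.requireNonNull(ttl, "ttl");
            this.listFilesCacheBytes = OptionalLong.of(requireNonNegative(maxBytes));
            this.listFilesCacheTtl = Optional.of(ttl);
            return this;
        }

        Builder fileStatisticsCache(boolean enabled) {
            this.fileStatisticsCache = enabled;
            return this;
        }

        CacheManagerOptions build() {
            return new CacheManagerOptions(this);
        }

        private static long requireNonNegative(long maxBytes) {
            if (maxBytes < 0) {
                throw new IllegalArgumentException("maxBytes must be >= 0");
            }
            return maxBytes;
        }
    }
}
```

With this shape, `CacheManagerOptions.builder().build()` carries no values at all, which corresponds to every row of the table taking its "unset" branch.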
### Describe alternatives you've considered
**Java SPI for cache implementations** —
`org.apache.datafusion.cache.{FileMetadataCache, ListFilesCache,
FileStatisticsCache}` interfaces that callers implement in Java, with arbitrary
upcalls into Java for every cache `get`/`put`. **Rejected for v1.** A
`FileMetadataCache::get` runs once per parquet file in a scan; routing those
through JNI upcalls turns a hot path into a slow path. The upstream Rust traits
are easy to implement; embedders that want a custom cache (LRU on disk,
Foyer-backed, network-replicated) can build it Rust-side and ship a fork until
the day the cost-benefit of a Java SPI flips. The `Out of scope` note in
`CONTRIB_ISSUES.md` for #13 ("Pluggable `CacheManager` *implementations*
written in Java") makes this explicit.
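
For reference, the rejected SPI might have looked roughly like the sketch below. Only the interface name `FileMetadataCache` and the `get`/`put` operations come from the text above; the method signatures, the `byte[]` payload, and the `InMemoryFileMetadataCache` class are assumptions made purely for illustration. In a real binding, each `get` and `put` would be invoked from Rust through a JNI upcall, which is the per-file cost that motivated the rejection.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical shape of the rejected SPI; signatures are assumptions. */
interface FileMetadataCache {
    /** Would be called from native code once per parquet file in a scan. */
    Optional<byte[]> get(String path);

    /** Would be called from native code after a footer has been read. */
    void put(String path, byte[] metadata);
}

/** Trivial in-memory implementation, for illustration only. */
final class InMemoryFileMetadataCache implements FileMetadataCache {
    private final ConcurrentMap<String, byte[]> entries = new ConcurrentHashMap<>();

    @Override
    public Optional<byte[]> get(String path) {
        return Optional.ofNullable(entries.get(path));
    }

    @Override
    public void put(String path, byte[] metadata) {
        // Defensive copy so later caller-side mutation cannot corrupt the cache.
        entries.put(path, metadata.clone());
    }
}
```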
### Additional context
_No response_