LantaoJin opened a new issue, #70:
URL: https://github.com/apache/datafusion-java/issues/70

   ### Is your feature request related to a problem or challenge?
   
   `SessionContext.registerParquet(name, path)` (and the read/register
   counterparts for CSV, NDJSON, Arrow, Avro) accept arbitrary path strings,
   but there is no Java surface to attach an `object_store::ObjectStore`
   implementation to a URL scheme + bucket. As a result, today the only
   remote-storage paths that work from `datafusion-java` are the ones the
   default `RuntimeEnv` resolves from process-level environment variables.
   There is no way to:
   
   - Pass S3 access key / secret / session token / region / endpoint per
     context (so multi-tenant Java apps cannot give two contexts different
     buckets or different credentials in the same JVM).
   - Use anything other than the AWS-SDK env-var defaults (no GCS, no
     Azure Blob, no plain HTTP-listing, no MinIO with a custom endpoint).
   - Re-point an `s3://` URL at a different region / endpoint without
     process-wide env mutation.
   
   Concretely, the following fails today even with valid AWS env vars set,
   because no S3 store is registered with the runtime:
   
   ```java
   ctx.registerParquet("orders", "s3://my-bucket/orders/2026-05/");
   // RuntimeException: No suitable object store found for s3://my-bucket/orders/2026-05/
   ```
   
   DataFusion's Rust `RuntimeEnv::register_object_store(url, store)` already
   solves this end of the problem; the gap is purely in the Java surface
   above the JNI line.
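
   To make the lookup failure concrete, here is a toy model of a
   per-context registry. It is a sketch under assumptions, not the
   datafusion-java API: `ToyStoreRegistry` and its methods are hypothetical
   names, the "store" is just a label string, and the key is assumed to be
   the URL's scheme plus authority (the bucket, for `s3://` URLs).

   ```java
   import java.net.URI;
   import java.util.HashMap;
   import java.util.Map;

   // Toy model only: hypothetical names, not part of datafusion-java.
   final class ToyStoreRegistry {
       // Maps "scheme://authority" to a label standing in for a store.
       private final Map<String, String> stores = new HashMap<>();

       // Reduce a URL or path to its "scheme://authority" lookup key.
       static String key(String url) {
           URI u = URI.create(url);
           String authority = u.getAuthority() == null ? "" : u.getAuthority();
           return u.getScheme() + "://" + authority;
       }

       void register(String url, String storeLabel) {
           stores.put(key(url), storeLabel);
       }

       // Fails when nothing was registered for the path's scheme + bucket.
       String resolve(String path) {
           String store = stores.get(key(path));
           if (store == null) {
               throw new IllegalStateException(
                   "No suitable object store found for " + path);
           }
           return store;
       }
   }
   ```

   Two `ToyStoreRegistry` instances can each map `s3://my-bucket` to a
   different label, which is the per-context isolation this issue asks for;
   a single process-wide registry cannot.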
   
   ### Describe the solution you'd like
   
   A typed registration API at construction time, on the existing
   `SessionContextBuilder`. Stores are registered before `SessionContext` is
   returned; the registration travels through the same
   `session_options.proto` byte channel that the rest of the builder uses,
   so no new JNI signature is needed.
   
   ```java
   SessionContext ctx = SessionContext.builder()
       .registerObjectStore(ObjectStoreOptions.s3()
           .bucket("my-bucket")
           .region("us-east-1")
           .accessKeyId("...")
           .secretAccessKey("...")
           .build())
       .registerObjectStore(ObjectStoreOptions.s3()
           .bucket("other-bucket")
           .region("eu-west-1")
           .endpoint("https://minio.internal:9000")
           .allowHttp(true)
           .build())
       .build();
   
   ctx.registerParquet("orders", "s3://my-bucket/orders/");
   ctx.registerParquet("audit", "s3://other-bucket/audit/");
   ```
   
   `ObjectStoreOptions` is a sealed-style hierarchy with one concrete
   factory per backend:
   
   - `ObjectStoreOptions.s3()` — `AmazonS3` (also covers MinIO / R2 / any
     S3-compatible endpoint via `endpoint(...)` + `allowHttp(...)`).
   - `ObjectStoreOptions.gcs()` — `GoogleCloudStorage`.
   - `ObjectStoreOptions.azure()` — `MicrosoftAzure` Blob Storage.
   - `ObjectStoreOptions.http()` — listing-capable HTTP store.
   
   For a v1 the four above are the natural set: they're the four that
   upstream `object_store` exposes as first-class, and they cover essentially
   every reported `s3://` / `gs://` / `az://` / `https://` use case I've
   seen in datafusion-java issues.
   
   Each builder maps 1:1 to the corresponding `object_store` Rust builder
   fields (`AmazonS3Builder`, `GoogleCloudStorageBuilder`,
   `MicrosoftAzureBuilder`, `HttpBuilder`); the JNI side decodes the proto
   once and constructs the store with `.build()`.
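
   A minimal sketch of what "sealed-style" could look like for the S3 case,
   assuming a private-constructor base class so the hierarchy stays closed
   on JDKs without `sealed`. `StoreOptionsSketch` and every member below are
   hypothetical; only the setter names mirror the example above, and the
   proto encoding is left out.

   ```java
   // Hypothetical sketch only: names are illustrative, not the proposed API.
   abstract class StoreOptionsSketch {
       // Private constructor: only the nested classes below can extend this
       // type, which keeps the hierarchy closed without Java 17 `sealed`.
       private StoreOptionsSketch() {}

       // URL scheme the store would be registered under.
       abstract String scheme();

       static S3.Builder s3() {
           return new S3.Builder();
       }

       static final class S3 extends StoreOptionsSketch {
           final String bucket;
           final String region;
           final String endpoint;   // null means "use the backend default"
           final boolean allowHttp;

           private S3(Builder b) {
               this.bucket = b.bucket;
               this.region = b.region;
               this.endpoint = b.endpoint;
               this.allowHttp = b.allowHttp;
           }

           @Override
           String scheme() {
               return "s3";
           }

           static final class Builder {
               private String bucket;
               private String region;
               private String endpoint;
               private boolean allowHttp;

               private Builder() {}

               Builder bucket(String v) { this.bucket = v; return this; }
               Builder region(String v) { this.region = v; return this; }
               Builder endpoint(String v) { this.endpoint = v; return this; }
               Builder allowHttp(boolean v) { this.allowHttp = v; return this; }

               // Validate required fields before handing out an immutable value.
               S3 build() {
                   if (bucket == null || bucket.isEmpty()) {
                       throw new IllegalStateException("bucket is required");
                   }
                   return new S3(this);
               }
           }
       }
   }
   ```

   The GCS, Azure, and HTTP variants would follow the same shape, each
   with the fields of its own Rust builder.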
   
   The URL that DataFusion uses to look up the store is derived from the
   options — for S3 it's `s3://<bucket>` (matching how
   `AmazonS3Builder::with_bucket_name(b).build()` is registered). Callers
   who want a non-default scheme (e.g. `s3a://`) can opt in via an explicit
   `url(...)` setter.
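
   The derivation rule can be sketched as follows. This is an illustration
   under assumptions: `S3StoreUrl` and `registrationUrl()` are hypothetical
   names, and the explicit URL stands in for whatever the `url(...)` setter
   would carry.

   ```java
   import java.util.Optional;

   // Hypothetical helper, not part of datafusion-java.
   final class S3StoreUrl {
       private final String bucket;
       private final Optional<String> explicitUrl;

       // explicitUrl may be null, meaning "derive the URL from the bucket".
       S3StoreUrl(String bucket, String explicitUrl) {
           if (bucket == null || bucket.isEmpty()) {
               throw new IllegalArgumentException("bucket is required");
           }
           this.bucket = bucket;
           this.explicitUrl = Optional.ofNullable(explicitUrl);
       }

       // An explicit URL wins; otherwise fall back to s3://<bucket>.
       String registrationUrl() {
           return explicitUrl.orElse("s3://" + bucket);
       }
   }
   ```

   With these assumptions, `new S3StoreUrl("my-bucket", null)` yields
   `s3://my-bucket`, while passing `"s3a://my-bucket"` as the second
   argument registers the same bucket under the `s3a` scheme instead.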
   
   ### Describe alternatives you've considered
   
   **A free-form `Map<String,String>` setter.** Easier on the API surface
   but loses every type-safety / discoverability benefit. The four
   `ObjectStoreOptions` factories are mostly mechanical — once one is
   written, the rest follow the same shape — so the cost is small.
   
   **Exposing a Java `ObjectStore` SPI** so callers can implement their own
   backend in Java. Out of scope for v1: every `get`/`list`/`put`/`delete`
   becomes a JNI upcall, and the request rate of those calls (one per
   Parquet footer, plus one per row group) puts those upcalls on a hot
   path. The right shape there is a separate issue once anyone reports a
   real need; for now, embedders that want a custom backend have the same
   options Rust users do (build their own `ObjectStore` impl in Rust and
   ship a fork).
   
   **Process-level singletons via `RuntimeEnv` builder.** Doesn't scale to
   multi-tenant JVMs that want different credentials per context. The
   proposed API already supports the singleton case (one builder, one
   context, one shared registration) without forcing it.
   
   ### Additional context
   
   The cloud backends are heavy dependencies, so the `datafusion-jni` crate
   should expose them behind opt-in Cargo features:
   
   ```toml
   [features]
   default = []
   object-store-aws   = ["object_store/aws"]
   object-store-gcp   = ["object_store/gcp"]
   object-store-azure = ["object_store/azure"]
   object-store-http  = ["object_store/http"]
   ```
   
   The Java side always *compiles* the four `ObjectStoreOptions.*` classes;
   the native side panics with a clear error if the corresponding feature
   is not built in. Default `make test` builds with all four enabled (so CI
   covers them); a slimmer downstream build (just `object-store-aws`, say)
   is supported but trips an explicit error from the JNI layer if a caller
   tries to register a backend that isn't compiled in.
   
   This matches PR #60's pattern for the `avro` feature: the Cargo feature
   is opt-in on `object_store`, but always enabled in our default build so
   that Java callers can rely on backends being present without juggling
   features through Maven.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
