andygrove opened a new pull request, #1:
URL: https://github.com/apache/datafusion-java/pull/1
## Summary

Seed the project with a minimal end-to-end JNI binding from the JVM to Apache DataFusion, plus the build, format, and license-check tooling needed for ongoing contribution. Opening as a draft to invite design feedback before we cut a first release.

## What is in this PR

**Java surface** (`org.apache.datafusion`)

- `SessionContext`: `AutoCloseable` session. `sql(String)` returns a lazy `DataFrame`; `registerParquet(String, String)` registers a local Parquet file as a SQL table.
- `DataFrame`: lazy, `AutoCloseable`. `collect(BufferAllocator)` executes the plan and returns result batches as an Arrow `ArrowReader` via the [Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html). `collect()` consumes the DataFrame; `close()` releases the native plan if never collected and is idempotent.

**Native side** (`native/`, crate `datafusion-jni`)

- JNI entry points for `SessionContext` create/close/`registerParquet`/`createDataFrame` and `DataFrame` collect/close.
- Results are exported as `FFI_ArrowArrayStream`, so the JVM reads batches without per-row JNI crossings or row-by-row copies.

**Build and contributor tooling**

- `pom.xml` with Maven wrapper, JUnit 5, Arrow 19, JDK 17 toolchain.
- `apache-rat-plugin` (license-header check) and `spotless-maven-plugin` (`google-java-format`), both bound to the `verify` phase.
- `Makefile` targets for native build, JVM build, test, clean, and TPC-H SF1 test data generation via `tpchgen-cli`.
- GitHub Actions workflow running `spotless:check` and `cargo fmt --check` on push / PR to `main`.

## Project status

This is the first code drop into a brand-new repository. The README labels the project as early development: the API is small and will change without notice, and there is no published release.
A `Roadmap` section in the README outlines near-term priorities: session configuration, full `SessionContext`/`DataFrame` API parity with the Rust side, JVM-side plan construction via DataFusion's Protobuf representation, and Java-defined vectorized expressions over Arrow.

## Verification

Locally, on this branch:

- `./mvnw verify`: Java compile, unit tests (4 run, 0 fail, 1 skipped because TPC-H SF1 data is not generated), `spotless:check`, `apache-rat:check` (10 approved, 0 unapproved).
- `cargo fmt --all -- --check`: clean.
- `cargo clippy --all-targets --workspace -- -D warnings`: clean.

The optional TPC-H integration test runs after `make tpch-data` (requires `tpchgen-cli`); it reads `lineitem.parquet` via `registerParquet` and asserts `SELECT COUNT(*)` returns 6,001,215.
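
For reviewers weighing the `DataFrame` ownership rules listed under the Java surface (`collect()` consumes the DataFrame; `close()` releases the native plan only if it was never collected, and is idempotent), here is a minimal sketch of one way such a contract can be modeled in plain Java. It is not the code in this PR: `PlanHandle`, `releaseCount()`, and the fake pointer values are hypothetical names invented for illustration, and the "native free" is simulated with a counter rather than a JNI call.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical stand-in for a JVM object that owns a native plan pointer.
 * It only models the ownership rules; it is not the PR's DataFrame class.
 */
final class PlanHandle implements AutoCloseable {
    // 0 means "no native resource owned" (already collected or closed).
    private final AtomicLong handle;
    // Counts simulated native frees so the sketch can show at-most-once release.
    private final AtomicInteger releases = new AtomicInteger();

    PlanHandle(long nativePointer) {
        if (nativePointer == 0) {
            throw new IllegalArgumentException("null native pointer");
        }
        this.handle = new AtomicLong(nativePointer);
    }

    /** Hands the pointer to the caller; the handle no longer owns it afterwards. */
    long collect() {
        long ptr = handle.getAndSet(0);
        if (ptr == 0) {
            throw new IllegalStateException("already collected or closed");
        }
        // A real binding would pass ptr to a native method that takes ownership.
        return ptr;
    }

    /** Releases the plan only if it is still owned; safe to call repeatedly. */
    @Override
    public void close() {
        long ptr = handle.getAndSet(0);
        if (ptr != 0) {
            // A real binding would call a native free function here.
            releases.incrementAndGet();
        }
    }

    int releaseCount() {
        return releases.get();
    }

    public static void main(String[] args) {
        PlanHandle neverCollected = new PlanHandle(1L);
        neverCollected.close();
        neverCollected.close(); // second call is a no-op
        System.out.println(neverCollected.releaseCount()); // prints 1

        PlanHandle collected = new PlanHandle(2L);
        collected.collect();
        collected.close(); // nothing left to release
        System.out.println(collected.releaseCount()); // prints 0
    }
}
```

Swapping the pointer to zero atomically is what makes both `collect()` and `close()` single-shot in this model: whichever call runs first takes the pointer, and every later call sees zero. Whether a second `collect()` should throw, as it does here, or behave differently is a design choice this sketch assumes rather than one the PR description states.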
