andygrove opened a new pull request, #1:
URL: https://github.com/apache/datafusion-java/pull/1

   ## Summary
   
   Seed the project with a minimal end-to-end JNI binding from the JVM to 
Apache DataFusion, plus the build, format, and license-check tooling needed for 
ongoing contribution.
   
   Opening as a draft to invite design feedback before we cut a first release.
   
   ## What is in this PR
   
   **Java surface** (`org.apache.datafusion`)
   
   - `SessionContext` — `AutoCloseable` session. `sql(String)` returns a lazy 
`DataFrame`; `registerParquet(String, String)` registers a local Parquet file 
as a SQL table.
   - `DataFrame` — lazy, `AutoCloseable`. `collect(BufferAllocator)` executes 
the plan and returns result batches as an Arrow `ArrowReader` via the [Arrow C 
Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html). 
`collect()` consumes the DataFrame; `close()` releases the native plan if never 
collected and is idempotent.
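   
The lifecycle rules above (`collect()` consumes the handle; `close()` releases the native plan only if it was never collected, and is safe to call twice) can be illustrated with a small sketch. It is self-contained and involves no JNI. `LazyHandleSketch`, `LazyHandle`, `nativePtr`, and `RELEASES` are hypothetical names invented for illustration, not classes or fields from this PR. The exception thrown on a second `collect()` and the idea that ownership moves to the result are assumptions as well.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a native-backed lazy handle; not the PR's DataFrame. */
public class LazyHandleSketch {

    /** Counts how many times the simulated native plan was released. */
    static final AtomicLong RELEASES = new AtomicLong();

    static final class LazyHandle implements AutoCloseable {
        // Non-zero while this object still owns the simulated native plan.
        private long nativePtr;

        LazyHandle(long nativePtr) {
            this.nativePtr = nativePtr;
        }

        /** Simulates execution; assumes ownership of the plan moves to the result. */
        String collect() {
            if (nativePtr == 0) {
                throw new IllegalStateException("handle already collected or closed");
            }
            long ptr = nativePtr;
            nativePtr = 0; // consumed: a later close() must not release it again
            return "result-of-plan-" + ptr;
        }

        /** Releases the plan only if it is still owned; repeated calls are no-ops. */
        @Override
        public void close() {
            if (nativePtr != 0) {
                RELEASES.incrementAndGet(); // stands in for a native free call
                nativePtr = 0;
            }
        }
    }

    public static void main(String[] args) {
        // Never collected: the first close() releases, the second does nothing.
        LazyHandle unused = new LazyHandle(1L);
        unused.close();
        unused.close();

        // Collected: close() via try-with-resources has nothing left to release.
        try (LazyHandle used = new LazyHandle(2L)) {
            System.out.println(used.collect());
        }

        System.out.println("releases = " + RELEASES.get());
    }
}
```

Zeroing the pointer in both `collect()` and `close()` is what makes try-with-resources safe around a handle that has already been collected: the implicit `close()` finds nothing left to free.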
   
   **Native side** (`native/`, crate `datafusion-jni`)
   
   - JNI entry points for `SessionContext` 
create/close/`registerParquet`/`createDataFrame` and `DataFrame` collect/close.
   - Results are exported as `FFI_ArrowArrayStream`, so the JVM reads batches 
without per-row JNI crossings or row-by-row copies.
   
   **Build and contributor tooling**
   
   - `pom.xml` with Maven wrapper, JUnit 5, Arrow 19, JDK 17 toolchain.
   - `apache-rat-plugin` (license-header check) and `spotless-maven-plugin` 
(`google-java-format`), both bound to the `verify` phase.
   - `Makefile` targets for native build, JVM build, test, clean, and TPC-H SF1 
test data generation via `tpchgen-cli`.
   - GitHub Actions workflow running `spotless:check` and `cargo fmt --check` 
on push / PR to `main`.
   
   ## Project status
   
   This is the first code drop into a brand-new repository. The README labels 
the project as early development: the API is small and will change without 
notice, and there is no published release.
   
   A `Roadmap` section in the README outlines near-term priorities: session 
configuration, full `SessionContext`/`DataFrame` API parity with the Rust side, 
JVM-side plan construction via DataFusion's Protobuf representation, and 
Java-defined vectorized expressions over Arrow.
   
   ## Verification
   
   Locally, on this branch:
   
   - `./mvnw verify` — Java compile, unit tests (4 run, 0 fail, 1 skipped 
because TPC-H SF1 data is not generated), `spotless:check`, `apache-rat:check` 
(10 approved, 0 unapproved).
   - `cargo fmt --all -- --check` — clean.
   - `cargo clippy --all-targets --workspace -- -D warnings` — clean.
   
   The optional TPC-H integration test runs after `make tpch-data` (requires 
`tpchgen-cli`); it reads `lineitem.parquet` via `registerParquet` and asserts 
`SELECT COUNT(*)` returns 6,001,215.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

