This is an automated email from the ASF dual-hosted git repository.
andygrove pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/main by this push:
new 628d580 docs: add blog post introducing DataFusion Java 0.1.0 (#189)
628d580 is described below
commit 628d580e67d2242a6c3d5e1c412282b30ef5824e
Author: Andy Grove <[email protected]>
AuthorDate: Thu May 28 09:32:46 2026 -0600
docs: add blog post introducing DataFusion Java 0.1.0 (#189)
---
content/blog/2026-05-26-datafusion-java-0.1.0.md | 211 +++++++++++++++++++++++
1 file changed, 211 insertions(+)
diff --git a/content/blog/2026-05-26-datafusion-java-0.1.0.md b/content/blog/2026-05-26-datafusion-java-0.1.0.md
new file mode 100644
index 0000000..2b37088
--- /dev/null
+++ b/content/blog/2026-05-26-datafusion-java-0.1.0.md
@@ -0,0 +1,211 @@
+---
+layout: post
+title: Introducing Apache DataFusion Java 0.1.0
+date: 2026-05-26
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+We are pleased to announce the [0.1.0] release of [Apache DataFusion Java], the first release
+of the project. DataFusion Java is a thin Java binding over the [Apache DataFusion] query engine:
+SQL and DataFrame queries are planned and executed in native Rust, and results return to the JVM
+as [Apache Arrow] record batches through the [Arrow C Data Interface].
+
+[0.1.0]: https://github.com/apache/datafusion-java/releases/tag/0.1.0
+[Apache DataFusion Java]: https://github.com/apache/datafusion-java
+[Apache DataFusion]: https://datafusion.apache.org
+[Apache Arrow]: https://arrow.apache.org
+[Arrow C Data Interface]: https://arrow.apache.org/docs/format/CDataInterface.html
+
+This post focuses on what the project is and why it exists. The [changelog] has the full list of
+features that landed in 0.1.0.
+
+[changelog]: https://github.com/apache/datafusion-java/blob/0.1.0/CHANGELOG.md
+
+## Why DataFusion Java
+
+DataFusion is already accessible from Rust, Python, and (via [Comet]) Spark. JVM applications that
+want a fast, embeddable columnar query engine have so far had to either pull in Spark, which is a
+heavy dependency for an embedded use case, or write their own JNI layer around DataFusion. The
+goal of DataFusion Java is to make DataFusion a first-class JVM library: add it to your `pom.xml`,
+construct a `SessionContext`, and run queries.
+
+[Comet]: https://datafusion.apache.org/comet/
+
+Typical use cases:
+
+- **Embedded analytics** inside JVM services that need to run SQL or DataFrame queries over local
+  files or object storage without standing up a separate query system.
+- **Tools and frontends** that compile a higher-level surface language to DataFusion logical plans
+  and need a JVM-side executor.
+- **Spark and Hadoop ecosystem integrations** that already speak Arrow and want native columnar
+  execution from Java or Scala code without going through Spark.
+
+Because results cross the JNI boundary as Arrow C Data Interface streams, there is no row-by-row
+marshalling or extra copy on the hot path: the JVM reads the same Arrow buffers DataFusion wrote.
+
+## A first look
+
+```java
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.ipc.ArrowReader;
+import org.apache.datafusion.DataFrame;
+import org.apache.datafusion.SessionContext;
+
+try (var allocator = new RootAllocator();
+     var ctx = new SessionContext()) {
+
+    ctx.registerParquet("orders", "/path/to/orders.parquet");
+
+    try (DataFrame df = ctx.sql(
+             "SELECT o_orderpriority, COUNT(*) AS n " +
+             "FROM orders GROUP BY o_orderpriority");
+         ArrowReader reader = df.collect(allocator)) {
+        while (reader.loadNextBatch()) {
+            var batch = reader.getVectorSchemaRoot();
+            // process batch...
+        }
+    }
+}
+```
+
+`SessionContext` is the entry point: it owns the catalog of registered tables and the query
+planner. `SessionContext` and `DataFrame` are `AutoCloseable`, so `try`-with-resources releases
+native resources and Arrow buffers automatically.
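The nesting in the example above relies on a general Java rule: `try`-with-resources closes resources in the reverse order of their declaration, so the reader and the `DataFrame` are released before the context, and the allocator goes last. The following is a minimal, library-free sketch of that rule. `TrackedResource` and `CloseOrderDemo` are hypothetical stand-ins introduced for illustration only; they are not DataFusion Java or Arrow classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a native-backed resource; not a DataFusion Java class. */
final class TrackedResource implements AutoCloseable {
    private final String name;
    private final List<String> log;

    TrackedResource(String name, List<String> log) {
        this.name = name;
        this.log = log;
        log.add("open " + name);
    }

    @Override
    public void close() {
        log.add("close " + name);
    }
}

final class CloseOrderDemo {
    /** Opens resources with the same nesting as the example above and records their lifecycle. */
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (TrackedResource allocator = new TrackedResource("allocator", log);
             TrackedResource ctx = new TrackedResource("ctx", log)) {
            try (TrackedResource df = new TrackedResource("df", log);
                 TrackedResource reader = new TrackedResource("reader", log)) {
                log.add("read batches");
            }
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In the recorded log, `reader` closes first and `allocator` last, which is why declaring the allocator in the outermost `try` keeps it alive until everything that borrowed from it has been released.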
+
+## What 0.1.0 includes
+
+The 0.1.0 surface is intentionally small but covers what most embedded analytics use cases need.
+
+### Reading data
+
+- **Parquet, CSV, JSON, Avro, and Arrow IPC** via `registerParquet` / `readParquet` (and the
+  equivalent entry points for the other formats), with format-specific read options.
+- **Object store backends** registered on `SessionContextBuilder` so queries can read directly
+  from S3, GCS, or other object stores.
+- **Java table providers.** `SessionContext.registerTable(name, provider)` exposes a
+  Java-implemented `TableProvider`. The framework calls back into the Java side per query to fetch
+  Arrow batches, with data flowing to native code over the Arrow C Data Interface.
+
+### Querying
+
+- **SQL** via `ctx.sql(String)` and a **DataFrame API** with `select`, `filter`, `limit`,
+  `distinct`, `sort`, `repartition`, `join` / `joinOn`, `dropColumns`, and `withColumnRenamed`.
+- **Introspection** methods on `DataFrame`: `schema`, `explain`, `cache`, `describe`.
+- **Streaming execution** via `df.executeStream(allocator)` for pulling batches incrementally
+  without materializing the full result set in memory.
+- **Logical plans from `datafusion-proto`** via `SessionContext.fromProto(byte[])`, useful for
+  cross-tool interop and for tools that compile their own surface language to DataFusion plans.
+  The protobuf Java classes are generated from pinned `.proto` files at the matching upstream
+  DataFusion tag.
+
+### Writing data
+
+- `DataFrame.writeCsv` and `DataFrame.writeJson` with format-specific options.
+
+### Extending DataFusion from Java
+
+- **Scalar UDFs.** Implement the `ScalarFunction` interface to add a vectorised, Arrow-native SQL
+  function written in Java. The implementation declares its SQL name, argument fields, return
+  field, and volatility, and supplies a per-batch `evaluate` body that reads input Arrow vectors
+  and returns a result vector or a broadcast scalar.
+
+```java
+public final class AddOne implements ScalarFunction {
+    private static final ArrowType INT32 = new ArrowType.Int(32, true); // signed 32-bit integer
+
+    @Override public String name() { return "add_one"; }
+    @Override public List<Field> argFields() { return List.of(Field.nullable("x", INT32)); }
+    @Override public Field returnField() { return Field.nullable("y", INT32); }
+    @Override public Volatility volatility() { return Volatility.IMMUTABLE; }
+
+    @Override
+    public ColumnarValue evaluate(BufferAllocator allocator, ScalarFunctionArgs args) {
+        IntVector in = (IntVector) args.args().get(0).vector();
+        IntVector out = new IntVector("add_one", allocator);
+        out.allocateNew(in.getValueCount()); // size the output to match the input batch
+        for (int i = 0; i < in.getValueCount(); i++) {
+            if (in.isNull(i)) out.setNull(i); // propagate SQL NULL
+            else out.set(i, in.get(i) + 1);
+        }
+        out.setValueCount(in.getValueCount());
+        return ColumnarValue.array(out);
+    }
+}
+
+ctx.registerUdf(new ScalarUdf(new AddOne()));
+```
+
+### Operational details
+
+- **JDK 17 or newer.** Set `JAVA_HOME` to point at it.
+- **Bundled native library.** The published JAR contains pre-built native libraries for the
+  supported platforms; `NativeLibraryLoader` selects and loads the right one at startup. You do
+  not need a Rust toolchain to use the library, only to build it from source.
+- **Typed exception hierarchy.** DataFusion errors crossing the JNI boundary surface as a typed
+  `DataFusionException` hierarchy on the Java side rather than opaque runtime exceptions.
+- **Configurable session.** `SessionContextBuilder` exposes batch size, target partitions,
+  statistics collection, information schema, memory pool size, spill directory, and the built-in
+  DataFusion cache manager.
+- **Runtime visibility.** `SessionContext.memoryUsage` and `runtimeStats` expose DataFusion's
+  internal allocator and runtime statistics for monitoring and capacity planning.
+
+## What this release is not
+
+DataFusion Java is at 0.1.0 and the API will change. A few specific limits are worth calling out
+because they shape what you can build today:
+
+- **Scalar UDFs only.** Aggregate, window, and table function UDFs are not yet exposed.
+- **Single-partition Java `TableProvider`s.** DataFusion sees a Java-implemented table as one
+  partition, and there is no projection or filter pushdown: DataFusion applies both on top of
+  the batches the Java side returns. The interface is intentionally minimal so it can grow these
+  capabilities as default methods without breaking implementations.
+- **`SessionContext` is not thread-safe.** Use one per thread, or guard access externally.
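The last point names two options, and both can be sketched without the library. In the sketch below, `Engine` is a hypothetical stand-in for any object that is not thread-safe (it is not the DataFusion Java `SessionContext`, and `run` is not a DataFusion Java method). A `ThreadLocal` gives each thread its own instance, while a lock serialises access to a single shared instance.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a non-thread-safe object; not a DataFusion Java class. */
final class Engine {
    private int queries; // unsynchronised mutable state

    int run(String sql) {
        return ++queries;
    }
}

final class ThreadConfinementDemo {
    // Option 1: one instance per thread, created lazily on first use.
    private static final ThreadLocal<Engine> PER_THREAD = ThreadLocal.withInitial(Engine::new);

    // Option 2: one shared instance guarded by an external lock.
    private static final Engine SHARED = new Engine();
    private static final ReentrantLock LOCK = new ReentrantLock();

    static Engine engineForCurrentThread() {
        return PER_THREAD.get();
    }

    static int runShared(String sql) {
        LOCK.lock();
        try {
            return SHARED.run(sql);
        } finally {
            LOCK.unlock();
        }
    }

    /** Starts the given number of worker threads and counts the distinct per-thread engines used. */
    static int distinctEngines(int threads) {
        Set<Engine> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                Engine engine = engineForCurrentThread();
                engine.run("SELECT 1");
                seen.add(engine);
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return seen.size();
    }
}
```

Per-thread instances avoid contention but multiply whatever each instance holds. With a closeable object, each thread also has to close its own instance when it is done, because `ThreadLocal` does not do that automatically. A shared, locked instance keeps one set of state at the cost of running callers one at a time.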
+
+## Looking ahead
+
+The rough direction for upcoming releases:
+
+1. **Round out the extension points.** Aggregate and window UDFs, multi-partition
+   `TableProvider`s, and projection / filter pushdown into Java table providers.
+2. **Close the gap with DataFusion's Rust API.** Anything that is reasonable to expose to a Java
+   caller should be reachable from `SessionContext` and `DataFrame` with idiomatic Java
+   signatures.
+3. **Track DataFusion releases.** The aim is to follow DataFusion's release cadence so the Java
+   binding stays current with new features and fixes in the engine.
+
+## How to get involved
+
+- **Source:** <https://github.com/apache/datafusion-java>
+- **Issues:** <https://github.com/apache/datafusion-java/issues>
+- **Documentation:** see `docs/source/` in the repository for the user and contributor guides.
+
+Bug reports, design discussions, and pull requests are all welcome. Issues labelled
+[good first issue] are a good place to start.
+
+[good first issue]: https://github.com/apache/datafusion-java/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22
+
+## Thank you
+
+Thanks to everyone who contributed to making the first release possible, and to the broader
+DataFusion and Arrow communities whose work this project builds on directly.