Re: [PR] Ballista 53.0.0 blog post [datafusion-site]

via GitHub Sun, 24 May 2026 08:27:22 -0700


milenkovicm commented on code in PR #188:
URL: https://github.com/apache/datafusion-site/pull/188#discussion_r3294876695



##########
content/blog/2026-05-24-datafusion-ballista-53.0.0.md:
##########
@@ -0,0 +1,289 @@
+---
+layout: post
+title: Apache DataFusion Ballista 53.0.0 Released
+date: 2026-05-24
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+We are pleased to announce version [53.0.0] of [Apache DataFusion Ballista]. Ballista is a distributed query
+execution engine that enhances [Apache DataFusion] by enabling parallel execution of workloads across multiple
+nodes.
+
+[53.0.0]: https://github.com/apache/datafusion-ballista/blob/main/CHANGELOG.md#5300-2026-05-19
+[Apache DataFusion Ballista]: https://datafusion.apache.org/ballista/
+[Apache DataFusion]: https://datafusion.apache.org
+
+The last Ballista blog post covered [43.0.0], released in January 2025. In the year and a bit since, the
+project has quietly shipped a release for every DataFusion release: 44, 45, 46, 47, 48, 49, 50, 51, 52, and
+now 53. This post catches up on what changed across that span, what landed specifically in 53.0.0, and where
+the project is heading.
+
+[43.0.0]: /blog/2025/02/02/datafusion-ballista-43.0.0/
+
+## How Ballista has changed since 43.0.0
+
+The story of 43.0.0 was one of simplification: experimental features were removed, the `BallistaContext` was
+deprecated in favor of the standard DataFusion `SessionContext`, and the project's release cadence was
+aligned with DataFusion's. The story of the year that followed has been one of putting things back, but
+under a more deliberate design.
+
+### Production deployment
+
+A lot of the work over this period has been about running Ballista in real clusters rather than just on a
+developer's laptop:
+
+- **S3 object store support** has been added to both the executor and scheduler binaries, including
+  credentials derived from the standard AWS environment, instance metadata, and explicit configuration.
+- **Docker images** for the scheduler and executor are now published on each release, making Docker Compose
+  and Kubernetes deployments straightforward.
+- **Cluster RPC** can be configured with TLS and custom headers, enabling deployments that need encrypted
+  inter-component traffic or pass-through authentication.
+- **Push-based task scheduling** is now the default, replacing pull-staged scheduling. Push scheduling
+  generally results in lower latency for short queries. Both modes remain available.
+- **Configurable gRPC timeouts**, retry policies, and message size limits make it easier to operate clusters
+  under varying network conditions.
+- **Memory bounds for executors** can now be set with `--memory-pool-size`, so executors no longer rely on
+  unbounded growth.
+
+### Shuffle subsystem
+
+The shuffle subsystem received the largest single rework over this period.
+
+- A new **sort-based shuffle writer** was added in 52.0.0 and made the default in 53.0.0. The hash-based
+  writer remains available behind a configuration flag.
+- **Buffered I/O** in the shuffle writer significantly reduces the number of small writes, and disk I/O
+  has been moved off the Tokio worker threads so that I/O latency does not block scheduling.
+- **Per-task spill thresholds** bound writer memory in the sort-based path, and a deferred materialization
+  step using `interleave_record_batch` reduces allocator pressure during shuffle write.
+- **Remote shuffle reads** now use Arrow Flight directly, with a client cache on the executor side, giving
+  better throughput and resource utilization for shuffle-heavy queries.
+- **Shuffle reader cleanup** removes job-local data once a job completes.
+
+### REST API and observability
+
+The scheduler's REST API has grown from a small status surface to the primary control plane for inspecting
+running and completed jobs:
+
+- The REST API is now enabled by default.
+- `/api/jobs` and `/api/jobs/<job_id>` expose job status, start/end times, logical and physical plans,
+  per-stage task information, and metrics.
+- Plans can be rendered as a tree directly from the REST API.
+- Per-executor system and process metrics are reported, and Prometheus metrics integration is available
+  behind a feature flag.
+
+### A new Python interface

Review Comment:
   Perhaps we should mention the work with the DataFusion Python team to improve this integration?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

