nabuskey commented on code in PR #193:
URL: https://github.com/apache/datafusion-site/pull/193#discussion_r3350370882


##########
content/blog/2026-06-10-comet-eks.md:
##########
@@ -0,0 +1,111 @@
+---
+layout: post
+title: What two months with the Comet community got our Spark workload on 
Amazon EKS
+date: 2026-06-10
+author: Manabu McCloskey (AWS), Vara Bonthu (AWS), Andy Grove
+categories: [performance]
+summary: Apache DataFusion Comet now runs a 3TB TPC-DS workload significantly 
faster than vanilla Spark 3.5.8 on Amazon EKS. This post covers what changed in 
Comet over the past two releases, and how collaboration between EKS users and 
maintainers shaped that work. <br></br> <img 
src="/blog/images/comet-eks/version-arc-slope.png" width="60%" 
class="img-fluid" alt="TPC-DS performance arc across Comet versions"/>
+---
+
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+
+## Introduction
+
+Apache Spark workloads are some of the most common and cost-intensive jobs 
running on Kubernetes. Our customers at AWS are always on the lookout for 
meaningful speedups, since any speedup translates directly into lower bills. 
That's what got us interested in Apache DataFusion Comet. The local TPC-DS and 
TPC-H benchmarks looked very good, and we wanted to see how Comet would hold up 
on a more realistic setup.
+
+When we ran a 3TB TPC-DS benchmark on Spark 3.5.8 on Amazon EKS, Comet was 11% 
slower than vanilla Spark, and we hit several operational issues along the way 
that made it hard to keep the cluster running smoothly. We built a reproducible 
benchmark kit and worked closely with the Comet maintainers to validate fixes 
as they landed. The same benchmark now runs **32% faster** than vanilla Spark 
on the same setup. The rest of this post walks through the issues we hit and 
what changed in Comet to address them.
+
+## Setup
+
+We ran TPC-DS at 3TB scale on Spark 3.5.8 on Amazon EKS, comparing vanilla 
Spark against Apache DataFusion Comet. Each executor pod was sized at 58 GB of 
RAM. We ran the same benchmark on three Comet versions, 0.14.0, 0.15.0, and 
0.16.0, so we could see how the project progressed across releases. The full 
cluster topology, Spark configurations, and benchmark scripts we used are 
documented on the [Data on EKS benchmark 
page](https://awslabs.github.io/data-on-eks/docs/benchmarks/spark-datafusion-comet-benchmark).
+
+## What we ran into
+
+We hit four classes of issues running Comet at scale on Amazon EKS. The first 
three were operational and the fourth was the source of most of the regression. 
The [Data on EKS benchmark 
page](https://github.com/awslabs/data-on-eks/blob/05d0590e019d14ed0d058f7c314db74a9b161599/website/docs/benchmarks/spark-datafusion-comet-benchmark.md)
 has full configurations and error traces for each one.
+
+### Excessive DNS queries
+
+Comet executors were generating up to 5,000 DNS queries per second per pod, 
roughly 500x what vanilla Spark issued, which pushed us against the Route 53 
Resolver per-ENI limit of 1,024 queries per second and triggered intermittent 
`UnknownHostException` failures. The root cause was that Comet's native Rust 
layer was creating a fresh object store instance for every Parquet file read, 
each with its own HTTP connection pool. The fix in 
[#3802](https://github.com/apache/datafusion-comet/pull/3802) added a 
process-wide cache for object stores, which collapsed DNS volume back to 
vanilla-Spark levels.
+
+### Unreliable S3 region detection
+
+Without an explicitly configured endpoint region, jobs failed intermittently 
with `Generic S3 error: Failed to resolve region`. The cause was the same 
caching gap as DNS: Comet called the S3 `HeadBucket` API for every Parquet file 
read to resolve the region, and those calls were getting throttled under 
concurrent load. The same fix in 
[#3802](https://github.com/apache/datafusion-comet/pull/3802) caches the 
resolved region per bucket, so `HeadBucket` runs once per bucket and the 
`endpoint.region` workaround is no longer required.
+
+### High memory footprint
+
+Comet consistently used about 67% more memory than vanilla Spark and required 
a 32 GB off-heap pool to run reliably. The root cause was in shuffle memory 
sizing: for each Spark task, Comet spun up two concurrent native execution 
contexts (pre-shuffle and shuffle writer), each allocating its own memory pool 
at the per-task limit. The fix in 
[#3924](https://github.com/apache/datafusion-comet/pull/3924) makes the two 
contexts share a single pool, bringing Comet's memory footprint much closer to 
vanilla Spark.

Review Comment:
   It is reported through `container_memory_working_set_bytes` so should be 
WSS. We can add disk usages and spill metrics to benchmark results going 
forward.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to