GitHub user thhapke closed a discussion: DataFusion-Comet, Spark and Future
We have implemented our own object store at SAP and have recently tested DataFusion, which delivered impressive performance, particularly compared to Spark. We use PySpark within SAP and have come across the Apache DataFusion-Comet initiative. Since the Comet repository has no dedicated discussion forum, I'm reaching out here with a few questions:

1. Is Spark-Comet designed to run on a single node, or does it leverage Spark's distributed compute engine?
2. Does Comet support PySpark, and can we use it to run our existing PySpark scripts?
3. Is Comet intended as an interim solution until DataFusion can run natively across multiple nodes (Ballista) without additional APIs?

For context, 80-90% of our jobs could potentially run more efficiently on a single node, but for some tasks distributed cluster computation is essential. It would be ideal to have a system with a decision gateway that deploys each job in an optimized manner. With DataFusion, we hope to implement such a gateway, as Spark often consumes excessive resources for simpler tasks.

I am looking forward to your insights and ideas.

Cheers
Thorsten

GitHub link: https://github.com/apache/datafusion/discussions/12549
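Regarding question 2: Comet is designed to plug into an existing Spark installation via Spark's plugin mechanism, so PySpark scripts can typically be run unchanged. A minimal sketch of launching an existing PySpark script with Comet enabled, based on the configuration keys described in the Comet documentation (the jar path and script name are placeholders):

```shell
# Sketch only: enable the Comet plugin for an unmodified PySpark job.
# The jar filename and path are placeholders for your installed version.
spark-submit \
  --jars /path/to/comet-spark-jar.jar \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.enabled=true \
  your_existing_pyspark_script.py
```

With the plugin active, Comet replaces supported Spark SQL operators with native DataFusion-based implementations and falls back to Spark's JVM execution for anything unsupported, so the script's results should be unaffected.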
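The decision-gateway idea above can be sketched as a simple routing function that sends a job either to a single-node DataFusion runner or to a distributed Spark cluster. Everything here is hypothetical and illustrative: the `Job` type, the `route` function, and the size threshold are assumptions, not part of any existing API.

```python
# Hypothetical sketch of a "decision gateway" that routes each job to the
# cheapest engine that can handle it. The threshold is an assumed value;
# a real gateway would derive it from benchmarks and available memory.
from dataclasses import dataclass

SINGLE_NODE_LIMIT_BYTES = 50 * 1024**3  # assumed 50 GiB single-node limit


@dataclass
class Job:
    name: str
    input_bytes: int
    needs_cluster: bool = False  # e.g. shuffle-heavy or memory-bound work


def route(job: Job) -> str:
    """Return the engine that should run this job."""
    if job.needs_cluster or job.input_bytes > SINGLE_NODE_LIMIT_BYTES:
        return "spark-cluster"
    return "datafusion-single-node"


# A small 2 GiB aggregation fits comfortably on one node.
print(route(Job("daily_agg", input_bytes=2 * 1024**3)))
```

The point of the sketch is that the 80-90% of jobs below the threshold never pay Spark's cluster overhead, while the remainder still get distributed execution.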
