GitHub user thhapke closed a discussion: DataFusion-Comet, Spark and Future
We have implemented our own object store at SAP and have recently tested DataFusion, which delivered impressive performance, particularly compared to Spark. We use PySpark within SAP and have come across the Apache DataFusion-Comet initiative. Since the Comet repository has no dedicated discussion forum, I'm reaching out here with a few questions:

1. Is Spark-Comet designed to run on a single node, or does it leverage Spark's distributed compute engine?
2. Does Comet support PySpark, and can we use it to run our existing PySpark scripts?
3. Is Comet intended as an interim solution until DataFusion can run natively across multiple nodes (Ballista) without additional APIs?

For context, 80-90% of our jobs could potentially run more efficiently on a single node, but for some tasks distributed cluster computation is essential. It would be ideal to have a system with a decision gateway that deploys each job in an optimized manner. With DataFusion, we hope to implement such a gateway, as Spark often consumes excessive resources for simpler tasks.

I am looking forward to your insights and ideas.

Cheers
Thorsten

GitHub link: https://github.com/apache/datafusion/discussions/12549
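Regarding question 2: Comet is designed to plug into an existing Spark installation via Spark's plugin mechanism, so PySpark scripts can typically be run unchanged. A minimal sketch of launching an existing PySpark script with Comet enabled, based on the configuration keys described in the Comet documentation (the jar path and script name are placeholders):

```shell
# Sketch only: enable the Comet plugin for an unmodified PySpark job.
# The jar filename and path are placeholders for your installed version.
spark-submit \
  --jars /path/to/comet-spark-jar.jar \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.enabled=true \
  your_existing_pyspark_script.py
```

With the plugin active, Comet replaces supported Spark SQL operators with native DataFusion-based implementations and falls back to Spark's JVM execution for anything unsupported, so the script's results should be unaffected.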
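The decision-gateway idea above can be sketched as a simple routing function that sends a job either to a single-node DataFusion runner or to a distributed Spark cluster. Everything here is hypothetical and illustrative: the `Job` type, the `route` function, and the size threshold are assumptions, not part of any existing API.

```python
# Hypothetical sketch of a "decision gateway" that routes each job to the
# cheapest engine that can handle it. The threshold is an assumed value;
# a real gateway would derive it from benchmarks and available memory.
from dataclasses import dataclass

SINGLE_NODE_LIMIT_BYTES = 50 * 1024**3  # assumed 50 GiB single-node limit


@dataclass
class Job:
    name: str
    input_bytes: int
    needs_cluster: bool = False  # e.g. shuffle-heavy or memory-bound work


def route(job: Job) -> str:
    """Return the engine that should run this job."""
    if job.needs_cluster or job.input_bytes > SINGLE_NODE_LIMIT_BYTES:
        return "spark-cluster"
    return "datafusion-single-node"


# A small 2 GiB aggregation fits comfortably on one node.
print(route(Job("daily_agg", input_bytes=2 * 1024**3)))
```

The point of the sketch is that the 80-90% of jobs below the threshold never pay Spark's cluster overhead, while the remainder still get distributed execution.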
