gabotechs commented on issue #23194: URL: https://github.com/apache/datafusion/issues/23194#issuecomment-4874456368
> in my (biased) view it would
> 1. Validate that datafusion-distributed is general enough that it can be used with an existing system like Ballista
> 2. It would potentially let us pool development efforts
>
> To be clear I am not saying it is required or even a good idea -- I was mostly dreaming that if it was to come to pass it would be cool

One thing to note here is that the goal of `datafusion-distributed` is to be general for end users, allowing them to build their own distributed engine for their own data sources. For example:

- `datafusion` extension points allow people to provide their own data sources and customize how they are executed.
- `datafusion-distributed` extension points allow people to customize how those data sources should be distributed: which workers should be involved, how work is split, where network boundaries should be placed, etc. All of this keeps an almost exact 1:1 execution model with normal `datafusion`; the only difference is that some nodes move data over the wire rather than in memory.

So, the design goal of `datafusion-distributed` is to satisfy end users directly, and I'm not sure Ballista fits that definition. There are success stories of other systems, like ParadeDB, that are currently built on top of `datafusion-distributed`, but I don't think they are in the same position as Ballista.

The market gap `datafusion-distributed` fills is also different from Ballista's:

- Customizable in-memory engines: Trino is the closest here, but it's not nearly as customizable as the `datafusion` + `datafusion-distributed` combo, and we are confident that a system built on top of `datafusion-distributed` is faster and cheaper to run than its Trino equivalent.
- Customizable disk-based shuffling engines: Ballista would fall in this bucket, but there are already really big players doing a very good job in this space (Spark, Comet), so `datafusion-distributed` is not interested in competing here.
