gabotechs commented on issue #23194:
URL: https://github.com/apache/datafusion/issues/23194#issuecomment-4874456368

   > in my (biased) view it would
   > 1. Validate that datafusion-distributed is general enough that it can be 
used with an existing system like Ballista
   > 2. It would potentially let us pool development efforts
   > 
   > To be clear I am not saying it is required or even a good idea -- I was 
mostly dreaming that if it was to come to pass it would be cool
   
   One thing to note here is that the goal of `datafusion-distributed` is to be 
general for end users, allowing them to build their own distributed engine for 
their own datasources, for example:
   - `datafusion` extension points allow people to provide their own data 
sources, and customize how they are executed
   - `datafusion-distributed` extension points allow people to customize how 
those data sources should be distributed, which workers should be involved, how 
work is split, where network boundaries should be placed, etc., all while 
maintaining an almost exact 1:1 execution model with normal `datafusion`; the 
only difference is that some nodes move data over the wire rather than in-memory.
   
   So, the design goal of `datafusion-distributed` is to satisfy end users 
directly, and I'm not sure Ballista fits that definition. There are success 
stories of other systems, like ParadeDB, that are currently built on top of 
`datafusion-distributed`, but I don't think it's in the same position as 
Ballista.
   
   The market gap `datafusion-distributed` fills is also different from 
Ballista:
   - Customizable in-memory engines: Trino is the closest here, but it's not 
nearly as customizable as the `datafusion` + `datafusion-distributed` combo, 
and we are confident that a system built on top of `datafusion-distributed` is 
faster and cheaper to run than its Trino equivalent.
   - Customizable disk-based shuffling engines: Ballista would fall in this 
bucket, but there are already really big players doing a very good job in this 
space (Spark, Comet), so `datafusion-distributed` is not interested in 
competing here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

