milenkovicm opened a new issue, #1068:
URL: https://github.com/apache/datafusion-ballista/issues/1068

   # Ballista Reloaded - Roadmap Proposal
   
   [As it looks like we have reached some kind of 
consensus](https://github.com/apache/datafusion-ballista/pull/1066) about 
moving Ballista from an application to a library, I'd like to propose a few 
targets that I see as short- to medium-term goals for Ballista. This would 
address comments from @alamb & @Dandandan.
   
   Personally, I see two main short-term goals: improving Ballista usability 
and reducing the amount of code we have to maintain. Robustness may come up as 
another important goal, but I don't see the bandwidth or infrastructure for it 
at this point.
   
   ## 0. Keep up with DataFusion releases
   
   Nothing else to add :)
   
   ## 1. Usability
   
   It would be great if we could make writing a Ballista application as easy 
as writing a DataFusion one. Ideally, it should be very hard to spot the 
difference between them.
   
   ### 1.1 `BallistaContext` removal or evolution
   
   Can we replace `BallistaContext` with `SessionContext`? It would definitely 
improve usability, as we would get most of the methods available in 
`SessionContext`. Also, some DataFusion applications would be deployable to 
Ballista with a single-line change.
   
   ```rust
   let ctx = SessionContext::ballista_standalone().await?;
   ```
   
   This approach may bring DataFusion Python on board as well, although I'm 
not sure how easy that would be.
   
   There are clear benefits to deprecating `BallistaContext`, but the decision 
may hurt us in the long run.
   
   `SessionContext` may bring usability issues with `UDF` support, 
configuration, and basically all functionality that needs to be propagated 
across the cluster to work, which may not be trivial to address. We may try 
to address this by "turning off" those methods in Ballista or just by 
documenting it; either way, some effort is needed. Or maybe it's not an issue 
at all?
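
   One way a constructor like `SessionContext::ballista_standalone()` could be 
added without changing DataFusion itself is an extension trait. The sketch 
below only illustrates that pattern: `SessionContext` here is a stand-in 
struct, not the real DataFusion type, the methods are synchronous for brevity, 
and `SessionContextExt`, `Backend` and `ballista_remote` are hypothetical names.

```rust
/// Where queries would run. Hypothetical, for illustration only.
#[derive(Debug, PartialEq)]
pub enum Backend {
    Local,
    BallistaStandalone,
    BallistaRemote(String),
}

/// Stand-in for DataFusion's `SessionContext` (NOT the real type), so that
/// the sketch stays self-contained.
pub struct SessionContext {
    backend: Backend,
}

impl SessionContext {
    /// Plain local context, as a DataFusion application would create it.
    pub fn new() -> Self {
        SessionContext { backend: Backend::Local }
    }

    pub fn backend(&self) -> &Backend {
        &self.backend
    }
}

/// Hypothetical extension trait that a Ballista client crate could export.
pub trait SessionContextExt: Sized {
    fn ballista_standalone() -> Result<Self, String>;
    fn ballista_remote(url: &str) -> Result<Self, String>;
}

impl SessionContextExt for SessionContext {
    fn ballista_standalone() -> Result<Self, String> {
        // A real implementation would start an in-process scheduler/executor.
        Ok(SessionContext { backend: Backend::BallistaStandalone })
    }

    fn ballista_remote(url: &str) -> Result<Self, String> {
        if url.is_empty() {
            return Err("scheduler URL must not be empty".to_string());
        }
        Ok(SessionContext { backend: Backend::BallistaRemote(url.to_string()) })
    }
}
```

   With the trait in scope, switching an application from local to distributed 
execution would mean replacing `SessionContext::new()` with one of the 
trait constructors, and the rest of the code would stay the same.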
   
   ### 1.2 Scheduler/executor binaries
   
   Ballista as a library should keep the scheduler and executor binaries, as 
they would improve overall Ballista usability and provide a quick way to 
bootstrap a Ballista cluster, for easy onboarding and testing purposes.
   
   We should focus our effort on providing methods that make it easy to build 
custom scheduler/executor binaries. We could provide a way to create a new 
scheduler/executor with default configuration, or add a way to plug in object 
store registries, configurations, protocols, session context factories ...
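
   A builder is one possible shape for such an API: defaults out of the box, 
with individual pieces replaceable. This is a minimal sketch under assumed 
names; `SchedulerBuilder`, `SchedulerConfig`, `SessionFactory` and the 
default bind address are all hypothetical and not taken from the Ballista code 
base.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical plug-in point standing in for a "session context factory".
pub trait SessionFactory: Send + Sync {
    fn name(&self) -> &str;
}

/// Factory used when the caller does not plug in their own.
pub struct DefaultSessionFactory;

impl SessionFactory for DefaultSessionFactory {
    fn name(&self) -> &str {
        "default"
    }
}

/// Everything a custom scheduler binary would need to start (illustrative).
pub struct SchedulerConfig {
    pub bind_addr: String,
    pub settings: HashMap<String, String>,
    pub session_factory: Arc<dyn SessionFactory>,
}

pub struct SchedulerBuilder {
    config: SchedulerConfig,
}

impl Default for SchedulerBuilder {
    /// Starts from defaults so that the simplest binary is a one-liner.
    fn default() -> Self {
        SchedulerBuilder {
            config: SchedulerConfig {
                bind_addr: "0.0.0.0:50050".to_string(), // illustrative value
                settings: HashMap::new(),
                session_factory: Arc::new(DefaultSessionFactory),
            },
        }
    }
}

impl SchedulerBuilder {
    pub fn bind_addr(mut self, addr: &str) -> Self {
        self.config.bind_addr = addr.to_string();
        self
    }

    pub fn setting(mut self, key: &str, value: &str) -> Self {
        self.config.settings.insert(key.to_string(), value.to_string());
        self
    }

    /// Replaces the default factory with a user-provided one.
    pub fn session_factory(mut self, factory: Arc<dyn SessionFactory>) -> Self {
        self.config.session_factory = factory;
        self
    }

    pub fn build(self) -> SchedulerConfig {
        self.config
    }
}
```

   `SchedulerBuilder::default().build()` would give the stock configuration, 
while a custom binary would chain only the overrides it needs. Object store 
registries or protocols could be plugged in the same way as the factory.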
   
   ### 1.3 Ballista Contrib
   
   Move some of the components which are now optional to separate 
sub-projects.
   
   ## 2. Protocol (client - scheduler - executor)
   
   There are two protocols we may need to have a look at: client-scheduler and 
scheduler-executor.
   The major use cases may be support for user defined functions, 
configuration propagation, and replacement of the protocol itself.
   
   ### 2.1 Propagate SessionContext configuration from client to executor
   
   At the moment, `SessionContext` configuration (or any other state) is not 
propagated from the client to the scheduler and executors.
   Enabling this would simplify overall configuration, and it would enable 
use cases where the configuration holds
   secret keys, object store configuration or similar.
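
   Conceptually, propagation could mean that the client flattens its settings 
into key/value pairs, ships them with the request, and the executor overlays 
them on its own defaults. The sketch below shows only that merge step; 
`Settings`, `to_wire` and `apply_wire` are hypothetical names, and no actual 
wire format or Ballista type is implied.

```rust
use std::collections::BTreeMap;

/// Session settings as plain strings (illustrative stand-in).
pub type Settings = BTreeMap<String, String>;

/// Client side: flatten settings into key/value pairs, the shape a
/// repeated key/value field in a request message could carry.
pub fn to_wire(settings: &Settings) -> Vec<(String, String)> {
    settings
        .iter()
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect()
}

/// Executor side: start from local defaults and let the values sent by the
/// client override them.
pub fn apply_wire(defaults: &Settings, wire: &[(String, String)]) -> Settings {
    let mut merged = defaults.clone();
    for (key, value) in wire {
        merged.insert(key.clone(), value.clone());
    }
    merged
}
```

   Settings that carry secrets would need extra care (for example, not 
logging them), which this sketch does not address.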
   
   ### 2.2 Support for user defined functions
   
   I'm not aware of any examples where Rust-based UDFs are made serializable 
and shipped from client to server,
   but there are many examples where Python functions are shipped, so this 
effort may focus on Python UDFs. This effort
   would probably impact DataFusion plans; more details to follow.
   
   ### 2.3 Make client-scheduler protocol pluggable
   
   The current client-scheduler protocol will be improved. Also, as there are 
new protocols coming out, we may provide
   a way to replace the default protocol.
   
   One (new) protocol example is [Spark 
Connect](https://spark.apache.org/docs/latest/spark-connect-overview.html). It 
is a well-thought-out approach covering most if not all cases for layered data 
processing. Users would be able to provide support for it and deploy frameworks 
like [Sail](https://github.com/lakehq/sail) on top of Ballista, or even Spark 
applications. Personally I find this interesting, and with the growing number 
of operators that DataFusion Comet supports, it might bring interesting 
possibilities.
   
   Also, this is needed if flight-sql is made optional and moved to a 
'contrib' project.
   
   ## 3. Shuffle improvements
   
   @andygrove mentioned that we could [re-implement the shuffle writer/reader 
to re-use the logic in Comet which has a more efficient shuffle implementation 
based on 
Spark](https://discord.com/channels/885562378132000778/1179822705525141605/1277287101800386713).
 It would be great if we could see this implemented in the short term.
   
   ## 4. Scheduler
   
   Improvements to the internal scheduler could be a mid- to long-term goal, 
where users can bring their own strategies. Not many use cases come to my mind 
apart from HDFS co-location or caching.
   
   Two possible items here:
   
   - Pluggable scheduler
   - Adding/improving Failure detector(s)
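
   To make "bring your own strategy" a bit more concrete, a pluggable 
scheduler could boil down to a trait that picks an executor for a task. The 
sketch below is only an illustration of that idea, including a 
co-location-style policy; `SchedulingPolicy`, `ExecutorInfo`, `TaskInfo` and 
both policies are hypothetical and do not mirror Ballista's internal 
scheduler.

```rust
/// What a policy knows about an executor (illustrative fields).
#[derive(Debug, Clone, PartialEq)]
pub struct ExecutorInfo {
    pub id: String,
    pub host: String,
    pub free_slots: usize,
}

/// What a policy knows about a task (illustrative fields).
pub struct TaskInfo {
    /// Host that holds the task's input data, if known.
    pub preferred_host: Option<String>,
}

/// Hypothetical plug-in point: users bring their own strategy.
pub trait SchedulingPolicy {
    /// Returns the index of the chosen executor, or `None` if no executor
    /// has a free slot.
    fn pick(&self, task: &TaskInfo, executors: &[ExecutorInfo]) -> Option<usize>;
}

/// Picks the executor with the most free slots.
pub struct MostFreeSlots;

impl SchedulingPolicy for MostFreeSlots {
    fn pick(&self, _task: &TaskInfo, executors: &[ExecutorInfo]) -> Option<usize> {
        executors
            .iter()
            .enumerate()
            .filter(|(_, e)| e.free_slots > 0)
            .max_by_key(|(_, e)| e.free_slots)
            .map(|(i, _)| i)
    }
}

/// Prefers an executor on the task's preferred host (data co-location) and
/// falls back to `MostFreeSlots` when that host has no free slot.
pub struct PreferLocalHost;

impl SchedulingPolicy for PreferLocalHost {
    fn pick(&self, task: &TaskInfo, executors: &[ExecutorInfo]) -> Option<usize> {
        let local = task.preferred_host.as_ref().and_then(|host| {
            executors
                .iter()
                .position(|e| e.free_slots > 0 && &e.host == host)
        });
        local.or_else(|| MostFreeSlots.pick(task, executors))
    }
}
```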
   
   ## 5. Observability
   
   As the UI has been removed, and the REST API may be moved to contrib, we 
need to come up with a notification mechanism that external systems can 
subscribe to in order to get scheduling events, execution metrics ... We would 
need to put some more effort into breaking down this functionality. I guess we 
could learn from Apache Spark.
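
   As a starting point for discussion, a notification mechanism could be as 
simple as a publish/subscribe bus inside the scheduler. This is a minimal 
in-process sketch using standard library channels; `EventBus` and 
`SchedulerEvent` are hypothetical names, and a real design would also need a 
way to reach external systems over the network.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Illustrative event types; the real set would need to be worked out.
#[derive(Debug, Clone, PartialEq)]
pub enum SchedulerEvent {
    JobSubmitted { job_id: String },
    JobFinished { job_id: String, succeeded: bool },
}

/// Hypothetical in-process event bus that subscribers can attach to.
#[derive(Default)]
pub struct EventBus {
    subscribers: Vec<Sender<SchedulerEvent>>,
}

impl EventBus {
    /// Registers a new subscriber and returns its receiving end.
    pub fn subscribe(&mut self) -> Receiver<SchedulerEvent> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Sends the event to every subscriber and forgets subscribers whose
    /// receiver has been dropped.
    pub fn publish(&mut self, event: SchedulerEvent) {
        self.subscribers
            .retain(|tx| tx.send(event.clone()).is_ok());
    }

    pub fn subscriber_count(&self) -> usize {
        self.subscribers.len()
    }
}
```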
   
   ## 6. Testing
   
   We should put effort into getting more tests and covering edge cases. It 
may not be easy, as it needs additional infrastructure and a lot of testing 
effort.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

