mach-kernel commented on PR #1333: URL: https://github.com/apache/datafusion-ballista/pull/1333#issuecomment-3445260261
> If there is use case specific behaviour needed, users can change and compile its own client, scheduler or/and executor. Main reason was, as you state it in discord discussion, we're unable just to drop a jar on the class path. This way user can rely on functionality provided by the core ballista library but extend it in a way to support its own use case.

Ah, sorry, I had no idea this was how it was intended to be used. The example repos are helpful, thank you for linking them.

> I will have a better look, but at the moment most of the things look like they can be implemented out of the core library. you could create your own extensions codecs to support your specific tables. Maybe the missing part which could be added is registering additional (GRPC) service(es) in addition to core scheduler service, which could support centralised schema location.

My thoughts are: in general the client is useful (e.g., vs a Flight SQL endpoint) for the ergonomics of the DataFrame API. Secondarily, a 'bad query' (like selecting a col that does not exist) can error immediately in the client instead of needing a request to the cluster to find out.

To build a logical plan on the client, it feels like all we need is a schema for a given table ref. The same goes for UDFs (outside of packing Python UDFs) that may be present in the scheduler/executor runtime but not in the client: if we know a function's signature and return type, the client can plan against it. On encountering a stub, the scheduler looks up the table or function in its own runtime. The additional RPCs are about getting these 'bare minimum' shapes over to the client.

It would be a really nice out-of-the-box experience if the client could run queries against any remote cluster without needing to worry about all of its runtime customizations. It's convenient to get the default lib off of PyPI and not have to ship a new client every time the cluster runtime changes.
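To make the stub idea a bit more concrete, here is a rough sketch in plain Python. It is only an illustration of the shape of the data, not Ballista or DataFusion code: every name in it (`TableStub`, `FunctionStub`, `StubCatalog`, `PlanError`, the type-name strings) is hypothetical, and it assumes the stubs have already been fetched from the cluster by some RPC that does not exist today.

```python
from dataclasses import dataclass, field


class PlanError(Exception):
    """Raised on the client when a query cannot be planned against the stubs."""


@dataclass
class TableStub:
    # Hypothetical: schema-only description of a table that lives in the cluster.
    name: str
    columns: dict  # column name -> type name, e.g. {"id": "int64"}


@dataclass
class FunctionStub:
    # Hypothetical: signature-only description of a UDF registered in the cluster.
    name: str
    arg_types: tuple
    return_type: str


@dataclass
class StubCatalog:
    """Client-side view of the 'bare minimum' shapes, with no implementations."""
    tables: dict = field(default_factory=dict)
    functions: dict = field(default_factory=dict)

    def register_table(self, stub):
        self.tables[stub.name] = stub

    def register_function(self, stub):
        self.functions[stub.name] = stub

    def check_select(self, table, columns):
        """Validate column references locally and return their types."""
        stub = self.tables.get(table)
        if stub is None:
            raise PlanError(f"unknown table: {table}")
        missing = [c for c in columns if c not in stub.columns]
        if missing:
            raise PlanError(f"unknown column(s) in {table}: {missing}")
        return {c: stub.columns[c] for c in columns}

    def check_call(self, fn, arg_types):
        """Validate a UDF call against its stub and return the result type."""
        stub = self.functions.get(fn)
        if stub is None:
            raise PlanError(f"unknown function: {fn}")
        if tuple(arg_types) != tuple(stub.arg_types):
            raise PlanError(f"{fn} expects {stub.arg_types}, got {tuple(arg_types)}")
        return stub.return_type
```

With something like this, selecting a column that does not exist raises `PlanError` on the client before anything is sent, while the scheduler remains the only place that resolves a stub to a real table provider or function.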
I think this is how the client/Python lib would be most commonly used (with a persistent remote cluster). Maybe it could be a new 'remote mode'? The default client could still continue to not care about the concrete implementations of custom tables, fns, etc., outside of what is shipped with the default distribution.

Re the Python library: if the client worked like this and a solution were figured out for Python UDFs, then that would (at least for me) feel very useful.

Hope this makes sense. I am not sure what the limitations of this idea are (one is that it assumes the scheduler/executor runtimes are the same). Thanks for getting back to me, I appreciate it!
