Re: [PR] [Client]: Sync scheduler's catalog metadata to client, to allow planning against remote tables [datafusion-ballista]

via GitHub Fri, 24 Oct 2025 16:47:27 -0700


mach-kernel commented on PR #1333:
URL: 
https://github.com/apache/datafusion-ballista/pull/1333#issuecomment-3445260261


   > If there is use case specific behaviour needed, users can change and 
compile its own client, scheduler or/and executor. Main reason was, as you 
state it in discord discussion, we're unable just to drop a jar on the class 
path. This way user can rely on functionality provided by the core ballista 
library but extend it in a way to support its own use case.
   
   Ah, sorry, I had no idea this was how it was intended to be used. The 
example repos are helpful, thank you for linking them.
   
   > I will have a better look, but at the moment most of the things look like 
they can be implemented out of the core library. you could create your own 
extensions codecs to support your specific tables. Maybe the missing part which 
could be added is registering additional (GRPC) service(es) in addition to core 
scheduler service, which could support centralised schema location.
   
   My thinking is that, compared with a Flight SQL endpoint, the client is 
useful mainly for the ergonomics of the DataFrame API. A secondary benefit is 
that a 'bad query' (like selecting a column that does not exist) can fail 
immediately in the client instead of requiring a round trip to find out. 
   
   To build a logical plan on the client, it feels like all we need is the 
schema for a given table ref. The same applies to UDFs (setting aside packing 
Python UDFs) that may be present in the scheduler/executor runtime but not in 
the client: if we know a function's signature and return type, we can plan 
against a stub. On encountering a stub, the scheduler looks up the table or 
function in its runtime. The additional RPCs are about getting these 'bare 
minimum' shapes over to the client.
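
   As a rough illustration of the stub idea above, here is a toy, in-memory 
sketch in plain Python. It is not Ballista or DataFusion code: `TableStub`, 
`Scheduler`, `Client`, `list_table_stubs`, and `plan_select` are all 
hypothetical names, and the "RPC" is just a method call. It only shows the 
shape of the flow: the client receives schemas, validates a projection 
locally, and the scheduler resolves the table by name at execution time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStub:
    """Schema-only stand-in for a table that exists on the cluster."""
    name: str
    columns: tuple  # (column name, type name) pairs


class PlanError(Exception):
    """Raised on the client when a plan refers to something unknown."""


class Scheduler:
    """Toy scheduler (hypothetical): owns the concrete, in-memory tables."""

    def __init__(self, tables):
        # tables: name -> (list of (column, type) pairs, list of row dicts)
        self._tables = tables

    def list_table_stubs(self):
        # Stands in for the extra RPC: ships only names and schemas.
        return [TableStub(name, tuple(cols))
                for name, (cols, _rows) in self._tables.items()]

    def execute(self, plan):
        # The plan only names the table; look it up in the runtime here.
        _cols, rows = self._tables[plan["table"]]
        return [{c: row[c] for c in plan["columns"]} for row in rows]


class Client:
    """Toy client (hypothetical): plans against stubs only."""

    def __init__(self, stubs):
        self._stubs = {s.name: s for s in stubs}

    def plan_select(self, table, columns):
        stub = self._stubs.get(table)
        if stub is None:
            raise PlanError(f"unknown table: {table}")
        known = {name for name, _type in stub.columns}
        missing = [c for c in columns if c not in known]
        if missing:
            # Fails locally, before any request reaches the scheduler.
            raise PlanError(f"unknown columns in {table}: {missing}")
        return {"op": "project", "table": table, "columns": list(columns)}


scheduler = Scheduler({
    "events": ([("id", "Int64"), ("kind", "Utf8")],
               [{"id": 1, "kind": "click"}, {"id": 2, "kind": "view"}]),
})
client = Client(scheduler.list_table_stubs())

plan = client.plan_select("events", ["id"])
result = scheduler.execute(plan)

try:
    client.plan_select("events", ["nope"])
    bad_query_error = None
except PlanError as exc:
    bad_query_error = str(exc)
```

   The same pattern would presumably extend to functions by shipping a name, 
argument types, and return type instead of a schema.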
   
   It would be a really nice out-of-the-box experience if the client could run 
queries against any remote cluster without needing to worry about all of its 
runtime customizations. It's convenient to get the default lib from PyPI 
and/or not have to ship a new client whenever the cluster runtime changes. 
   
   I think this is how the client/Python lib would most commonly be used (with 
a persistent remote cluster). Maybe it could be a new 'remote mode'? The 
default client could then keep ignoring the concrete implementations of custom 
tables, functions, etc., beyond what ships with the default distribution.
   
   Re the Python library: if the client worked like this and a solution were 
found for Python UDFs, it would (at least for me) feel very useful. 
   
   Hope this makes sense. I am not sure what the limitations of this idea are 
(one is that it assumes the scheduler and executor runtimes are the same). 
Thanks for getting back to me, I appreciate it!
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


