milenkovicm commented on PR #1333:
URL: https://github.com/apache/datafusion-ballista/pull/1333#issuecomment-3445992238

   > My thoughts are: in general the client is useful (e.g., vs Flight SQL 
endpoint) for the ergonomics of the DataFrame API, or secondarily running a 
'bad query' (like selecting a col that does not exist) can error immediately in 
the client vs having to make the request to find out.
   
   I think this is already the case: SQL is turned into a logical plan on the 
client side, without needing to make a request to find out.
    
   > To make a logical plan on the client it feels like all we need is a schema 
for a given table ref to do this. And to run UDFs (outside of packing Python 
UDFs) that may be present in the scheduler/executor runtime but not the client, 
if we know its signature and return type, we can do the same. On encountering a 
stub, the scheduler looks up the table or function in its runtime. The 
additional RPC are around getting these 'bare minimum' shapes over to the 
client.
   
   I don't disagree with what you say; we only differ on who should provide 
this. I believe you could extend the current Ballista to do this for you (or 
you could do it if we add the ability to register additional gRPC services, as 
I have mentioned).
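
   A minimal sketch of the stub idea discussed above, in plain Python. It is 
not Ballista or DataFusion API: `TableStub`, `FunctionStub`, and `StubCatalog` 
are hypothetical names, and the sketch only assumes that the client holds a 
schema per table reference and a signature per UDF, which is enough to reject 
a bad query locally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStub:
    """Hypothetical: only the schema of a remote table, no data source."""
    name: str
    columns: dict  # column name -> type name


@dataclass(frozen=True)
class FunctionStub:
    """Hypothetical: only the signature of a remote UDF, no implementation."""
    name: str
    arg_types: tuple
    return_type: str


class StubCatalog:
    """Hypothetical client-side catalog holding stubs fetched from a cluster."""

    def __init__(self):
        self.tables = {}
        self.functions = {}

    def register_table(self, stub):
        self.tables[stub.name] = stub

    def register_function(self, stub):
        self.functions[stub.name] = stub

    def check_projection(self, table, columns):
        # Fail on the client if the table or any selected column is unknown.
        stub = self.tables.get(table)
        if stub is None:
            raise LookupError(f"unknown table: {table}")
        missing = [c for c in columns if c not in stub.columns]
        if missing:
            raise ValueError(f"unknown columns in {table}: {missing}")
        return {c: stub.columns[c] for c in columns}

    def check_call(self, function, arg_types):
        # Resolve the return type from the signature alone.
        stub = self.functions.get(function)
        if stub is None:
            raise LookupError(f"unknown function: {function}")
        if tuple(arg_types) != stub.arg_types:
            raise TypeError(f"{function} expects {stub.arg_types}")
        return stub.return_type


catalog = StubCatalog()
catalog.register_table(TableStub("events", {"id": "Int64", "payload": "Utf8"}))
catalog.register_function(FunctionStub("payload_size", ("Utf8",), "Int64"))

projected = catalog.check_projection("events", ["id"])
returned = catalog.check_call("payload_size", ["Utf8"])
```

   Selecting a column that is not in the stub raises `ValueError` before 
anything is sent to the scheduler, which would still have to resolve the real 
table or function in its own runtime.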
   
   > It would be a really nice out of box experience if the client could run 
queries against any remote cluster without needing to worry about all of its 
runtime customizations. It's convenient to get the default lib off of PyPI 
and/or not worry about shipping a client as you change the cluster runtime.
   
   The current strategy is to provide a DataFusion distribution framework, not 
to create a query engine. I do agree the UX is much better if everything comes 
out of the box, but we leave the box for the user to pack.
   
   > I think this is how the client/Python lib would be most commonly used 
(with a persistent remote cluster). Maybe it could be a new 'remote mode'? The 
default client could still continue to not care about the concrete 
implementations of custom tables, fns, etc, outside of what is shipped with the 
default distribution.
   
   Consider the default implementation a showcase of what can be done, not a 
final solution. We are happy to make it more flexible so that clients can 
implement additional use cases.
   
   > Re the Python library: if the client worked like this plus a solution was 
figured out for Python UDFs, then that would (at least for me) feel very useful.
   
   One of the examples I've shared demonstrates the use of a Python UDF. It's 
probably not a perfect solution, but it's a start.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

