milenkovicm commented on PR #1333:
URL:
https://github.com/apache/datafusion-ballista/pull/1333#issuecomment-3445992238
> My thoughts are: in general the client is useful (e.g., vs Flight SQL
endpoint) for the ergonomics of the DataFrame API, or secondarily running a
'bad query' (like selecting a col that does not exist) can error immediately in
the client vs having to make the request to find out.
I think this is already the case: SQL is turned into a logical plan on the
client side, so no request is needed to find out.
> To make a logical plan on the client it feels like all we need is a schema
for a given table ref to do this. And to run UDFs (outside of packing Python
UDFs) that may be present in the scheduler/executor runtime but not the client,
if we know its signature and return type, we can do the same. On encountering a
stub, the scheduler looks up the table or function in its runtime. The
additional RPC are around getting these 'bare minimum' shapes over to the
client.
I don't disagree with what you say; we only differ on who should provide
this. I believe you could extend the current Ballista to do this for you (or
you could do it if we add the ability to register additional gRPC services, as
I have mentioned).
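To make the stub idea quoted above concrete, here is a minimal, self-contained sketch in plain Python. It is not Ballista or DataFusion API: `TableStub`, `FunctionStub`, `plan_projection`, `plan_call`, and `PlanError` are all hypothetical names, and the "plan" is just a dict. It only illustrates that a schema and a function signature are enough to reject a bad query on the client without a round trip.

```python
from dataclasses import dataclass


class PlanError(Exception):
    """Raised on the client when a query references something unknown."""


@dataclass(frozen=True)
class TableStub:
    # Hypothetical: only the name and column types, no data source behind it.
    name: str
    schema: dict  # column name -> type name


@dataclass(frozen=True)
class FunctionStub:
    # Hypothetical: signature only; the implementation lives on the cluster.
    name: str
    arg_types: tuple
    return_type: str


def plan_projection(table, columns):
    """Check column references against the stub schema and return a toy plan."""
    missing = [c for c in columns if c not in table.schema]
    if missing:
        raise PlanError(
            f"table '{table.name}' has no column(s): {', '.join(missing)}"
        )
    return {
        "op": "projection",
        "table": table.name,
        "output": {c: table.schema[c] for c in columns},
    }


def plan_call(func, table, arg_columns):
    """Type-check a function call using only the stub's signature."""
    plan = plan_projection(table, arg_columns)
    actual = tuple(table.schema[c] for c in arg_columns)
    if actual != func.arg_types:
        raise PlanError(f"{func.name} expects {func.arg_types}, got {actual}")
    return {
        "op": "call",
        "function": func.name,
        "input": plan,
        "output_type": func.return_type,
    }
```

In this sketch, `plan_projection(TableStub("orders", {"id": "int64"}), ["nope"])` raises `PlanError` locally, while a valid plan would be sent to the scheduler, which resolves the stubs against its own runtime. Who serves the stubs (Ballista itself or a user-registered service) is the open question in this thread.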
> It would be a really nice out of box experience if the client could run
queries against any remote cluster without needing to worry about all of its
runtime customizations. It's convenient to get the default lib off of PyPI
and/or not worry about shipping a client as you change the cluster runtime.
The current strategy is to provide a DataFusion distribution framework, not
to create a query engine. I do agree the UX is much better if everything works
out of the box, but we leave the box for the user to pack.
> I think this is how the client/Python lib would be most commonly used
(with a persistent remote cluster). Maybe it could be a new 'remote mode'? The
default client could still continue to not care about the concrete
implementations of custom tables, fns, etc, outside of what is shipped with the
default distribution.
Consider the default implementation a showcase of what can be done, not a
final solution. We are happy to make it more flexible so that clients can
implement additional use cases.
> Re the Python library: if the client worked like this plus a solution was
figured out for Python UDFs, then that would (at least for me) feel very useful.
One of the examples I've shared demonstrates the use of a Python UDF. It's
probably not a perfect solution, but it's a start.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]