oersted opened a new issue, #4889: URL: https://github.com/apache/arrow-datafusion/issues/4889
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

At present, composing SQL is not very ergonomic. Say one wants to build a non-trivial pipeline with multiple stages by composing functions that each perform transformations on a DataFrame. Right now this is only practical with the DataFrame API, by passing partially transformed DataFrames around and applying further transformations at each stage.

**Describe the solution you'd like**

Simply the ability to run SQL on an existing DataFrame (`DataFrame::sql`), so that a user always has the option to choose between SQL and the DataFrame API in more complex pipelines. I'd suggest registering a temporary table reference with a name like `self`.

**Describe alternatives you've considered**

It might be technically possible to do this by registering intermediate views. However:

* This would only work by staying within SQL for the whole pipeline, since there doesn't seem to be an API for creating a view of a DataFrame either.
* It would require passing a reference to `SessionContext` around everywhere.
* Intermediate views would need globally unique names, and those names would have to be passed around between functions as references, which can be quite error-prone.
* Views would need to be dropped (garbage collected) when they are no longer needed.

To be fair, other similar query engines do not support this either and behave similarly. In Spark, there is the `DataFrame.createGlobalTempView` method, which is a bit more helpful but still means dealing with globally unique names.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
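To make the request concrete, here is a sketch of how the proposed API might be used. Note that `DataFrame::sql` does not exist in DataFusion today; the method, the implicit `self` table name, and the column/table names below are all hypothetical illustrations of this proposal, not real APIs.

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

// A pipeline stage written with the DataFrame API: this style already
// works today, since a DataFrame can be passed around and transformed
// without access to a SessionContext.
fn filter_stage(df: DataFrame) -> Result<DataFrame> {
    df.filter(col("amount").gt(lit(100)))
}

// A pipeline stage written in SQL against the incoming DataFrame: this
// is the proposed `DataFrame::sql`. The query refers to the DataFrame
// itself via the suggested temporary table name `self`, so no globally
// unique view names and no SessionContext need to be threaded through.
async fn aggregate_stage(df: DataFrame) -> Result<DataFrame> {
    df.sql("SELECT category, SUM(amount) AS total FROM self GROUP BY category")
        .await
}
```

With something like this, each stage of a pipeline could independently pick whichever of the two styles fits best, and stages composed by different authors would not need to coordinate on view names.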
