houqp commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r639460465
########## File path: docs/rfcs/output-field-name-semantic.md ##########
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how Datafusion should construct its output field names based on
+the provided user query.
+
+## Motivation
+
+By formalizing the output field name semantic, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantic is chosen for the following reasons:
+
+* Ease of implementation: field names can be derived from the physical expression
+without having to add extra logic to pass along arbitrary user-provided input.
+Users are encouraged to use ALIAS expressions for full field name control.
+* Mostly compatible with Spark's behavior, except for literal string handling.
+* Mostly backward compatible with current Datafusion behavior, other than
+function name casing and parentheses around operator expressions.
+
+### Field name rules
+
+* All field names MUST NOT contain a relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`

Review comment:

@Dandandan I documented a survey of the behavior of MySQL/SQLite/Postgres/Spark in this doc as well, for example: https://github.com/houqp/arrow-datafusion/blob/qp_rfc/docs/rfcs/output-field-name-semantic.md#function-with-operators. In short: MySQL and SQLite use the raw user query text as the column name, Postgres throws in the towel and just uses `?column?` for everything, while Spark SQL constructs the column name from the expression. I picked Spark's behavior because it was the closest to what we had at the time. But since you have implemented MySQL and SQLite's behavior since then, I am happy to update the doc to account for that.
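To make the two rules quoted above concrete, here is a rough Rust sketch of how a field name could be derived. Note this is purely illustrative: `strip_qualifier` and `function_field_name` are hypothetical helper names for this comment, not DataFusion's actual API.

```rust
// Hypothetical sketch of the proposed field-name rules;
// not DataFusion's actual implementation.

/// Rule 1: field names must not contain a relation qualifier,
/// e.g. "t1.id" -> "id".
fn strip_qualifier(name: &str) -> &str {
    // rsplit always yields at least one item, so next() is never None.
    name.rsplit('.').next().unwrap_or(name)
}

/// Rule 2: function names are lowercased when building the output
/// field name, e.g. AVG(c1) -> "avg(c1)".
fn function_field_name(func: &str, args: &[&str]) -> String {
    format!("{}({})", func.to_lowercase(), args.join(","))
}

fn main() {
    assert_eq!(strip_qualifier("t1.id"), "id");
    assert_eq!(strip_qualifier("id"), "id");
    assert_eq!(function_field_name("AVG", &["c1"]), "avg(c1)");
}
```

Under these rules, anything beyond this (preserving the user's original casing, for example) would have to come from an explicit alias.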
In this case, we need two sets of rules: one for SQL queries, which just reuses what's provided in the query, and another for dataframe queries, which is what I outlined in this doc.

UPDATE: after taking a second look at #280, it turns out the PR was closed due to the issue you mentioned above. I am now leaning back towards not preserving user-provided names from the query, to keep things simple. It's one less thing to worry about, and it lets us apply the same set of rules to outputs produced from both SQL queries and dataframe queries. Let me know if you have a strong opinion on this though.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org