houqp commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r639460465
########## File path: docs/rfcs/output-field-name-semantic.md ##########
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how Datafusion should construct its output field names based on
+the provided user query.
+
+## Motivation
+
+By formalizing the output field name semantic, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantic is chosen for the following reasons:
+
+* Ease of implementation: field names can be derived from the physical expression
+without having to add extra logic to pass along arbitrary user-provided input.
+Users are encouraged to use ALIAS expressions for full field name control.
+* Mostly compatible with Spark's behavior, except for literal string handling.
+* Mostly backward compatible with current Datafusion behavior, other than
+function name casing and parentheses around operator expressions.
+
+### Field name rules
+
+* All field names MUST NOT contain a relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`

Review comment:

@Dandandan I documented a survey of the behavior of MySQL/SQLite/Postgres/Spark in this doc as well, for example: https://github.com/houqp/arrow-datafusion/blob/qp_rfc/docs/rfcs/output-field-name-semantic.md#function-with-operators. In short: MySQL and SQLite use the raw user query text as the column name, Postgres throws in the towel and just uses `?column?` for everything, while Spark SQL constructs the column name from the expression. I picked Spark's behavior because it was the closest to what we had at the time. But since you have implemented MySQL and SQLite's behavior since then, I am happy to update the doc to account for that.
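To make the two rules quoted above concrete, here is a rough Rust sketch of how a field name could be derived. Note this is purely illustrative: `strip_qualifier` and `function_field_name` are hypothetical helper names for this comment, not DataFusion's actual API.

```rust
// Hypothetical sketch of the proposed field-name rules;
// not DataFusion's actual implementation.

/// Rule 1: field names must not contain a relation qualifier,
/// e.g. "t1.id" -> "id".
fn strip_qualifier(name: &str) -> &str {
    // rsplit always yields at least one item, so next() is never None.
    name.rsplit('.').next().unwrap_or(name)
}

/// Rule 2: function names are lowercased when building the output
/// field name, e.g. AVG(c1) -> "avg(c1)".
fn function_field_name(func: &str, args: &[&str]) -> String {
    format!("{}({})", func.to_lowercase(), args.join(","))
}

fn main() {
    assert_eq!(strip_qualifier("t1.id"), "id");
    assert_eq!(strip_qualifier("id"), "id");
    assert_eq!(function_field_name("AVG", &["c1"]), "avg(c1)");
}
```

Under these rules, anything beyond this (preserving the user's original casing, for example) would have to come from an explicit alias.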
In this case, we need two sets of rules: one for SQL queries, which just reuses what's provided in the query, and another for dataframe queries, which is what I outlined in this doc.

UPDATE: after taking a second look at #280, it turns out the PR was closed due to the issue you mentioned above. I am now leaning back towards not preserving user-provided names from the query, to keep things simple. It's one less thing to worry about, and it lets us apply the same set of rules to outputs produced from both SQL queries and dataframe queries. Let me know if you have a strong opinion on this though.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org