[GitHub] [arrow-datafusion] houqp edited a comment on pull request #952: Change compound column field name rules

GitBox Sun, 29 Aug 2021 23:31:42 -0700


houqp edited a comment on pull request #952:
URL: https://github.com/apache/arrow-datafusion/pull/952#issuecomment-908069027



   Our current output field name semantics mostly align with Spark's, 
which strips column qualifiers in all cases.
   
   This PR changes the semantics to handle compound and bare column names 
differently. For bare names, we strip column qualifiers, but not for compound 
columns.
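
   To make the proposed rule concrete, here is a minimal Python sketch. It 
is not DataFusion code: `Column`, `BinaryExpr`, `Aggregate`, 
`qualified_name` and `output_field_name` are hypothetical names invented for 
this illustration, and the tiny expression model only covers the cases 
discussed in this comment.

```python
from dataclasses import dataclass
from typing import Optional, Union


# Hypothetical, simplified expression model (not DataFusion's real types).
@dataclass(frozen=True)
class Column:
    name: str
    relation: Optional[str] = None  # table qualifier, e.g. "t1"


@dataclass(frozen=True)
class BinaryExpr:
    left: "Expr"
    op: str
    right: "Expr"


@dataclass(frozen=True)
class Aggregate:
    func: str
    arg: "Expr"


Expr = Union[Column, BinaryExpr, Aggregate]


def qualified_name(expr: Expr) -> str:
    """Render an expression, keeping every column qualifier."""
    if isinstance(expr, Column):
        return f"{expr.relation}.{expr.name}" if expr.relation else expr.name
    if isinstance(expr, BinaryExpr):
        return f"{qualified_name(expr.left)} {expr.op} {qualified_name(expr.right)}"
    return f"{expr.func}({qualified_name(expr.arg)})"


def output_field_name(expr: Expr) -> str:
    """Proposed rule: strip the qualifier only when the expression is a bare column."""
    if isinstance(expr, Column):
        return expr.name
    return qualified_name(expr)
```

   Under these assumptions, `output_field_name(Column("id", "t1"))` yields 
`id`, while an aggregate or arithmetic expression over `t1.id` keeps the 
`t1.` qualifier in its generated name.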
   
   Unlike Spark, relational databases like MySQL, PostgreSQL and SQLite all 
treat compound and bare column names differently. MySQL, PostgreSQL and SQLite 
strip qualifiers for bare column names like we do. But MySQL and SQLite use the 
raw user query input for compound column names. PostgreSQL, on the other hand, 
just uses `?column?` for all compound column names.
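
   The SQLite half of this comparison can be checked with Python's built-in 
`sqlite3` module. This is only a sketch: the helper `result_column_names` 
and the one-column table `t1` are made up for the example, and SQLite's 
documentation does not guarantee result column names when no `AS` alias is 
given, so the exact text may vary between versions.

```python
import sqlite3


def result_column_names(sql):
    """Run `sql` against an empty in-memory table t1(id) and return the result column names."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t1 (id INTEGER)")
        cur = conn.execute(sql)
        # cursor.description is populated for SELECT even when no rows match.
        return [col[0] for col in cur.description]
    finally:
        conn.close()


# Bare column: the qualifier is expected to be stripped.
bare = result_column_names("SELECT t1.id FROM t1")
# Compound column: the name is expected to follow the query text, qualifier included.
compound = result_column_names("SELECT t1.id * 2 FROM t1")
print(bare, compound)
```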
   
   In all compute engines, users are not expected to reference compound columns 
by generated names because these names are not guaranteed to be valid SQL 
expressions. Instead, they should always alias them manually. As a result, the 
output field names for compound columns are there for display/debug purposes 
only. See also @Dandandan's comment at 
https://github.com/apache/arrow-datafusion/pull/280#issuecomment-834805975.
   
   @jorgecarleitao as for the counter-intuitive example you mentioned, the 
current implementation will output the field name `SUM(id)` for the query 
`SELECT SUM(t1.id)`, while the proposed new behavior will output the field 
name `SUM(t1.id)` for the query `SELECT SUM(id)`. So neither of them uses the 
exact user query input as the output field name. Either way, it should have no 
impact on how users construct queries.
   
   The proposed new behavior provides better UX than PostgreSQL's 
`?column?` column name. I think it's also an improvement over the current 
(Spark's) behavior because it will produce an unambiguous column name for 
queries like `SELECT t1.id * t2.id FROM t1 JOIN t2 USING (id)`. The current 
behavior will output `id * id`, which is not as good as `t1.id * t2.id` IMO.
   
   


