[GitHub] [arrow-datafusion] houqp commented on pull request #55: Support qualified columns in queries

GitBox Mon, 03 May 2021 01:40:29 -0700


houqp commented on pull request #55:
URL: https://github.com/apache/arrow-datafusion/pull/55#issuecomment-831115015

Alright, I wrote some docs over the weekend to help align expectations:

* Document for output schema field name semantics with examples:
https://docs.google.com/document/d/1uviWavwEGD3qxwMk2AGkOgp6ENrvKGiMWQhHNbqPwhg/edit?usp=sharing
* Proposed change to @jorgecarleitao's invariant doc:
https://docs.google.com/document/d/158gbfDp8pcakfriT2l7dHChwJB43_RV7lcWfxEC73ng/edit?usp=sharing
* Updated invariant doc with proposed changes applied:
https://docs.google.com/document/d/1dbK-3eaTHlzZcHzpTk1h-LA3b7dcxsVBcoZeVKYIPwI/edit?usp=sharing

Please feel free to comment and make suggestions in the docs. One conclusion
from my research is that everyone names output fields in a slightly different
way, and PostgreSQL wins the laziest developer award.

@andygrove with regard to schemaless data sources, I am thinking it might be
better to handle them with a new set of schemaless physical nodes. All of our
current physical nodes require knowing data types at planning time, which is
not possible for schemaless data sources. The switch to using the column index
as the unique identifier is unavoidable in this case because column names are
no longer guaranteed to be unique once relations are introduced. For example,
two joined tables could both introduce columns with the same name. I will do
more research and take a look at Drill's source code to see if there are
better ways to handle schemaless data sources.
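
To make the name-collision point concrete, here is a rough standalone sketch.
It is not DataFusion's actual API: `Field`, `join_schema`, `index_of`, and
`example_schema` are hypothetical names made up for illustration. It only
shows why a bare column name stops being a unique key after a join, while the
position in the output schema stays unique.

```rust
/// Hypothetical output-schema field: an optional relation qualifier plus a name.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    qualifier: Option<String>,
    name: String,
}

impl Field {
    fn new(qualifier: Option<&str>, name: &str) -> Self {
        Field {
            qualifier: qualifier.map(str::to_string),
            name: name.to_string(),
        }
    }
}

/// Output schema of a join: the left fields followed by the right fields.
fn join_schema(left: &[Field], right: &[Field]) -> Vec<Field> {
    left.iter().chain(right.iter()).cloned().collect()
}

/// Resolve a column reference (optionally qualified) to its index in the schema.
/// Returns an error if nothing matches or if more than one field matches.
fn index_of(schema: &[Field], qualifier: Option<&str>, name: &str) -> Result<usize, String> {
    let matches: Vec<usize> = schema
        .iter()
        .enumerate()
        .filter(|(_, f)| {
            f.name == name && (qualifier.is_none() || f.qualifier.as_deref() == qualifier)
        })
        .map(|(i, _)| i)
        .collect();

    match matches.as_slice() {
        [i] => Ok(*i),
        [] => Err(format!("column '{}' not found", name)),
        _ => Err(format!("column '{}' is ambiguous", name)),
    }
}

/// Two made-up tables, t1(id, a) and t2(id, b), joined together.
fn example_schema() -> Vec<Field> {
    let t1 = vec![Field::new(Some("t1"), "id"), Field::new(Some("t1"), "a")];
    let t2 = vec![Field::new(Some("t2"), "id"), Field::new(Some("t2"), "b")];
    join_schema(&t1, &t2)
}
```

In this sketch, resolving the unqualified `id` against `example_schema()`
fails as ambiguous, while `t1.id` and `t2.id` resolve to two distinct indices.
Once a reference has been resolved, the index is the only identifier that is
still unique in the joined schema.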

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
