houqp commented on pull request #55:
URL: https://github.com/apache/arrow-datafusion/pull/55#issuecomment-831115015


   Alright, I wrote some docs over the weekend to help align expectations:
   
   * Document for output schema field name semantics with examples: 
https://docs.google.com/document/d/1uviWavwEGD3qxwMk2AGkOgp6ENrvKGiMWQhHNbqPwhg/edit?usp=sharing
   * Proposed change to @jorgecarleitao 's invariant doc: 
https://docs.google.com/document/d/158gbfDp8pcakfriT2l7dHChwJB43_RV7lcWfxEC73ng/edit?usp=sharing
   * Updated invariant doc with proposed changes applied: 
https://docs.google.com/document/d/1dbK-3eaTHlzZcHzpTk1h-LA3b7dcxsVBcoZeVKYIPwI/edit?usp=sharing
   
   Please feel free to comment and make suggestions in the docs. One conclusion 
that came out of my research is that everyone names output fields in a slightly 
different way, and PostgreSQL wins the laziest developer award.
   
   @andygrove with regard to schemaless data sources, I am thinking it might be 
better to handle them with a new set of schemaless physical nodes. All of our 
current physical nodes require knowing data types at planning time, which is 
not possible for schemaless data sources. Switching to an index as the unique 
identifier is unavoidable in this case because column names are no longer 
guaranteed to be unique once relations are introduced. For example, two joined 
tables could both introduce columns with the same name. I will do more 
research and take a look at Drill's source code to see if there are better ways 
to handle schemaless data sources.
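
   To make the name-versus-index point concrete, here is a minimal sketch in 
plain Rust. It is not DataFusion code: `Field`, `table`, `join_schema`, 
`indices_by_name` and `field_by_index` are hypothetical names made up for 
illustration, and the join output is assumed to be the left fields followed by 
the right fields.

```rust
/// A column in an output schema. `relation` is the table the column came from.
/// Hypothetical type for illustration only, not a DataFusion struct.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    relation: String,
    name: String,
}

/// Build the fields of a single table from its column names.
fn table(relation: &str, names: &[&str]) -> Vec<Field> {
    names
        .iter()
        .map(|n| Field {
            relation: relation.to_string(),
            name: n.to_string(),
        })
        .collect()
}

/// Assumed join output schema: left fields followed by right fields.
fn join_schema(left: &[Field], right: &[Field]) -> Vec<Field> {
    left.iter().chain(right.iter()).cloned().collect()
}

/// Look up by bare column name. Returns every matching position,
/// so more than one result means the name is ambiguous.
fn indices_by_name(schema: &[Field], name: &str) -> Vec<usize> {
    schema
        .iter()
        .enumerate()
        .filter(|(_, f)| f.name == name)
        .map(|(i, _)| i)
        .collect()
}

/// Look up by position. An index identifies at most one field.
fn field_by_index(schema: &[Field], index: usize) -> Option<&Field> {
    schema.get(index)
}
```

   Joining `table("t1", &["id", "name"])` with `table("t2", &["id", "amount"])` 
and calling `indices_by_name(&joined, "id")` returns two positions, one per 
input table, while `field_by_index(&joined, 2)` resolves to exactly one field, 
the `id` column from `t2`.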
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

