houqp opened a new pull request #55:
URL: https://github.com/apache/arrow-datafusion/pull/55


   This turned out to be a much larger and more destructive PR than I initially 
expected. I would like to get some early feedback on the approach before I 
spend more time on the cleanup. So far I have been able to get all unit tests 
to pass.
   
   TODO:
   
     - [ ] Address FIXMEs and TODOs
     - [ ] Check integration tests
     - [ ] Rebase to latest master
   
   # Which issue does this PR close?
   
   
   
   # What changes are included in this PR?
   
   Here are the main design changes introduced in this PR:
   
   * Physical column expression now references columns by unique indices 
instead of names
   * Logical column expression now wraps around a newly defined Column struct 
that can represent both qualified and unqualified columns
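To make the two representations concrete, here is a minimal sketch. It is not the code from this PR: the field names (`relation`, `name`, `index`), the `PhysicalColumn` type name, and the `from_flat_name` helper are assumptions chosen for illustration.

```rust
use std::fmt;

/// Hypothetical logical column: an optional table qualifier plus a column name.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Column {
    pub relation: Option<String>,
    pub name: String,
}

impl Column {
    /// Illustrative helper: treats "t1.id" as qualified and "id" as unqualified.
    pub fn from_flat_name(flat: &str) -> Self {
        match flat.split_once('.') {
            Some((relation, name)) => Column {
                relation: Some(relation.to_string()),
                name: name.to_string(),
            },
            None => Column {
                relation: None,
                name: flat.to_string(),
            },
        }
    }
}

impl fmt::Display for Column {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match &self.relation {
            Some(relation) => write!(f, "{}.{}", relation, self.name),
            None => write!(f, "{}", self.name),
        }
    }
}

/// Hypothetical physical column: identified by its position in the input
/// schema; the name is kept only for display purposes.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PhysicalColumn {
    pub name: String,
    pub index: usize,
}
```

With this shape, `Column::from_flat_name("t1.id")` carries the qualifier `t1`, while `Column::from_flat_name("id")` leaves it as `None` until a later normalization step fills it in.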
   
   Query execution flow change:
   1. When a TableScan plan has table_name set to `Some(name)`, all of the 
fields in its schema are created as fully qualified fields.
   1. The logical plan builder is responsible for normalizing all logical 
column expressions by adding qualifiers based on the schema wherever applicable.
   1. The logical plan optimizer operates on normalized column expressions.
   1. During physical planning, logical column expressions are resolved to 
physical column expressions with corresponding indices based on logical plan 
schemas. A side effect is that column indices are no longer looked up during 
execution; the lookup now happens at planning time.
   1. During physical planning, the physical schema (Arrow schema) has all 
column qualifiers stripped.
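The flow above can be sketched roughly as follows. This is a simplified model and not the PR's implementation: `LogicalSchema`, `from_scan`, `normalize`, `index_of`, and `physical_names` are hypothetical names, and error handling is reduced to `String` messages.

```rust
/// Hypothetical logical column (optional qualifier + name).
#[derive(Debug, Clone, PartialEq)]
pub struct Column {
    pub relation: Option<String>,
    pub name: String,
}

/// Hypothetical logical schema: an ordered list of possibly qualified fields.
pub struct LogicalSchema {
    pub fields: Vec<Column>,
}

impl LogicalSchema {
    /// Step 1: a scan with `Some(table_name)` qualifies every field.
    pub fn from_scan(table_name: Option<&str>, names: &[&str]) -> Self {
        let fields = names
            .iter()
            .map(|n| Column {
                relation: table_name.map(str::to_string),
                name: n.to_string(),
            })
            .collect();
        LogicalSchema { fields }
    }

    /// Step 2: add a qualifier to an unqualified column when exactly one
    /// field in the schema has that name.
    pub fn normalize(&self, col: &Column) -> Result<Column, String> {
        if col.relation.is_some() {
            return Ok(col.clone());
        }
        let mut matches = self.fields.iter().filter(|f| f.name == col.name);
        match (matches.next(), matches.next()) {
            (Some(field), None) => Ok(field.clone()),
            (None, _) => Err(format!("no field named '{}'", col.name)),
            (Some(_), Some(_)) => Err(format!("ambiguous column '{}'", col.name)),
        }
    }

    /// Step 4: resolve a normalized column to its index at planning time.
    pub fn index_of(&self, col: &Column) -> Option<usize> {
        self.fields.iter().position(|f| f == col)
    }

    /// Step 5: field names of the physical schema, with qualifiers stripped.
    pub fn physical_names(&self) -> Vec<String> {
        self.fields.iter().map(|f| f.name.clone()).collect()
    }
}
```

In this model, normalizing an unqualified `name` against a scan of `t1` yields `t1.name`, which resolves to a fixed index once during planning, so execution only ever sees the index.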
   
   Some other changes introduced in this PR to help make all tests pass:
   
   * avoid coalesce for hash repartition plan
   * added partitioned hash join tests to hash_join module
   * added support for join with alias (for self join)
   * added join_using method to logical plan builder
   * fixed a couple of other bugs here and there along the way, but I can't 
remember them all :(
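As an illustration of why aliases matter for self joins, the sketch below shows how qualifying each side of the join with a different alias lets identically named columns resolve to distinct indices. The helper names (`scan_as`, `join_schema`, `index_of`) and the tuple-based `Field` type are assumptions for this example only, not the builder API added in this PR.

```rust
/// Hypothetical qualified field: (table qualifier, column name).
type Field = (String, String);

/// Qualify every column of a table under the given alias.
fn scan_as(alias: &str, columns: &[&str]) -> Vec<Field> {
    columns
        .iter()
        .map(|c| (alias.to_string(), c.to_string()))
        .collect()
}

/// Join output schema: left fields followed by right fields.
fn join_schema(left: &[Field], right: &[Field]) -> Vec<Field> {
    left.iter().chain(right.iter()).cloned().collect()
}

/// Resolve a qualified column to its position in the join schema.
fn index_of(schema: &[Field], qualifier: &str, name: &str) -> Option<usize> {
    schema
        .iter()
        .position(|(q, n)| q.as_str() == qualifier && n.as_str() == name)
}
```

Scanning the same table as `t1` and `t2` and joining the two gives a schema in which `t1.id` and `t2.id` sit at different positions, which name-only lookup could not distinguish.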
   
   # Are there any user-facing changes?
   
   Breaking API changes:
   
   * Column expression now wraps a Column struct instead of a String
   * TableScan plan now takes table_name as Option instead of String
   * Various dataframe scan methods now take the table name as Option instead 
of &str
   * Physical Column expression now requires an index field
   * Logical builder join method now takes left and right keys as vectors of 
Column structs
   



