[GitHub] [spark] revans2 commented on issue #24795: [SPARK-27945][SQL] Minimal changes to support columnar processing

GitBox Fri, 07 Jun 2019 09:56:53 -0700

revans2 commented on issue #24795: [SPARK-27945][SQL] Minimal changes to 
support columnar processing
URL: https://github.com/apache/spark/pull/24795#issuecomment-499960263
 
 
   Yes, the reference counting and the expression changes could technically be 
removed.  The reference counting is there to be able to support columnar 
expressions.
   
   In reality I believe that none of this patch is technically required to make 
columnar processing work as a plugin. I picked this level of abstraction, and 
added new rules that modify the physical plan, to reduce the amount of 
redundant code between a plugin and Spark SQL itself to the point that it 
would be preferable to use the plugin API over forking Spark.  Every other 
implementation of columnar processing on Spark has gone the route of forking 
because the existing extension APIs, though technically sufficient, require 
any plugin to re-implement large parts of query planning and execution.  
   
   For example, user-defined rules for mapping a logical plan to a physical 
plan run prior to the built-in rules.  This means that if I add a rule to 
replace a projection with a columnar-enabled version, the built-in rule that 
creates the FileScanExec and does predicate push down into Parquet or Orc will 
not match the projection I just mapped. Predicate push down is then silently 
disabled.  We ran into this issue when we tried to use the existing extension 
APIs.  
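   
   The ordering problem can be sketched with a toy planner. This is a minimal 
Python model, not Spark code: the node classes, the strategy functions, and the 
`plan_later` helper are all hypothetical, and the plugin rule is assumed to 
claim both projections and filters. It only shows how a first-match-wins list 
of strategies can stop a later built-in rule from seeing the filter it would 
have pushed down.

```python
from dataclasses import dataclass


# Toy logical plan nodes (hypothetical; these are not Spark classes).
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    condition: str
    child: object


@dataclass(frozen=True)
class Project:
    columns: tuple
    child: object


def builtin_scan_strategy(plan, plan_later):
    """Match an optional Project over an optional Filter over a Scan as one
    unit, and push the filter condition into the scan."""
    columns, node = None, plan
    if isinstance(node, Project):
        columns, node = node.columns, node.child
    condition = None
    if isinstance(node, Filter):
        condition, node = node.condition, node.child
    if not isinstance(node, Scan):
        return None
    physical = f"FileScan({node.table}, pushed={condition})"
    return f"RowProject({list(columns)}, {physical})" if columns else physical


def builtin_basic_strategy(plan, plan_later):
    """Row-based fallback for single Project and Filter nodes."""
    if isinstance(plan, Project):
        return f"RowProject({list(plan.columns)}, {plan_later(plan.child)})"
    if isinstance(plan, Filter):
        return f"RowFilter({plan.condition}, {plan_later(plan.child)})"
    return None


def plugin_columnar_strategy(plan, plan_later):
    """Assumed plugin rule: replaces Project and Filter one node at a time."""
    if isinstance(plan, Project):
        return f"ColumnarProject({list(plan.columns)}, {plan_later(plan.child)})"
    if isinstance(plan, Filter):
        return f"ColumnarFilter({plan.condition}, {plan_later(plan.child)})"
    return None


def plan_physical(plan, strategies):
    """Try each strategy in order; the first one that matches wins."""
    def plan_later(child):
        return plan_physical(child, strategies)

    for strategy in strategies:
        result = strategy(plan, plan_later)
        if result is not None:
            return result
    raise ValueError(f"no strategy for {plan!r}")


logical = Project(("a",), Filter("a > 1", Scan("t")))
builtins = [builtin_scan_strategy, builtin_basic_strategy]

builtin_only = plan_physical(logical, builtins)
# The plugin strategy runs first, mirroring user rules running before built-ins.
with_plugin = plan_physical(logical, [plugin_columnar_strategy] + builtins)

print(builtin_only)
print(with_plugin)
```

   In this model the built-in-only plan carries the filter inside the scan, 
while the plan produced with the plugin rule in front reaches the scan with 
nothing left to push, and no error is raised along the way.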
   
   Another example is ShuffleExchangeExec. The best performance optimization is 
to not do something, so if we can keep the data columnar through an exchange we 
have eliminated one columnar-to-row translation and one row-to-columnar 
translation. This is a part of the requirements in the SPIP.  Replacing the 
exchange in the logical to physical plan translation is possible, although 
difficult.  But it also has the unintended consequence that the ReuseExchange 
rule does not match our alternative ShuffleExchangeExec.  Without full access 
to the physical plan I don't think we could re-implement that rule in the 
logical to physical plan translation.
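   
   A toy model of why a replacement exchange can miss the reuse optimization. 
This is a Python sketch under stated assumptions, not Spark's ReuseExchange 
implementation: it assumes the reuse rule recognizes exchanges by their 
concrete class, and every class and function name below is hypothetical.

```python
from dataclasses import dataclass


# Toy physical plan nodes (hypothetical; these are not Spark classes).
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class ShuffleExchange:
    child: object


@dataclass(frozen=True)
class ColumnarShuffleExchange:
    # Assumed plugin replacement; deliberately not a subclass of ShuffleExchange.
    child: object


@dataclass(frozen=True)
class ReusedExchange:
    original: object


@dataclass(frozen=True)
class Join:
    left: object
    right: object


def reuse_exchange(plan, seen=None):
    """Replace repeated, identical ShuffleExchange nodes with ReusedExchange.

    Only nodes of the built-in ShuffleExchange class are considered, which is
    the assumption this sketch is meant to illustrate.
    """
    seen = {} if seen is None else seen
    if isinstance(plan, ShuffleExchange):
        if plan in seen:
            return ReusedExchange(seen[plan])
        seen[plan] = plan
        return plan
    if isinstance(plan, Join):
        left = reuse_exchange(plan.left, seen)
        right = reuse_exchange(plan.right, seen)
        return Join(left, right)
    return plan


row_plan = Join(ShuffleExchange(Scan("t")), ShuffleExchange(Scan("t")))
columnar_plan = Join(
    ColumnarShuffleExchange(Scan("t")),
    ColumnarShuffleExchange(Scan("t")),
)

reused_row = reuse_exchange(row_plan)
reused_columnar = reuse_exchange(columnar_plan)

print(reused_row)
print(reused_columnar)
```

   In this model the second built-in exchange is replaced by a reuse node, 
while the plan with the substitute exchange comes back unchanged, so the same 
shuffle would be computed twice.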
   
   I picked a very specific place in the life cycle of the plan to inject these 
rules because it was as close to the final physical plan as possible, so there 
were the fewest transformations afterwards.  Even then the code generation 
rule that runs afterwards still required some modifications.
   
   I think dropping the changes to the Expressions is doable, but not having a 
way to modify the physical plan very late in the process would require us to 
really rethink the direction we are taking.


