revans2 commented on issue #24795: [SPARK-27945][SQL] Minimal changes to support columnar processing
URL: https://github.com/apache/spark/pull/24795#issuecomment-499960263

Yes, the reference counting and the expression changes could technically be removed. The reference counting is there to support columnar expressions.

In reality, I believe that none of this patch is strictly required to make columnar processing work as a plugin. I picked this level of abstraction, and added new rules that modify the physical plan, to reduce the amount of redundant code between a plugin and Spark SQL itself to the point where using the plugin API would be preferable to forking Spark. Every other implementation of columnar processing on Spark has gone the route of forking because the existing extension APIs, though technically sufficient, require any plugin to re-implement large parts of query planning and execution.

For example, user-defined rules for mapping a logical plan to a physical plan run before the built-in rules. This means that if I add a rule to replace a projection with a columnar-enabled version, the built-in rule that creates the FileScanExec and does predicate push down into Parquet or ORC will not match the projection I just mapped. Predicate push down is then silently disabled. We ran into this issue when we tried to use the existing extension APIs.

Another example is ShuffleExchangeExec. The best performance optimization is to not do something, so if we can keep the data columnar through an exchange, we have eliminated one columnar-to-row translation and one row-to-columnar translation. This is part of the requirements in the SPIP. Replacing that operator during the logical-to-physical plan translation is possible, although difficult. But it also has the unintended consequence that the ReuseExchange rule does not match our alternative ShuffleExchangeExec, and without full access to the physical plan I don't think we could re-implement that rule in the logical-to-physical plan translation.

I picked a very specific place in the life cycle of the plan to inject these rules because it is as close to the final physical plan as possible, so the fewest transformations run afterwards. Even then, the code generation rule that runs afterwards still required some modifications.

I think dropping the changes to the Expressions is doable, but not having a way to modify the physical plan very late in the process would require us to really rethink the direction we are taking.
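To make the ReuseExchange point above concrete, here is a toy Python model (not Spark code) of a reuse rule that matches on one specific operator type. Every class and function name in it is a hypothetical stand-in: `ShuffleExchange` plays the role of the built-in exchange, `ColumnarShuffleExchange` plays a plugin's replacement, and `reuse_exchange` only loosely imitates what a reuse rule does. It is a sketch of the matching problem, assuming the rule keys on the operator's type, and nothing more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class ShuffleExchange:
    """Stand-in for the built-in exchange operator (hypothetical)."""
    child: object


@dataclass(frozen=True)
class ColumnarShuffleExchange:
    """Stand-in for a plugin's alternative exchange (hypothetical)."""
    child: object


@dataclass(frozen=True)
class Reused:
    """Marks a subtree whose result is shared with an earlier, equal one."""
    original: object


@dataclass(frozen=True)
class Join:
    left: object
    right: object


def reuse_exchange(plan, seen=None):
    """Toy reuse rule: a repeated ShuffleExchange subtree becomes Reused."""
    if seen is None:
        seen = set()
    if isinstance(plan, Join):
        left = reuse_exchange(plan.left, seen)
        right = reuse_exchange(plan.right, seen)
        return Join(left, right)
    # The rule matches the built-in type only; any other node passes through.
    if isinstance(plan, ShuffleExchange):
        if plan in seen:
            return Reused(plan)
        seen.add(plan)
    return plan


# Same query shape twice: once with the built-in exchange, once with the plugin's.
builtin = reuse_exchange(
    Join(ShuffleExchange(Scan("t")), ShuffleExchange(Scan("t"))))
plugin = reuse_exchange(
    Join(ColumnarShuffleExchange(Scan("t")), ColumnarShuffleExchange(Scan("t"))))

print(type(builtin.right).__name__)
print(type(plugin.right).__name__)
```

In this model the second built-in exchange is replaced by a `Reused` marker, while both plugin exchanges survive untouched, because the rule never recognizes the alternative type. Running plugin rules late, against the physical plan, avoids having to duplicate this kind of rule inside the plugin.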
