revans2 edited a comment on issue #24795: [SPARK-27945][SQL] Minimal changes to support columnar processing
URL: https://github.com/apache/spark/pull/24795#issuecomment-499932172

@rxin I am confused about exactly what you are proposing.

> The ability to inspect and transform Catalyst plans, to create your own plans, which already exists

That I understand. There are some large technical burdens with using the current set of APIs, but I can get into that more if needed.

> The ability for operators to read columnar input, which somewhat exists at the scan level for Parquet and ORC but is a bit hacky at the moment. It can be generalized to support arbitrary operators.

This is the one where I don't understand what you are proposing. The only API even remotely associated with SparkPlan where columnar formatted data can leave the plan is

https://github.com/apache/spark/blob/5cdc50684883fe1134d1261b67127c10ec452b65/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L77-L82

which is part of the `CodegenSupport` trait. Looking at it, you would have no idea that it could ever actually hold `ColumnarBatch`es instead of `InternalRow`s. This hack is really just a way for a leaf node in the plan that pulls in `ColumnarBatchScan` to delegate the conversion between batches and rows to code generation. I'm not sure how that in particular can be generalized into something that would allow columnar input to arbitrary operators.

As such, are you suggesting that we keep the columnar execution path for SparkPlan as in this patch, but not for the Expressions?

Or are you suggesting that we don't change SparkPlan and instead let each operator output an `RDD[ColumnarBatch]`, but type erase it to an `RDD[InternalRow]`? We would still need some kind of flag to indicate that this is happening in the child, and a `Rule` of some kind to make sure that the transitions are inserted at the appropriate places.
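A toy sketch may make the "flag plus `Rule`" idea in the last paragraph more concrete. It is not Spark code: `Plan`, `Scan`, `Op`, `ColumnarToRow`, `RowToColumnar`, `InsertTransitions`, and the `outputsColumnar` / `acceptsColumnar` flags are all hypothetical names invented for illustration, and a real implementation would work on `SparkPlan` and a Catalyst `Rule[SparkPlan]` instead.

```scala
// Toy model only. None of these types are Spark APIs; they are hypothetical
// stand-ins used to show where row/columnar transitions would be needed.
sealed trait Plan {
  def children: Seq[Plan]
  // Assumed flag: this node really produces columnar batches.
  def outputsColumnar: Boolean
  // Assumed flag: this node can consume columnar batches from its children.
  def acceptsColumnar: Boolean
  def withChildren(newChildren: Seq[Plan]): Plan
}

// Leaf node, e.g. a Parquet-like scan that may or may not emit batches.
final case class Scan(outputsColumnar: Boolean) extends Plan {
  val children: Seq[Plan] = Nil
  val acceptsColumnar: Boolean = false
  def withChildren(newChildren: Seq[Plan]): Plan = this
}

// Generic operator with explicit input/output format flags.
final case class Op(
    name: String,
    acceptsColumnar: Boolean,
    outputsColumnar: Boolean,
    children: Seq[Plan]) extends Plan {
  def withChildren(newChildren: Seq[Plan]): Plan = copy(children = newChildren)
}

// Transition node: consumes batches, produces rows.
final case class ColumnarToRow(child: Plan) extends Plan {
  val children: Seq[Plan] = Seq(child)
  val acceptsColumnar: Boolean = true
  val outputsColumnar: Boolean = false
  def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
}

// Transition node: consumes rows, produces batches.
final case class RowToColumnar(child: Plan) extends Plan {
  val children: Seq[Plan] = Seq(child)
  val acceptsColumnar: Boolean = false
  val outputsColumnar: Boolean = true
  def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
}

// The "Rule": walk the plan bottom-up and insert a transition wherever a
// child's output format does not match what its parent accepts.
object InsertTransitions {
  private def rewrite(plan: Plan): Plan = {
    val fixedChildren = plan.children.map { c =>
      val child = rewrite(c)
      if (child.outputsColumnar && !plan.acceptsColumnar) ColumnarToRow(child)
      else if (!child.outputsColumnar && plan.acceptsColumnar) RowToColumnar(child)
      else child
    }
    plan.withChildren(fixedChildren)
  }

  // Assumption: whatever consumes the root of the plan expects rows.
  def apply(plan: Plan): Plan = {
    val fixed = rewrite(plan)
    if (fixed.outputsColumnar) ColumnarToRow(fixed) else fixed
  }
}
```

For example, `InsertTransitions(Op("project", acceptsColumnar = false, outputsColumnar = false, Seq(Scan(outputsColumnar = true))))` wraps the scan in a `ColumnarToRow` node. Either way, the format information has to be visible on the plan node itself; with a type-erased `RDD[InternalRow]` alone the rule would have nothing to inspect.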
