[GitHub] [spark] revans2 commented on issue #24795: [SPARK-27945][SQL] Minimal changes to support columnar processing

GitBox Fri, 07 Jun 2019 08:33:40 -0700

revans2 commented on issue #24795: [SPARK-27945][SQL] Minimal changes to
support columnar processing
URL: https://github.com/apache/spark/pull/24795#issuecomment-499932172

@rxin

I am a bit confused about exactly what you are proposing.

> The ability to inspect and transform Catalyst plans, to create your own
> plans, which already exists

That I understand. There are some large technical burdens with using the
current set of APIs, but I can get into that more if needed.

> The ability for operators to read columnar input, which somewhat exists at
> the scan level for Parquet and ORC but is a bit hacky at the moment. It can be
> generalized to support arbitrary operators.

This is the one where I don't understand what you are proposing. The only
API even remotely associated with SparkPlan where columnar formatted data can
leave the plan is

https://github.com/apache/spark/blob/5cdc50684883fe1134d1261b67127c10ec452b65/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L77-L82

That is part of the `CodegenSupport` trait. Looking at it, you would have no
idea that it could ever actually hold `ColumnarBatch`es instead of
`InternalRow`s. This hack is really just a way for a leaf node in the plan
that mixes in `ColumnarBatchScan` to delegate the conversion between batches
and rows to code generation. I'm not sure how that in particular can be
generalized into something that would allow for columnar input to arbitrary
operators.
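
For anyone not familiar with the trick, here is a toy illustration of what type erasing the batches looks like. It is only a sketch: `Row` and `Batch` are made-up stand-ins, not `InternalRow` and `ColumnarBatch`, and plain iterators stand in for RDDs.

```scala
// Toy illustration only: Row and Batch are made-up stand-ins, not Spark classes.
object ErasureSketch {
  final case class Row(values: Seq[Any])
  final case class Batch(rows: Seq[Row])

  // A "scan" that really produces batches but advertises rows.
  // The cast succeeds at runtime because the JVM erases the element type.
  def scan(): Iterator[Row] =
    Iterator(Batch(Seq(Row(Seq(1)), Row(Seq(2)))), Batch(Seq(Row(Seq(3)))))
      .asInstanceOf[Iterator[Row]]

  // A consumer that knows about the hack casts back and unpacks each batch.
  def consumeAware(input: Iterator[Row]): List[Row] =
    input.asInstanceOf[Iterator[Batch]].flatMap(_.rows).toList

  // A consumer that trusts the declared type treats each element as a Row,
  // which fails as soon as an element is actually touched.
  def consumeNaive(input: Iterator[Row]): List[Any] =
    input.flatMap(_.values).toList
}
```

Nothing in the signature of `scan` says that batches are inside, so only a consumer that already knows about the convention can use the data safely.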

So are you suggesting that we keep the columnar execution path for
`SparkPlan` as in this patch, but not for the Expressions? Or are you
suggesting that we leave `SparkPlan` unchanged and let each operator output an
`RDD[ColumnarBatch]` that is type erased to an `RDD[InternalRow]`? In that
case we would still need some kind of flag to indicate that this is happening
in the child, and a `Rule` of some kind to make sure that the transitions are
inserted at the appropriate places.
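
To make the flag-plus-`Rule` idea concrete, here is a rough sketch of how it could look. It is a toy model with no Spark dependency: `Plan`, `Op`, `ColumnarToRow`, `RowToColumnar`, `consumesColumnar`, `outputsColumnar`, and `insertTransitions` are all hypothetical names made up for illustration, not anything in Spark or in this patch.

```scala
// Toy model only: every name here is hypothetical, none of it is Spark's API.
object TransitionSketch {
  sealed trait Plan {
    def children: Seq[Plan]
    def consumesColumnar: Boolean // expects batches from its children
    def outputsColumnar: Boolean  // the flag: produces batches, not rows
    def withChildren(newChildren: Seq[Plan]): Plan
  }

  // A generic operator that declares what it consumes and what it produces.
  final case class Op(
      name: String,
      consumesColumnar: Boolean,
      outputsColumnar: Boolean,
      children: Seq[Plan] = Nil) extends Plan {
    def withChildren(newChildren: Seq[Plan]): Plan = copy(children = newChildren)
  }

  // Transition nodes that convert between the two data layouts.
  final case class ColumnarToRow(child: Plan) extends Plan {
    def children: Seq[Plan] = Seq(child)
    def consumesColumnar: Boolean = true
    def outputsColumnar: Boolean = false
    def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
  }

  final case class RowToColumnar(child: Plan) extends Plan {
    def children: Seq[Plan] = Seq(child)
    def consumesColumnar: Boolean = false
    def outputsColumnar: Boolean = true
    def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
  }

  // The "rule": walk the plan bottom-up and wrap any child whose output
  // layout does not match what its parent consumes.
  def insertTransitions(plan: Plan): Plan = {
    val adapted = plan.children.map(insertTransitions).map { child =>
      (child.outputsColumnar, plan.consumesColumnar) match {
        case (true, false) => ColumnarToRow(child)
        case (false, true) => RowToColumnar(child)
        case _             => child
      }
    }
    plan.withChildren(adapted)
  }
}
```

With a columnar scan under a row-based filter, `insertTransitions` wraps the scan in `ColumnarToRow` and leaves the filter alone. Running it a second time changes nothing, because the transition nodes already match their children.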



