revans2 commented on issue #24795: [SPARK-27945][SQL] Minimal changes to 
support columnar processing
URL: https://github.com/apache/spark/pull/24795#issuecomment-499932172
 
 
   @rxin
   
   I am a bit confused about exactly what you are proposing.
   
   > The ability to inspect and transform Catalyst plans, to create your own 
plans, which already exists
   
   That I understand. There are some large technical burdens with using the 
current set of APIs, but I can get into that more if needed.
   
   > The ability for operators to read columnar input, which somewhat exists at 
the scan level for Parquet and ORC but is a bit hacky at the moment. It can be 
generalized to support arbitrary operators.
   
   This is the one where I don't understand what you are proposing.  The only 
API even remotely associated with SparkPlan where columnar-formatted data can 
leave the plan is
   
   
https://github.com/apache/spark/blob/5cdc50684883fe1134d1261b67127c10ec452b65/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L77-L82
   
   This is part of the `CodegenSupport` trait. Looking at it, you would have 
no idea that it could ever actually hold `ColumnarBatch`es instead of 
`InternalRow`s.  This hack is really just a way for a leaf node in the plan 
that pulls in `ColumnarBatchScan` to delegate the conversion between batches 
and rows to code generation. I'm not sure how that in particular can be 
generalized into something that would allow for columnar input to arbitrary 
operators.
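
To make the concern concrete, here is a rough, self-contained sketch of the pattern as I understand it. It does not use Spark's real classes: `Row`, `Batch`, and `ErasureSketch` are hypothetical stand-ins for `InternalRow`, `ColumnarBatch`, and the scan/codegen pair, and plain iterators stand in for RDDs.

```scala
// Hypothetical stand-ins for InternalRow and ColumnarBatch (not Spark's classes).
final case class Row(values: Vector[Int])

final case class Batch(columns: Vector[Vector[Int]]) {
  def numRows: Int = if (columns.isEmpty) 0 else columns.head.length

  // Pivot the column vectors back into rows, one row per index.
  def rows: Iterator[Row] =
    Iterator.range(0, numRows).map(i => Row(columns.map(_(i))))
}

object ErasureSketch {
  // The signature promises rows, but the iterator really carries batches.
  // The cast succeeds only because the element type is erased at runtime.
  def scanOutput(batches: Seq[Batch]): Iterator[Row] =
    batches.iterator.asInstanceOf[Iterator[Row]]

  // A consumer that happens to know the secret casts back and converts.
  // Any other consumer would hit a ClassCastException on the first element.
  def toRows(input: Iterator[Row]): Iterator[Row] =
    input.asInstanceOf[Iterator[Batch]].flatMap(_.rows)
}
```

Nothing in the type of `scanOutput` tells an arbitrary downstream operator which of the two it is receiving, which is why I don't see how this generalizes.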
   
   So are you suggesting that we keep the columnar execution path for 
SparkPlan as in this patch, but not for the Expressions?  Or are you suggesting 
instead that we don't change SparkPlan and have each operator be able to 
output an `RDD[ColumnarBatch]`, but type-erase it to an `RDD[InternalRow]`?  We 
would still need some kind of flag or something else to indicate that this is 
happening in the child, and a `Rule` of some kind to make sure that the 
transitions are inserted at the appropriate places.
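
As a minimal sketch of what I mean by a flag plus a rule, assuming a toy plan tree rather than Spark's actual `SparkPlan` and `Rule` classes: every name below (`Plan`, `outputsColumnar`, `ColumnarToRow`, `RowToColumnar`, `InsertTransitions`) is hypothetical and only illustrates the shape of the idea.

```scala
// Toy plan tree; `outputsColumnar` is the hypothetical per-operator flag.
sealed trait Plan { def outputsColumnar: Boolean }

case object ColumnarScan extends Plan { val outputsColumnar = true }
final case class RowFilter(child: Plan) extends Plan { val outputsColumnar = false }
final case class ColumnarProject(child: Plan) extends Plan { val outputsColumnar = true }

// Explicit transition operators that the rule inserts where formats disagree.
final case class ColumnarToRow(child: Plan) extends Plan { val outputsColumnar = false }
final case class RowToColumnar(child: Plan) extends Plan { val outputsColumnar = true }

object InsertTransitions {
  // Rewrite the child first, then wrap it if its output format is not
  // what the parent wants to consume.
  private def adapt(child: Plan, wantColumnar: Boolean): Plan = {
    val fixed = apply(child)
    if (fixed.outputsColumnar == wantColumnar) fixed
    else if (wantColumnar) RowToColumnar(fixed)
    else ColumnarToRow(fixed)
  }

  // Bottom-up rewrite: each operator states which input format it needs.
  def apply(plan: Plan): Plan = plan match {
    case ColumnarScan       => ColumnarScan
    case RowFilter(c)       => RowFilter(adapt(c, wantColumnar = false))
    case ColumnarProject(c) => ColumnarProject(adapt(c, wantColumnar = true))
    case ColumnarToRow(c)   => ColumnarToRow(adapt(c, wantColumnar = true))
    case RowToColumnar(c)   => RowToColumnar(adapt(c, wantColumnar = false))
  }
}
```

For example, `InsertTransitions(ColumnarProject(RowFilter(ColumnarScan)))` wraps the scan in a `ColumnarToRow` and the filter in a `RowToColumnar`. Without the flag, the rule has nothing to key off of when the output is type-erased.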
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

