cloud-fan opened a new pull request #25264: [SPARK-28213][SQL][followup] code 
cleanup for columnar execution framework
URL: https://github.com/apache/spark/pull/25264
 
 
   ## What changes were proposed in this pull request?
   
   I did a post-hoc review of https://github.com/apache/spark/pull/25008 and 
would like to propose some cleanups/fixes/improvements:
   1. Avoid double-counting the numOutputRows metric. Previously we counted 
numOutputRows only in the file scan node; now we count it twice: in both the 
file scan node and `ColumnarToRowExec`. Since file scan is critical to 
performance, I think we should avoid adding any overhead during refactoring.
   2. Do not track the scanTime metric in `ColumnarToRowExec`. This metric is 
specific to file scan and doesn't make sense for a general batch-to-row 
operator.
   3. Because of 2, we need to track scanTime when building RDDs in the file 
scan node.
   4. Use `RDD#mapPartitionsInternal` instead of `flatMap` in several places, 
as `mapPartitionsInternal` was created for Spark SQL and we use it in almost 
all the SQL operators.
   5. Add `limitNotReachedCond` in `ColumnarToRowExec`. This was in the 
`ColumnarBatchScan` before and is critical for performance.
   6. Do not insert `InputAdapter` below `ColumnarToRowExec`. It's odd to use 
`InputAdapter` only for generating the input RDD, since `ColumnarToRowExec` 
can generate the input RDD itself.
   7. Reuse the `ColumnarBatch` in `RowToColumnarExec`. We don't need to create 
a new one every time; we just need to reset it.
   8. Do not skip testing the full scan node in `LogicalPlanTagInSparkPlanSuite`.
   9. Add back the removed tests in `WholeStageCodegenSuite`.
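
The batch reuse in item 7 can be sketched outside Spark. The Python below is 
only a model of the idea, not the code in this PR: `Batch` and 
`rows_to_batches` are hypothetical names, and a plain list per column stands in 
for a column vector. The conversion function takes a whole partition's row 
iterator, which is the shape a `mapPartitions`-style call works with, allocates 
one batch, and resets it between outputs.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, str]


class Batch:
    """Hypothetical stand-in for a columnar batch: one list per column."""

    def __init__(self, num_cols: int) -> None:
        self.columns: List[list] = [[] for _ in range(num_cols)]
        self.num_rows = 0

    def reset(self) -> None:
        # Clear the column buffers in place instead of allocating new ones.
        for col in self.columns:
            col.clear()
        self.num_rows = 0

    def append(self, row: Row) -> None:
        # Scatter one row's values into the per-column buffers.
        for col, value in zip(self.columns, row):
            col.append(value)
        self.num_rows += 1


def rows_to_batches(
    rows: Iterable[Row], num_cols: int, batch_size: int
) -> Iterator[Batch]:
    """Convert one partition's rows to batches, reusing a single Batch."""
    batch = Batch(num_cols)  # allocated once per partition
    for row in rows:
        if batch.num_rows == batch_size:
            # Hand the full batch to the consumer, then reuse it.
            yield batch
            batch.reset()
        batch.append(row)
    if batch.num_rows > 0:
        yield batch  # final, possibly partial, batch


if __name__ == "__main__":
    partition = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
    object_ids = set()
    for b in rows_to_batches(partition, num_cols=2, batch_size=2):
        object_ids.add(id(b))
        print(b.num_rows, b.columns)
    print("distinct batch objects:", len(object_ids))
```

The trade-off of reuse is that the consumer must finish with each batch (or 
copy what it needs) before pulling the next one, because the same object is 
overwritten on the following iteration.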
   
   ## How was this patch tested?
   
   Existing tests.
