[GitHub] [spark] viirya commented on pull request #34642: [SPARK-37369][SQL] Avoid redundant ColumnarToRow transistion on InMemoryTableScan

GitBox Tue, 23 Nov 2021 10:55:42 -0800


viirya commented on pull request #34642:
URL: https://github.com/apache/spark/pull/34642#issuecomment-977016485



   > I'm trying to understand the motivation. Is it because in-memory table can
   > output rows efficiently? Parquet scan can also output rows but we try our best
   > to output columnar batches.
   
   For a Parquet scan, when we ask it to output columnar batches, it actually
behaves quite differently from the row-based approach because it runs the
vectorized Parquet reader. I think this is why we try our best to output
columnar batches from Parquet or ORC scans: the vectorized reader usually
performs much better, which can offset the cost of any later columnar-to-row
transition.
   
   For an in-memory table, there is no physical disk scan; the data is already
serialized in memory. The motivation is that in local experiments I found the
columnar-to-row transition to be costly, and the columnar output seems to
serve no purpose there.
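
   To make the argument concrete, here is a toy model in plain Python. It is
not Spark code: `cached`, `rows_via_batch`, and `rows_direct` are hypothetical
names, and a dict of lists stands in for data already held in memory in
columnar form. It only sketches why building an intermediate batch and then
converting it to rows is extra work when no vectorized reader produces the
batch.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for cached, already-in-memory columnar data.
cached: Dict[str, List[object]] = {
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
}

def rows_via_batch(data: Dict[str, List[object]]) -> List[Tuple[object, ...]]:
    """Materialize a columnar batch first, then run a columnar-to-row step."""
    # Step 1: copy each column into an intermediate batch.
    batch = [list(column) for column in data.values()]
    # Step 2: the columnar-to-row transition, one tuple per row index.
    num_rows = len(batch[0]) if batch else 0
    return [tuple(column[i] for column in batch) for i in range(num_rows)]

def rows_direct(data: Dict[str, List[object]]) -> List[Tuple[object, ...]]:
    """Emit rows straight from the cached columns, with no intermediate batch."""
    return list(zip(*data.values()))

# Both paths yield the same rows; the first one does an extra copy and pass.
assert rows_via_batch(cached) == rows_direct(cached)
```

   In this sketch both functions return identical rows, so the intermediate
batch adds cost without changing the result. Whether the same holds in Spark
depends on the operators that consume the scan output.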


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



