sadikovi commented on PR #39950:
URL: https://github.com/apache/spark/pull/39950#issuecomment-1519349584

   @yabola @sunchao  Could you share any benchmark numbers for the second 
optimisation, where each task reads all row groups? My concern is that it 
could hurt performance: with, say, 100 row groups in a file, you create 100 
tasks, one per row group, but then every task reads the full footer containing 
metadata for all 100 row groups just to process a single one.
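   To make the concern concrete, here is a toy cost model (illustrative only, not Spark or Parquet code; the function name and counts are hypothetical): if every task parses the full footer, total footer-parsing work grows quadratically with the number of row groups, whereas a single footer read keeps it linear.

   ```python
   # Toy cost model of the concern above (not Spark code).
   def footer_reads(num_row_groups: int, per_task_full_footer: bool) -> int:
       """Count row-group metadata entries parsed across all tasks."""
       if per_task_full_footer:
           # Each of the N tasks parses metadata for all N row groups.
           return num_row_groups * num_row_groups
       # The footer is read once and each task gets only its own slice.
       return num_row_groups

   print(footer_reads(100, per_task_full_footer=True))   # 10000
   print(footer_reads(100, per_task_full_footer=False))  # 100
   ```

   Under this model, 100 row groups means 10,000 row-group metadata entries parsed in total rather than 100, which is why benchmark numbers would help.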


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

