cozos commented on issue #24365: URL: https://github.com/apache/beam/issues/24365#issuecomment-1335895475
Upon thinking about this further, I believe the bottleneck came from the reader problem I described in https://github.com/apache/beam/issues/24422: on runners that don't support SDF (Splittable DoFn), all Reads happen on a single partition. This was obscured because the Spark UI showed the job stuck at the shuffle boundary introduced by WriteToParquet, when in reality the bottleneck was the Read itself.

We can close this issue, but I don't know whether the `GroupByKey` on a `None` key is still something we want to track separately, since it could also cause a bottleneck.
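To illustrate why the `GroupByKey` on `None` can bottleneck a pipeline: a shuffle typically assigns records to workers by hashing the key, so a single key collapses all data onto one worker. This is a minimal plain-Python sketch of that mechanism, not actual Beam runner code; `shuffle_by_key` and `num_workers` are hypothetical names for illustration.

```python
# Sketch (plain Python, not Beam) of why keying every element by a
# single key such as None defeats parallelism: a shuffle assigns
# records to workers by hashing the key, so one key -> one worker.

from collections import defaultdict

def shuffle_by_key(records, num_workers):
    """Assign each (key, value) record to a worker by hashing its key."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[hash(key) % num_workers].append((key, value))
    return partitions

# A GroupByKey downstream of something like
#   beam.Map(lambda v: (None, v))
# behaves analogously to this:
records = [(None, v) for v in range(1000)]
parts = shuffle_by_key(records, num_workers=8)

# Every record lands on the same single partition.
assert len(parts) == 1
assert sum(len(p) for p in parts.values()) == 1000
```

In a real pipeline, inserting a `beam.Reshuffle()` (or keying by a higher-cardinality key) after the single-key stage is one way to redistribute work across partitions.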
