[GitHub] [beam] lukecwik commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3

GitBox Thu, 26 Mar 2020 19:41:25 -0700

lukecwik commented on issue #11037: [BEAM-9434] performance improvements 
reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-604785112
 
 
   Sorry about the long delay. **Reshuffle** should produce as many 
partitions as the runner thinks is optimal; it is effectively a 
**redistribute** operation.
   
   It looks like the Spark runner's translation of Reshuffle copies the number 
of partitions from the upstream transform, and in your case that number is 
likely 1, so the reshuffled data stays in a single partition.
   Translation: 
https://github.com/apache/beam/blob/f5a4a5afcd9425c0ddb9ec9c70067a5d5c0bc769/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L681
   Copying partitions: 
https://github.com/apache/beam/blob/f5a4a5afcd9425c0ddb9ec9c70067a5d5c0bc769/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/GroupCombineFunctions.java#L191
   
   @iemejia Shouldn't we be using a much larger value for partitions, e.g. the 
number of nodes?
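
   To make the effect concrete, here is a minimal plain-Java sketch. It is not 
Beam or Spark code: `choosePartitions`, `redistribute`, and the parallelism 
hint are hypothetical names used only for illustration, and the hash-based 
bucketing is an assumption about how a redistribute step might spread 
elements. It shows that copying an upstream count of 1 leaves everything in 
one partition, while taking a larger hint (e.g. the number of nodes) spreads 
the elements out.

```java
import java.util.ArrayList;
import java.util.List;

public class ReshufflePartitions {

    // Hypothetical helper: choose the partition count for a reshuffle.
    // Returning upstreamPartitions alone would mirror the copying behavior
    // described above; taking the max with a hint is the suggested alternative.
    static int choosePartitions(int upstreamPartitions, int parallelismHint) {
        return Math.max(upstreamPartitions, parallelismHint);
    }

    // Spread elements over numPartitions buckets by hash code
    // (an assumed stand-in for what a hash partitioner does).
    static <T> List<List<T>> redistribute(List<T> elements, int numPartitions) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (T element : elements) {
            int index = Math.floorMod(element.hashCode(), numPartitions);
            partitions.get(index).add(element);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Sixteen stand-in "file" ids coming from a single upstream partition.
        List<Integer> files = new ArrayList<>();
        for (int i = 0; i < 16; i++) {
            files.add(i);
        }

        int upstreamPartitions = 1;
        int nodes = 4; // assumed cluster size, used as the parallelism hint

        // Copying the upstream count: all elements land in one partition.
        List<List<Integer>> copied = redistribute(files, upstreamPartitions);
        System.out.println("copied upstream count: " + copied.size() + " partition(s)");

        // Using the hint: elements are spread over several partitions.
        List<List<Integer>> spread =
            redistribute(files, choosePartitions(upstreamPartitions, nodes));
        for (int i = 0; i < spread.size(); i++) {
            System.out.println("partition " + i + " holds " + spread.get(i).size() + " element(s)");
        }
    }
}
```

   With one partition only a single task can process the files, which is 
consistent with the slow reads reported in this PR.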


