lukecwik commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-604785112 Sorry about the long delay but **Reshuffle** should produce as many partitions as the runner thinks is optimal. It is effectively a **redistribute** operation. It looks like the spark translation is copying the number of partitions from the upstream transform for the reshuffle translation and in your case this is likely 1. Translation: https://github.com/apache/beam/blob/f5a4a5afcd9425c0ddb9ec9c70067a5d5c0bc769/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L681 Copying partitions: https://github.com/apache/beam/blob/f5a4a5afcd9425c0ddb9ec9c70067a5d5c0bc769/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/GroupCombineFunctions.java#L191 @iemejia Shouldn't we be using a much larger value for partitions, e.g. the number of nodes?
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
