cozos opened a new issue, #24365: URL: https://github.com/apache/beam/issues/24365
### What happened?

When running Beam on Spark using `WriteToParquet` without `num_shards`, files seem to be written with no parallelism. The docs at https://beam.apache.org/releases/pydoc/2.11.0/apache_beam.io.parquetio.html say:

```
num_shards – The number of files (shards) used for output. If not set, the service will decide on the optimal number of shards.
```

However, in Spark, my tasks look like this: (screenshot of the Spark task view omitted)

I believe this is happening because `iobase.WriteImpl` ([here](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L1156-L1157)) is doing:

```python
.... | 'Pair' >> core.Map(lambda x: (None, x)) | core.GroupByKey()
```

which was added in this PR: https://github.com/apache/beam/pull/958

If I understand correctly, every element of the PCollection gets the same key, `None`, so `GroupByKey` collects all of them into a single "partition" (in Spark terms). This `None` partition is massively skewed, can only be written by one thread/task, and takes forever. A minimal repro of the single-group behavior, and a `num_shards` workaround, are sketched after the issue fields below.

### Issue Priority

Priority: 2

### Issue Component

Component: io-py-parquet
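To make the skew concrete, here is a minimal self-contained sketch (DirectRunner, public `apache_beam` API only, not the actual `WriteImpl` code) showing that pairing every element with the key `None` collapses the whole PCollection into one group:

```python
import apache_beam as beam

# Eight input elements, all paired with the same key (None), mirroring
# the 'Pair' >> Map(lambda x: (None, x)) step in iobase.WriteImpl.
with beam.Pipeline() as p:
    (p
     | beam.Create(range(8))
     | 'Pair' >> beam.Map(lambda x: (None, x))
     # GroupByKey emits one (key, values) element per distinct key;
     # with a single key there is exactly one output element, so a
     # runner like Spark processes the entire dataset in one task.
     | beam.GroupByKey()
     | beam.Map(lambda kv: print(kv[0], sorted(kv[1]))))
```

Running this prints a single `(None, [0, ..., 7])` group, which is exactly the shape that shows up as one skewed Spark task.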

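As a workaround until this is fixed, passing `num_shards` explicitly appears to restore write parallelism, since the sharded code path keys elements across the requested shards instead of a single `None` key. A minimal sketch (the schema, element dicts, and output path here are hypothetical, for illustration only):

```python
import pyarrow as pa
import apache_beam as beam
from apache_beam.io.parquetio import WriteToParquet

# Hypothetical schema and output path, for illustration only.
SCHEMA = pa.schema([('id', pa.int64()), ('name', pa.string())])

with beam.Pipeline() as p:
    (p
     | beam.Create([{'id': i, 'name': str(i)} for i in range(1000)])
     # Setting num_shards explicitly avoids the runner-decided sharding
     # path that funnels everything through a single None-keyed group,
     # so the write should be spread across up to 16 parallel tasks.
     | WriteToParquet(
           file_path_prefix='/tmp/out/part',
           schema=SCHEMA,
           file_name_suffix='.parquet',
           num_shards=16))
```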