cozos opened a new issue, #24365:
URL: https://github.com/apache/beam/issues/24365

   ### What happened?
   
   When running Beam on Spark using `WriteToParquet` without `num_shards`, 
files seem to be written with no parallelism. In 
https://beam.apache.org/releases/pydoc/2.11.0/apache_beam.io.parquetio.html it 
says:
   
   ```
   num_shards – The number of files (shards) used for output. If not set, the 
service will decide on the optimal number of shards.
   ```
   
   However, in Spark, my tasks look like this:
   ![Screen Shot 2022-11-24 at 11 39 42 
PM](https://user-images.githubusercontent.com/2646862/204147013-8010caf8-a4c8-49ee-baec-2477e521cf80.png)
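   
   For reference, a stripped-down sketch of the kind of pipeline that hits this (the output path and schema are placeholders, not my actual job):
   
    ```
    import pyarrow
    import apache_beam as beam
    from apache_beam.io.parquetio import WriteToParquet
    
    with beam.Pipeline() as p:
        _ = (
            p
            | 'Create' >> beam.Create([{'value': i} for i in range(1000)])
            # num_shards is deliberately left unset, so "the service will
            # decide on the optimal number of shards".
            | 'Write' >> WriteToParquet(
                file_path_prefix='/tmp/out',  # placeholder path
                schema=pyarrow.schema([('value', pyarrow.int64())]),
            )
        )
    ```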
   
   I believe this is happening because `iobase.WriteImpl` ([here](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L1156-L1157)) is doing:
   
    ```
        ...
        | 'Pair' >> core.Map(lambda x: (None, x))
        | core.GroupByKey()
    ```
   
   which was added in this PR: https://github.com/apache/beam/pull/958
   
   If I understand correctly, all of the PCollection's elements end up with the same key, `None`, so `GroupByKey` collects them into a single group, i.e. a single "partition" in Spark terms. This `None` partition is massively skewed: it can only be written by one thread/task, and so takes forever.
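   
   A runner-agnostic toy sketch of that keying behaviour (the transform labels here are mine, not from `iobase`):
   
    ```
    import apache_beam as beam
    
    with beam.Pipeline() as p:
        _ = (
            p
            | 'Create' >> beam.Create(range(1000))
            # Every element gets the same constant key.
            | 'Pair' >> beam.Map(lambda x: (None, x))
            # GroupByKey therefore emits exactly one (key, iterable) pair,
            # so a runner like Spark has a single partition to process.
            | 'Group' >> beam.GroupByKey()
            | 'Count' >> beam.Map(lambda kv: (kv[0], len(list(kv[1]))))
            | 'Print' >> beam.Map(print)  # prints (None, 1000) exactly once
        )
    ```
   
   For what it's worth, passing an explicit `num_shards` seems to avoid this, since elements then appear to get round-robin shard keys instead of a single `None` key, at the cost of fixing the output file count.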
   
   
   ### Issue Priority
   
   Priority: 2
   
   ### Issue Component
   
   Component: io-py-parquet

