[
https://issues.apache.org/jira/browse/SPARK-32966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200571#comment-17200571
]
Takeshi Yamamuro commented on SPARK-32966:
------------------------------------------
Is this a question? At the very least, I think you need to provide more
information (e.g., a complete, self-contained query that reproduces the issue).
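
For reference, a complete reproduction would look something like the sketch
below (the session conf mirrors the one mentioned in the report; the input
path, schema, and data volume are placeholders, not details from the report):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("SPARK-32966-repro")
             .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
             .getOrCreate())

    # Placeholder input; the report does not say how df2_merge is built.
    df2_merge = spark.read.parquet("s3://some-bucket/input/")

    # The slow step under discussion: an overwrite partitioned by posted_on.
    df2_merge.write.mode("overwrite") \
        .partitionBy("posted_on") \
        .parquet("s3://some-bucket/output/")
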
> Spark | partitionBy is taking a long time to process
> ----------------------------------------------------
>
> Key: SPARK-32966
> URL: https://issues.apache.org/jira/browse/SPARK-32966
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.4.5
> Environment: EMR - 5.30.0; Hadoop -2.8.5; Spark- 2.4.5
> Reporter: Sujit Das
> Priority: Major
> Labels: AWS, pyspark, spark-conf
>
> 1. When I do a write without any partitioning, it takes 8 min:
> df2_merge.write.mode('overwrite').parquet(dest_path)
>
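> As a baseline check (a sketch, not something from the actual run), the write
> parallelism and the number of distinct posted_on values can be inspected
> first; a high task count multiplied by many partition values is a common
> cause of slow partitionBy writes:
>
>     # Number of tasks the write will use.
>     print(df2_merge.rdd.getNumPartitions())
>     # Number of output directories partitionBy("posted_on") will create.
>     print(df2_merge.select("posted_on").distinct().count())
>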
> 2. I added the conf spark.sql.sources.partitionOverwriteMode=dynamic; it
> took much longer (more than 50 min before I force-terminated the EMR
> cluster). I observed that the partitions had been created and the data files
> were present, but in the EMR console the process still showed as running,
> whereas the Spark history server showed no running or pending jobs.
> df2_merge.write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
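> A commonly suggested mitigation for this pattern (a sketch, not measured
> here) is to repartition by the partition column before writing, so each
> posted_on value is written by a small number of tasks instead of every task
> writing a file into every partition directory:
>
>     df2_merge.repartition("posted_on") \
>         .write.mode('overwrite').partitionBy("posted_on") \
>         .parquet(dest_path_latest)
>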
> 3. I then set a new conf, spark.sql.shuffle.partitions=3; it took 24 min:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
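> Note that coalesce(3) narrows the whole job to 3 tasks without a shuffle,
> which also caps the write parallelism at 3. A shuffle-based alternative
> (sketch only) that keeps rows with the same posted_on together would be:
>
>     df2_merge.repartition(3, "posted_on") \
>         .write.mode('overwrite').partitionBy("posted_on") \
>         .parquet(dest_path_latest)
>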
> 4. I then disabled that conf and ran the partitioned write again. It took
> 30 min:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
> The only conf common to all of the above scenarios is
> spark.sql.adaptive.coalescePartitions.initialPartitionNum=100.
> My goal is to reduce the time of writing with partitionBy. Is there anything
> I am missing?
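>
> For completeness, one more knob sometimes used to bound the number of
> records per output file in partitioned writes (a sketch; the value is a
> placeholder):
>
>     spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)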
>
>