[
https://issues.apache.org/jira/browse/SPARK-32966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200571#comment-17200571
]
Takeshi Yamamuro commented on SPARK-32966:
------------------------------------------
Is this a question? At the very least, I think you need to provide more
information (e.g., a complete, self-contained query that reproduces the issue).
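
For reference, a complete reproduction would look something like the sketch
below (the session conf mirrors the one mentioned in the report; the input
path, schema, and data volume are placeholders, not details from the report):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("SPARK-32966-repro")
             .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
             .getOrCreate())

    # Placeholder input; the report does not say how df2_merge is built.
    df2_merge = spark.read.parquet("s3://some-bucket/input/")

    # The slow step under discussion: an overwrite partitioned by posted_on.
    df2_merge.write.mode("overwrite") \
        .partitionBy("posted_on") \
        .parquet("s3://some-bucket/output/")
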
> Spark | partitionBy is taking a long time to process
> ----------------------------------------------------
>
> Key: SPARK-32966
> URL: https://issues.apache.org/jira/browse/SPARK-32966
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.4.5
> Environment: EMR - 5.30.0; Hadoop -2.8.5; Spark- 2.4.5
> Reporter: Sujit Das
> Priority: Major
> Labels: AWS, pyspark, spark-conf
>
> 1. When I do a write without any partitioning, it takes 8 min:
> df2_merge.write.mode('overwrite').parquet(dest_path)
>
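> As a baseline check (a sketch, not something from the actual run), the write
> parallelism and the number of distinct posted_on values can be inspected
> first; a high task count multiplied by many partition values is a common
> cause of slow partitionBy writes:
>
>     # Number of tasks the write will use.
>     print(df2_merge.rdd.getNumPartitions())
>     # Number of output directories partitionBy("posted_on") will create.
>     print(df2_merge.select("posted_on").distinct().count())
>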
> 2. I added the conf spark.sql.sources.partitionOverwriteMode=dynamic; it
> took much longer (more than 50 min before I force-terminated the EMR
> cluster). I observed that the partitions had been created and the data files
> were present, but in the EMR console the process still showed as running,
> whereas the Spark history server showed no running or pending jobs.
> df2_merge.write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
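> A commonly suggested mitigation for this pattern (a sketch, not measured
> here) is to repartition by the partition column before writing, so each
> posted_on value is written by a small number of tasks instead of every task
> writing a file into every partition directory:
>
>     df2_merge.repartition("posted_on") \
>         .write.mode('overwrite').partitionBy("posted_on") \
>         .parquet(dest_path_latest)
>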
> 3. I then set a new conf, spark.sql.shuffle.partitions=3; it took 24 min:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
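> Note that coalesce(3) narrows the whole job to 3 tasks without a shuffle,
> which also caps the write parallelism at 3. A shuffle-based alternative
> (sketch only) that keeps rows with the same posted_on together would be:
>
>     df2_merge.repartition(3, "posted_on") \
>         .write.mode('overwrite').partitionBy("posted_on") \
>         .parquet(dest_path_latest)
>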
> 4. I then disabled that conf and ran the partitioned write again. It took
> 30 min:
> df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
>
> The only conf common to all of the above scenarios is
> spark.sql.adaptive.coalescePartitions.initialPartitionNum=100.
> My goal is to reduce the time of writing with partitionBy. Is there anything
> I am missing?
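>
> For completeness, one more knob sometimes used to bound the number of
> records per output file in partitioned writes (a sketch; the value is a
> placeholder):
>
>     spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)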
>
>