Can you share some relevant source code?
> On 05.11.2018 at 07:58, ehbhaskar <ehbhas...@gmail.com> wrote:
>
> I have a PySpark job that inserts data into a Hive partitioned table using an
> `INSERT OVERWRITE` statement.
>
> The Spark job loads data quickly (in about 15 minutes) to a temp directory
> (~/.hive-***) in S3, but it is very slow moving the data from the temp
> directory to the target path; that step takes more than 40 minutes.
>
> I set the option mapreduce.fileoutputcommitter.algorithm.version=2 (the
> default is 1), but I still see no change.
>
> *Are there any ways to improve the performance of a Hive INSERT OVERWRITE
> query from Spark?*
>
> I also noticed that this behavior is even worse (i.e. the job takes even
> more time) with Hive tables that have many existing partitions; the data
> loads relatively fast into tables with fewer existing partitions.
>
> *Some additional details:*
> * The table is dynamically partitioned.
> * Spark version - 2.3.0
> * Hive version - 2.3.2-amzn-2
> * Hadoop version - 2.8.3-amzn-0
>
> PS: Other config options I have tried that didn't have much effect on
> job performance:
> * "hive.load.dynamic.partitions.thread" - "10"
> * "hive.mv.files.thread" - "30"
> * "fs.trash.interval" - "0"
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
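For reference while the actual code is shared, here is a minimal sketch of the kind of PySpark job described above. The table name, S3 path, and columns are hypothetical; the config values are the ones mentioned in the original post, set the way they would typically be passed from Spark (Hadoop settings via the spark.hadoop. prefix, Hive settings via SET statements).

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-insert-overwrite-example")
        # Output committer setting mentioned in the post (default is 1).
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Hive settings mentioned in the post, plus the two that dynamic
    # partitioning requires.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("SET hive.load.dynamic.partitions.thread=10")
    spark.sql("SET hive.mv.files.thread=30")
    spark.sql("SET fs.trash.interval=0")

    # Source data staged in S3 (hypothetical path and schema).
    df = spark.read.parquet("s3://my-bucket/staging/events/")
    df.createOrReplaceTempView("events_staging")

    # Dynamic-partition INSERT OVERWRITE; the partition column (dt) must be
    # the last column in the SELECT list.
    spark.sql("""
        INSERT OVERWRITE TABLE analytics.events PARTITION (dt)
        SELECT user_id, event_type, event_ts, dt
        FROM events_staging
    """)

If the job looks roughly like this, the slow final step described above would be the move of the committed files from the staging location into the partition directories, which on S3 is a copy per file rather than a cheap rename.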