zhangjw123321 commented on issue #10418: URL: https://github.com/apache/hudi/issues/10418#issuecomment-1878335425
  set hoodie.spark.sql.insert.into.operation=bulk_insert;
  set hoodie.bulkinsert.shuffle.parallelism=100;
  set spark.default.parallelism=100;
  set spark.sql.shuffle.partitions=100;

  After setting these parameters, the number of Hudi files on HDFS is still 10000.

  > @zhangjw123321 It looks like `hoodie.bulkinsert.shuffle.parallelism` cannot take effect on non-partitioned tables in the code. From the Spark UI, it may be that you had not set `spark.default.parallelism`, so `reduceByKey` falls back to the parent RDD's partition count. Can you try `set spark.default.parallelism=100;`? I think it will reduce the parallelism in stage 10 to 100.
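The fallback described in the quoted reply can be sketched with a toy model. This is plain Python, not Spark, and the function name is hypothetical; it mirrors the documented sizing rule of Spark's default partitioner (the real `org.apache.spark.Partitioner.defaultPartitioner` has extra cases when a parent RDD already carries a partitioner):

```python
# Toy model (not Spark itself) of how Spark sizes a shuffle such as
# reduceByKey when no explicit partition count is passed:
# - if spark.default.parallelism is set, that value is used;
# - otherwise the largest partition count among the parent RDDs wins.

def default_shuffle_partitions(parent_partition_counts, default_parallelism=None):
    """Hypothetical helper modeling Spark's default-partitioner sizing rule."""
    if default_parallelism is not None:
        return default_parallelism
    return max(parent_partition_counts)

# Without spark.default.parallelism, a parent RDD with 10000 partitions
# drives a 10000-task shuffle, hence 10000 output files:
print(default_shuffle_partitions([10000]))        # 10000
# With set spark.default.parallelism=100, the shuffle drops to 100 tasks:
print(default_shuffle_partitions([10000], 100))   # 100
```

This is why, in the scenario above, leaving `spark.default.parallelism` unset lets the 10000-partition parent RDD dictate the parallelism of the `reduceByKey` stage.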
