Sugamber created HUDI-1668:
------------------------------
Summary: GlobalSortPartitioner is getting called twice during
bulk_insert.
Key: HUDI-1668
URL: https://issues.apache.org/jira/browse/HUDI-1668
Project: Apache Hudi
Issue Type: Bug
Reporter: Sugamber
Attachments: 1st.png, 2nd.png
Hi Team,
I'm using the bulk_insert option to load close to 2 TB of data. The process takes
nearly 2 hours to complete. While looking at the job log, I identified that [sortBy at
GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1]
is running twice.
It is first triggered as its own stage; refer to the attached screenshot (1st.png).
The second time it is triggered from the *[count at
HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]*
step (2nd.png).
In both cases, the same number of jobs were triggered and the running times are
close to each other.
Is there any way to run the sort only once so that the data can be loaded faster?
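For context, this looks like it could be ordinary Spark lazy evaluation rather than anything Hudi-specific: if the sorted dataset is not persisted, every downstream action (the write itself and a separate count) re-runs the whole lineage, including the global sort. A minimal sketch of that behavior in plain Java (not Hudi/Spark code; {{RecomputeDemo}} and {{expensiveSort}} are illustrative names):

{code:java}
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RecomputeDemo {
    static final AtomicInteger sortRuns = new AtomicInteger();

    // Stands in for the sortBy stage: it is re-executed every time the
    // "lineage" is evaluated, because nothing caches its result.
    static List<Integer> expensiveSort() {
        sortRuns.incrementAndGet();
        return Stream.of(3, 1, 2).sorted().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long written = expensiveSort().size(); // action 1: the actual write
        long counted = expensiveSort().size(); // action 2: the record count
        System.out.println(sortRuns.get());    // prints 2 -- the sort ran twice
    }
}
{code}

If that is indeed what is happening here, caching the sorted data before the count (cf. {{RDD.persist()}}) would cut the expensive sort down to a single run.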
*Spark and Hudi versions*
{code:java}
Spark - 2.3.0
Scala- 2.11.12
Hudi - 0.7.0
{code}
Hudi Configuration
{code:java}
"hoodie.cleaner.commits.retained" = 2
"hoodie.bulkinsert.shuffle.parallelism"=2000
"hoodie.parquet.small.file.limit" = 100000000
"hoodie.parquet.max.file.size" = 128000000
"hoodie.index.bloom.num_entries" = 1800000
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"
"hoodie.bloom.index.filter.dynamic.max.entries" = 2500000
"hoodie.bloom.index.bucketized.checking" = "false"
"hoodie.datasource.write.operation" = "bulk_insert"
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}
Spark Configuration
{code:java}
--num-executors 180
--executor-cores 4
--executor-memory 16g
--driver-memory=24g
--conf spark.rdd.compress=true
--queue=default
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600
--conf spark.driver.memoryOverhead=1200
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m
{code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)