[
https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324293#comment-17324293
]
sivabalan narayanan edited comment on HUDI-1668 at 4/17/21, 3:33 PM:
---------------------------------------------------------------------
Hudi does have 3 sort modes: global sort, partition sort, and none. If you don't
really need any sorting, you can set the sort mode to none.
"hoodie.bulkinsert.sort.mode" is the config of interest.
The default value is "GLOBAL_SORT"; the other possible values are "PARTITION_SORT" and
"NONE".
And btw, we have a direct row-writing option (w/o converting to an RDD) which is
expected to be ~30% faster than regular bulk insert. Can you give it a shot?
You need to enable the config
"hoodie.datasource.write.row.writer.enable". This is supported only for
bulk_insert, btw.
But do note that this direct row-writing option does a global sort by default
and does not support different sort modes for now.
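A similar sketch with the row writer enabled (same placeholder names as above):
{code:scala}
// Sketch only: bulk_insert through the row-writing path.
df.write.format("hudi")
  .option("hoodie.table.name", tableName)
  .option("hoodie.datasource.write.operation", "bulk_insert")
  // write Rows directly instead of converting to an RDD first;
  // note: this path does a global sort by default (see above)
  .option("hoodie.datasource.write.row.writer.enable", "true")
  .mode("append")
  .save(basePath)
{code}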
> GlobalSortPartitioner is getting called twice during bulk_insert.
> -----------------------------------------------------------------
>
> Key: HUDI-1668
> URL: https://issues.apache.org/jira/browse/HUDI-1668
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Sugamber
> Priority: Minor
> Labels: sev:high, user-support-issues
> Attachments: 1st.png, 2nd.png, Screen Shot 2021-04-17 at 11.23.17
> AM.png
>
>
> Hi Team,
> I'm using the bulk insert option to load close to 2 TB of data. The process is
> taking nearly 2 hours to complete. While looking at the job log, it was
> identified that [sortBy at
> GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1]
> is running twice.
> It is first triggered at one stage. *Refer to this screenshot -> [^1st.png]*
> The second time it is triggered at the
> *[count at
> HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]*
> step.
> In both cases, the same number of jobs was triggered and the running times were
> close to each other. *Refer to this screenshot* -> [^2nd.png]
> Is there any way to run it only once so that the data can be loaded faster, or
> is this expected behaviour?
> *Spark and Hudi configurations*
>
> {code:java}
> Spark - 2.3.0
> Scala- 2.11.12
> Hudi - 0.7.0
>
> {code}
>
> Hudi Configuration
> {code:java}
> "hoodie.cleaner.commits.retained" = 2
> "hoodie.bulkinsert.shuffle.parallelism"=2000
> "hoodie.parquet.small.file.limit" = 100000000
> "hoodie.parquet.max.file.size" = 128000000
> "hoodie.index.bloom.num_entries" = 1800000
> "hoodie.bloom.index.filter.type" = "DYNAMIC_V0"
> "hoodie.bloom.index.filter.dynamic.max.entries" = 2500000
> "hoodie.bloom.index.bucketized.checking" = "false"
> "hoodie.datasource.write.operation" = "bulk_insert"
> "hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
> {code}
>
> Spark Configuration -
> {code:java}
> --num-executors 180
> --executor-cores 4
> --executor-memory 16g
> --driver-memory=24g
> --conf spark.rdd.compress=true
> --queue=default
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
> --conf spark.executor.memoryOverhead=1600
> --conf spark.driver.memoryOverhead=1200
> --conf spark.driver.maxResultSize=2g
> --conf spark.kryoserializer.buffer.max=512m
> {code}