jeguiguren-cohere opened a new issue, #9791:
URL: https://github.com/apache/hudi/issues/9791
**Describe the problem you faced**
We are using Hudi on AWS Glue to continuously merge small batches of data into
bronze tables, and we are noticing slow write performance (20+ minutes per
batch) when upserting to a COW table.
The target table is relatively small, approximately 6 million rows x 1000
columns, and incoming batches contain fewer than 50,000 records (which the
preCombine step reduces to fewer than 10,000 unique records). The table is not
partitioned because it is small, and it is currently configured with a global
simple index.
**Expected behavior**
I would expect writes of this size to take a few minutes, similar to a vanilla
Spark job writing Parquet files to S3.
**Environment Description**
* Hudi version : 0.12.1
* Spark version : 3.3
* Hive version : n/a
* Hadoop version : n/a
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no (AWS Glue 4.0)
* Resources: 4 G1X workers (1 driver + 3 executors), each with 4 vCPUs and
  16 GB of memory.
**Additional context**
Table config in `/.hoodie/hoodie.properties`:
```
#Updated at 2023-08-14T16:51:53.434Z
#Mon Aug 14 16:51:53 UTC 2023
hoodie.table.timeline.timezone=LOCAL
hoodie.table.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.table.precombine.field=clusterTime
hoodie.table.version=5
hoodie.database.name=
hoodie.datasource.write.hive_style_partitioning=false
hoodie.table.checksum=3456772992
hoodie.partition.metafile.use.base.format=false
hoodie.archivelog.folder=archived
hoodie.table.name=hudi_raw_mytable
hoodie.populate.meta.fields=true
hoodie.table.type=COPY_ON_WRITE
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.base.file.format=PARQUET
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.metadata.partitions=
hoodie.timeline.layout.version=1
hoodie.table.recordkey.fields=documentKey
hoodie.table.partition.fields=
```
Hudi config:
```
"hoodie.table.name": TABLE,
"hoodie.datasource.write.recordkey.field": "documentKey",
"hoodie.datasource.write.precombine.field": "clusterTime",
"hoodie.datasource.write.reconcile.schema": "false",
"hoodie.schema.on.read.enable": "true",
"hoodie.bulkinsert.sort.mode": "GLOBAL_SORT",
"hoodie.metadata.enable": "false",
"hoodie.datasource.hive_sync.database": DB_NAME,
"hoodie.datasource.hive_sync.table": TABLE,
"hoodie.datasource.hive_sync.use_jdbc": "false",
"hoodie.datasource.hive_sync.enable": "true",
"hoodie.datasource.hive_sync.partition_extractor_class":
"org.apache.hudi.hive.NonPartitionedExtractor",
"hoodie.datasource.hive_sync.partition_value_extractor":
"org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor",
"hoodie.index.type": "GLOBAL_SIMPLE",
"hoodie.datasource.write.keygenerator.class":
"org.apache.hudi.keygen.NonpartitionedKeyGenerator"
```
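For context, a minimal sketch of how these options would typically be assembled and passed to the writer in the Glue (PySpark) job. The helper name, `DB_NAME`/`TABLE` values, target path, and the write call itself are assumptions for illustration, not taken from the issue:

```python
# Sketch: the writer options reported above, assembled as a Python dict.
# build_hudi_options, the sample table/database names, and the commented-out
# write call are hypothetical; only the option keys/values mirror the issue.
def build_hudi_options(table: str, db_name: str) -> dict:
    """Return the Hudi writer options reported in the issue."""
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.recordkey.field": "documentKey",
        "hoodie.datasource.write.precombine.field": "clusterTime",
        "hoodie.datasource.write.reconcile.schema": "false",
        "hoodie.schema.on.read.enable": "true",
        "hoodie.bulkinsert.sort.mode": "GLOBAL_SORT",
        "hoodie.metadata.enable": "false",
        "hoodie.datasource.hive_sync.database": db_name,
        "hoodie.datasource.hive_sync.table": table,
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.NonPartitionedExtractor",
        "hoodie.datasource.hive_sync.partition_value_extractor":
            "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor",
        "hoodie.index.type": "GLOBAL_SIMPLE",
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    }

options = build_hudi_options("hudi_raw_mytable", "my_db")

# In the Glue job the upsert would look roughly like:
# (batch_df.write.format("hudi")
#     .options(**options)
#     .mode("append")
#     .save("s3://bucket/path/hudi_raw_mytable"))
```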
Spark stages show that the majority of the time (20+ minutes) is spent in the
"Doing partition and writing data" stage:
<img width="1784" alt="Screen Shot 2023-09-26 at 6 58 37 PM"
src="https://github.com/apache/hudi/assets/67695657/28d814d3-594e-49a8-bb49-8aa60264b967">