jeguiguren-cohere opened a new issue, #9791:
URL: https://github.com/apache/hudi/issues/9791
**Describe the problem you faced**
We are using Hudi on AWS Glue to continuously merge small batches of data into
bronze tables, and we are noticing slow write performance (20+ minutes per
batch) when upserting to a COW table.
The target table is relatively small, approximately 6 million rows x 1000
columns, and incoming batches contain fewer than 50,000 records (which the
preCombine step reduces to fewer than 10,000 unique records). The table is not
partitioned because it is small, and it is currently configured with a global
simple index.
**Expected behavior**
I would expect writes of this size to take a few minutes, similar to a vanilla
Spark job writing Parquet files to S3.
**Environment Description**
* Hudi version : 0.12.1
* Spark version : 3.3
* Hive version : n/a
* Hadoop version : n/a
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no (AWS Glue 4.0)
* Resources: 4 G1X workers (1 driver + 3 executors), each with 4 vCPUs and
  16 GB of memory.
**Additional context**
Table config in `/.hoodie/hoodie.properties`:
```
#Updated at 2023-08-14T16:51:53.434Z
#Mon Aug 14 16:51:53 UTC 2023
hoodie.table.timeline.timezone=LOCAL
hoodie.table.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.table.precombine.field=clusterTime
hoodie.table.version=5
hoodie.database.name=
hoodie.datasource.write.hive_style_partitioning=false
hoodie.table.checksum=3456772992
hoodie.partition.metafile.use.base.format=false
hoodie.archivelog.folder=archived
hoodie.table.name=hudi_raw_mytable
hoodie.populate.meta.fields=true
hoodie.table.type=COPY_ON_WRITE
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.base.file.format=PARQUET
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.metadata.partitions=
hoodie.timeline.layout.version=1
hoodie.table.recordkey.fields=documentKey
hoodie.table.partition.fields=
```
Hudi config:
```
"hoodie.table.name": TABLE,
"hoodie.datasource.write.recordkey.field": "documentKey",
"hoodie.datasource.write.precombine.field": "clusterTime",
"hoodie.datasource.write.reconcile.schema": "false",
"hoodie.schema.on.read.enable": "true",
"hoodie.bulkinsert.sort.mode": "GLOBAL_SORT",
"hoodie.metadata.enable": "false",
"hoodie.datasource.hive_sync.database": DB_NAME,
"hoodie.datasource.hive_sync.table": TABLE,
"hoodie.datasource.hive_sync.use_jdbc": "false",
"hoodie.datasource.hive_sync.enable": "true",
"hoodie.datasource.hive_sync.partition_extractor_class":
"org.apache.hudi.hive.NonPartitionedExtractor",
"hoodie.datasource.hive_sync.partition_value_extractor":
"org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor",
"hoodie.index.type": "GLOBAL_SIMPLE",
"hoodie.datasource.write.keygenerator.class":
"org.apache.hudi.keygen.NonpartitionedKeyGenerator"
```
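For context, a minimal sketch of how these options would typically be assembled and passed to the writer in the Glue (PySpark) job. The helper name, `DB_NAME`/`TABLE` values, target path, and the write call itself are assumptions for illustration, not taken from the issue:

```python
# Sketch: the writer options reported above, assembled as a Python dict.
# build_hudi_options, the sample table/database names, and the commented-out
# write call are hypothetical; only the option keys/values mirror the issue.
def build_hudi_options(table: str, db_name: str) -> dict:
    """Return the Hudi writer options reported in the issue."""
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.recordkey.field": "documentKey",
        "hoodie.datasource.write.precombine.field": "clusterTime",
        "hoodie.datasource.write.reconcile.schema": "false",
        "hoodie.schema.on.read.enable": "true",
        "hoodie.bulkinsert.sort.mode": "GLOBAL_SORT",
        "hoodie.metadata.enable": "false",
        "hoodie.datasource.hive_sync.database": db_name,
        "hoodie.datasource.hive_sync.table": table,
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.NonPartitionedExtractor",
        "hoodie.datasource.hive_sync.partition_value_extractor":
            "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor",
        "hoodie.index.type": "GLOBAL_SIMPLE",
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    }

options = build_hudi_options("hudi_raw_mytable", "my_db")

# In the Glue job the upsert would look roughly like:
# (batch_df.write.format("hudi")
#     .options(**options)
#     .mode("append")
#     .save("s3://bucket/path/hudi_raw_mytable"))
```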
Spark stages show that the majority of the time (20+ minutes) is spent in the
"Doing partition and writing data" stage:
<img width="1784" alt="Screen Shot 2023-09-26 at 6 58 37 PM"
src="https://github.com/apache/hudi/assets/67695657/28d814d3-594e-49a8-bb49-8aa60264b967">