Ethan Guo created HUDI-3375:
-------------------------------

             Summary: Investigate deltastreamer continuous mode getting stuck when metadata table is enabled
                 Key: HUDI-3375
                 URL: https://issues.apache.org/jira/browse/HUDI-3375
             Project: Apache Hudi
          Issue Type: Task
            Reporter: Ethan Guo
            Assignee: Ethan Guo
             Fix For: 0.11.0


Deltastreamer runs in continuous mode, writing upserts to a MOR table with async 
compaction, clustering, and cleaning, and with archival and the metadata table enabled:
{code:java}
/Users/ethan/Work/lib/spark-3.2.0-bin-hadoop3.2/bin/spark-submit \
      --master local[3] \
      --driver-memory 3g --executor-memory 1g --num-executors 3 --executor-cores 1 \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
      --conf spark.sql.catalogImplementation=hive \
      --conf spark.driver.maxResultSize=1g \
      --conf spark.speculation=true \
      --conf spark.speculation.multiplier=1.0 \
      --conf spark.speculation.quantile=0.5 \
      --packages org.apache.spark:spark-avro_2.12:3.2.0 \
      --jars /Users/ethan/Work/repo/hudi-benchmarks/target/hudi-benchmarks-0.1-SNAPSHOT.jar \
      --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
      /Users/ethan/Work/lib/hudi-utilities-bundle_2.12-0.11.0-SNAPSHOT-no-error-inj.jar \
      --props /Users/ethan/Work/scripts/metadata_test_ds_mor_continuous.properties \
      --source-class BenchmarkDataSource \
      --source-ordering-field ts \
      --target-base-path /Users/ethan/Work/data/hudi/metadata_test_ds_mor_continuous_1 \
      --target-table metadata_test_ds_mor_continuous_table_1 \
      --table-type MERGE_ON_READ \
      --op UPSERT \
      --continuous >> metadata_test_ds_mor_continuous_1_output.log 2>&1
{code}
metadata_test_ds_mor_continuous.properties:
{code:java}
hoodie.upsert.shuffle.parallelism=40
hoodie.insert.shuffle.parallelism=40
hoodie.delete.shuffle.parallelism=40
hoodie.bulkinsert.shuffle.parallelism=40
# Key fields, for kafka example
hoodie.datasource.write.recordkey.field=key
hoodie.datasource.write.partitionpath.field=partition
# Schema provider props (change to absolute path based on your installation)
hoodie.deltastreamer.schemaprovider.source.schema.file=file:/Users/ethan/Work/scripts/benchmark_schema.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=file:/Users/ethan/Work/scripts/benchmark_schema.avsc
# DFS Source
hoodie.deltastreamer.source.dfs.root=file:/Users/ethan/Work/data/hudi/benchmark_sample_upserts2
benchmark.input.source.path=file:/Users/ethan/Work/data/hudi/benchmark_sample_upserts2
# Clustering
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=6
# Compaction
hoodie.compact.inline.max.delta.commits=3
# Clean and archive
hoodie.clean.async=true
hoodie.keep.max.commits=7
hoodie.keep.min.commits=5
hoodie.cleaner.commits.retained=4
# Concurrency control
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
# Metadata table
hoodie.metadata.compact.max.delta.commits=5
hoodie.metadata.keep.min.commits=8
hoodie.metadata.keep.max.commits=12 {code}
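As a quick sanity check on the retention settings above, here is a minimal Python sketch that parses a properties file and flags inconsistent archival and cleaner values. It assumes that keep.max.commits must be greater than keep.min.commits and that hoodie.keep.min.commits must be greater than hoodie.cleaner.commits.retained. These rules are my understanding of Hudi's validation and should be confirmed against the version in use. The function names parse_properties and retention_problems are hypothetical helpers, not part of Hudi.

```python
def parse_properties(text):
    """Parse Java-style key=value lines, skipping blanks and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def retention_problems(props):
    """Return human-readable problems found in the retention settings.

    The rules below are assumptions about Hudi's config validation,
    not a copy of it.
    """
    def as_int(key):
        value = props.get(key)
        return int(value) if value is not None else None

    problems = []
    # Archival bounds for the data table and for the metadata table.
    for prefix in ("hoodie.", "hoodie.metadata."):
        lo = as_int(prefix + "keep.min.commits")
        hi = as_int(prefix + "keep.max.commits")
        if lo is not None and hi is not None and lo >= hi:
            problems.append(
                f"{prefix}keep.max.commits={hi} should exceed "
                f"{prefix}keep.min.commits={lo}"
            )
    # Archival should not remove commits the cleaner still needs.
    data_min = as_int("hoodie.keep.min.commits")
    retained = as_int("hoodie.cleaner.commits.retained")
    if data_min is not None and retained is not None and data_min <= retained:
        problems.append(
            f"hoodie.keep.min.commits={data_min} should exceed "
            f"hoodie.cleaner.commits.retained={retained}"
        )
    return problems
```

An empty result only means these particular rules hold; it does not rule out other interactions between archival, cleaning, and the metadata table.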
The deltastreamer cannot proceed further after around 50 commits.
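
To see where the pipeline stalls, it may help to summarize the timeline under the table's .hoodie directory. The Python sketch below counts completed instants per action and lists instants that only have requested or inflight files. It assumes the timeline file naming <instant>.<action>[.requested|.inflight], with a bare <instant>.inflight treated as a pending commit; that layout is an assumption for this Hudi version. The names summarize_instants and summarize_timeline are hypothetical helpers.

```python
from collections import Counter
from pathlib import Path

PENDING_STATES = {"requested", "inflight"}


def summarize_instants(file_names):
    """Group timeline file names by instant; return (completed, pending).

    completed: Counter of action -> number of completed instants.
    pending:   sorted list of (instant, action, state) with no completed file.
    """
    by_instant = {}
    for name in file_names:
        parts = name.split(".")
        # Skip non-instant entries such as hoodie.properties or .aux.
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        by_instant.setdefault(parts[0], []).append(parts[1:])

    completed = Counter()
    pending = []
    for instant in sorted(by_instant):
        suffixes = by_instant[instant]
        done = [s for s in suffixes if s[-1] not in PENDING_STATES]
        if done:
            # Requested/inflight files stay behind after completion (assumed).
            completed[done[0][0]] += 1
            continue
        has_inflight = any(s[-1] == "inflight" for s in suffixes)
        state = "inflight" if has_inflight else "requested"
        # A bare "<instant>.inflight" carries no action; assume a commit.
        actions = [s[0] for s in suffixes if len(s) > 1]
        pending.append((instant, actions[0] if actions else "commit", state))
    return completed, pending


def summarize_timeline(base_path):
    """Read file names from <base_path>/.hoodie and summarize them."""
    hoodie_dir = Path(base_path) / ".hoodie"
    names = (p.name for p in hoodie_dir.iterdir() if p.is_file())
    return summarize_instants(names)
```

For example, calling summarize_timeline on the --target-base-path used above, and on its metadata subfolder under .hoodie if the metadata table lives there, would show whether a pending compaction, clustering, or delta commit is left behind at the point where progress stops.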



