fzong76 opened a new issue, #5735:
URL: https://github.com/apache/hudi/issues/5735

   I am using deltastreamer to load files uploaded to s3 bucket. Everything 
works fine with `--class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer`, but failed with 
`--class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer` 
when trying to load multiple tables.
   
   Here is part of the log. Seems it processed the data, but I don't see any 
dataset created.
   
   ```
   22/06/01 22:57:29 INFO TaskSetManager: Starting task 197.0 in stage 23.0 
(TID 1802) (ip-172-31-20-205.ec2.internal, executor 2, partition 197, 
PROCESS_LOCAL, 4282 bytes) taskResourceAssignments Map()
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 189.0 in stage 23.0 
(TID 1794) in 7 ms on ip-172-31-20-205.ec2.internal (executor 2) (190/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 190.0 in stage 23.0 
(TID 1795) in 7 ms on ip-172-31-20-205.ec2.internal (executor 2) (191/200)
   22/06/01 22:57:29 INFO TaskSetManager: Starting task 198.0 in stage 23.0 
(TID 1803) (ip-172-31-20-205.ec2.internal, executor 2, partition 198, 
PROCESS_LOCAL, 4282 bytes) taskResourceAssignments Map()
   22/06/01 22:57:29 INFO TaskSetManager: Starting task 199.0 in stage 23.0 
(TID 1804) (ip-172-31-30-20.ec2.internal, executor 1, partition 199, 
PROCESS_LOCAL, 4282 bytes) taskResourceAssignments Map()
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 192.0 in stage 23.0 
(TID 1797) in 7 ms on ip-172-31-30-20.ec2.internal (executor 1) (192/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 191.0 in stage 23.0 
(TID 1796) in 7 ms on ip-172-31-30-20.ec2.internal (executor 1) (193/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 196.0 in stage 23.0 
(TID 1801) in 4 ms on ip-172-31-30-20.ec2.internal (executor 1) (194/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 195.0 in stage 23.0 
(TID 1800) in 5 ms on ip-172-31-30-20.ec2.internal (executor 1) (195/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 194.0 in stage 23.0 
(TID 1799) in 7 ms on ip-172-31-20-205.ec2.internal (executor 2) (196/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 199.0 in stage 23.0 
(TID 1804) in 3 ms on ip-172-31-30-20.ec2.internal (executor 1) (197/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 197.0 in stage 23.0 
(TID 1802) in 7 ms on ip-172-31-20-205.ec2.internal (executor 2) (198/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 198.0 in stage 23.0 
(TID 1803) in 7 ms on ip-172-31-20-205.ec2.internal (executor 2) (199/200)
   22/06/01 22:57:29 INFO TaskSetManager: Finished task 193.0 in stage 23.0 
(TID 1798) in 11 ms on ip-172-31-20-205.ec2.internal (executor 2) (200/200)
   22/06/01 22:57:29 INFO YarnScheduler: Removed TaskSet 23.0, whose tasks have 
all completed, from pool
   22/06/01 22:57:29 INFO DAGScheduler: ResultStage 23 (collect at 
SparkRejectUpdateStrategy.java:52) finished in 0.221 s
   22/06/01 22:57:29 INFO DAGScheduler: Job 8 is finished. Cancelling potential 
speculative or zombie tasks for this job
   22/06/01 22:57:29 INFO YarnScheduler: Killing all running tasks in stage 23: 
Stage finished
   22/06/01 22:57:29 INFO DAGScheduler: Job 8 finished: collect at 
SparkRejectUpdateStrategy.java:52, took 0.468092 s
   22/06/01 22:57:29 INFO SparkContext: Starting job: sum at DeltaSync.java:557
   22/06/01 22:57:29 INFO DAGScheduler: Job 9 finished: sum at 
DeltaSync.java:557, took 0.000174 s
   22/06/01 22:57:29 INFO SparkContext: Starting job: sum at DeltaSync.java:558
   22/06/01 22:57:29 INFO DAGScheduler: Job 10 finished: sum at 
DeltaSync.java:558, took 0.000157 s
   22/06/01 22:57:29 INFO SparkContext: Starting job: collect at 
SparkRDDWriteClient.java:124
   22/06/01 22:57:29 INFO DAGScheduler: Job 11 finished: collect at 
SparkRDDWriteClient.java:124, took 0.000179 s
   22/06/01 22:57:29 INFO S3NativeFileSystem: Opening 
's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
   22/06/01 22:57:29 INFO S3NativeFileSystem: Opening 
's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
   22/06/01 22:57:29 INFO MultipartUploadOutputStream: close closed:false 
s3://fei-hudi-demo/s3_hudi_table/.hoodie/20220601225723633.commit
   22/06/01 22:57:29 INFO S3NativeFileSystem: Opening 
's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
   22/06/01 22:57:29 INFO S3NativeFileSystem: Opening 
's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
   22/06/01 22:57:30 INFO S3NativeFileSystem: Opening 
's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
   22/06/01 22:57:30 INFO MapPartitionsRDD: Removing RDD 73 from persistence 
list
   22/06/01 22:57:30 INFO BlockManager: Removing RDD 73
   22/06/01 22:57:30 INFO MapPartitionsRDD: Removing RDD 60 from persistence 
list
   22/06/01 22:57:30 INFO BlockManager: Removing RDD 60
   22/06/01 22:57:30 INFO S3NativeFileSystem: Opening 
's3://fei-hudi-demo/s3_hudi_table/.hoodie/hoodie.properties' for reading
   22/06/01 22:57:30 INFO S3NativeFileSystem: Opening 
's3://fei-hudi-demo/s3_hudi_table/.hoodie/20220601225723633.commit' for reading
   22/06/01 22:57:30 INFO S3NativeFileSystem: Opening 
's3://fei-hudi-demo/s3_meta_table/.hoodie/hoodie.properties' for reading
   22/06/01 22:57:30 WARN S3EventsHoodieIncrSource: Already caught up. Begin 
Checkpoint was :20220601225531374
   ```
   
   Here is the command I used
   ```
   spark-submit \
   --jars 
"/usr/lib/spark/external/lib/spark-avro.jar,s3://fei-hudi-demo/aws-java-sdk-sqs-1.12.220.jar"
 \
   --master yarn --deploy-mode client \
   --class 
org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer 
/usr/lib/hudi/hudi-utilities-bundle.jar \
   --table-type COPY_ON_WRITE \
   --enable-hive-sync \
   --source-ordering-field metricvalue --base-path-prefix 
s3://fei-hudi-demo/s3_hudi_table \
   --target-table dummy_table --continuous --min-sync-interval-seconds 10 \
   --props s3://fei-hudi-demo/hudi-ingestion-config/source.properties \
   --config-folder s3://fei-hudi-demo/hudi-ingestion-config \
   --source-class org.apache.hudi.utilities.sources.S3EventsHoodieIncrSource 
   ```
   
   source.properties
   ```
   hoodie.deltastreamer.ingestion.tablesToBeIngested=fei_hudi_test.table1
   
hoodie.deltastreamer.ingestion.fei_hudi_test.table1.configFile=s3://fei-hudi-demo/hudi-ingestion-config/t1.properties
   
   
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
   
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
   hoodie.datasource.write.hive_style_partitioning=true 
   hoodie.datasource.hive_sync.database=fei_hudi_test 
   hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt=true 
   hoodie.deltastreamer.source.s3incr.key.prefix=""
   ```
   
   t1.properties
   ```
   hoodie.datasource.write.recordkey.field=launchId
   
hoodie.datasource.write.partitionpath.field=appPlatform:simple,appVersion:simple
   hoodie.datasource.hive_sync.partition_fields=appPlatform,appVersion
   hoodie.deltastreamer.source.hoodieincr.path=s3://fei-hudi-demo/s3_meta_table
   ```
   
   Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to