PavelPetukhov edited a comment on issue #2959: URL: https://github.com/apache/hudi/issues/2959#issuecomment-848885930
The `.hoodie` directory structure is the following:

```
hdfs dfs -ls /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie
Found 7 items
drwxr-xr-x   - hdfs hadoop    0 2021-05-26 18:33 /path_to_location/foo/.hoodie/.aux
drwxr-xr-x   - hdfs hadoop    0 2021-05-26 18:33 /path_to_location/foo/.hoodie/.temp
drwxr-xr-x   - hdfs hadoop    0 2021-05-26 18:33 /path_to_location/foo/.hoodie/20210526183328.deltacommit
-rw-r--r--   3 hdfs hadoop  518 2021-05-26 18:33 /path_to_location/foo/.hoodie/20210526183328.deltacommit.inflight
-rw-r--r--   3 hdfs hadoop    0 2021-05-26 18:33 /path_to_location/foo/.hoodie/20210526183328.deltacommit.requested
drwxr-xr-x   - hdfs hadoop    0 2021-05-26 18:33 /path_to_location/foo/.hoodie/archived
-rw-r--r--   3 hdfs hadoop  391 2021-05-26 18:33 /path_to_location/foo/.hoodie/hoodie.properties
```

Also, I have removed everything unrelated, so the request looks like this:

```
/usr/local/spark/bin/spark-submit --conf "spark.yarn.submit.waitAppCompletion=false" \
  --conf "spark.dynamicAllocation.minExecutors=1" \
  --conf "spark.dynamicAllocation.maxExecutors=10" \
  --conf "spark.dynamicAllocation.enabled=true" \
  --conf "spark.dynamicAllocation.shuffleTracking.enabled=true" \
  --conf "spark.shuffle.service.enabled=true" \
  --conf "spark.eventLog.enabled=true" \
  --conf "spark.eventLog.dir=hdfs://xxx/eventLogging" \
  --conf "spark.executor.memoryOverhead=384" \
  --conf "spark.driver.memoryOverhead=384" \
  --conf "spark.driver.extraJavaOptions=-DsparkAappName=xxx -DlogIndex=GOLANG_JSON -DappName=data-lake-extractors-streamer -DlogFacility=stdout" \
  --packages org.apache.spark:spark-avro_2.12:2.4.7 \
  --master yarn \
  --deploy-mode cluster \
  --name xxx \
  --driver-memory 2G \
  --executor-memory 2G \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hdfs://xxx/user/hudi/hudi-utilities-bundle_2.12-0.8.0.jar \
  --op UPSERT \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
  --source-ordering-field __null_ts_ms \
  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
  --target-base-path /user/hdfs/raw_data/public/xxx/yyy \
  --target-table xxx \
  --hoodie-conf "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator" \
  --hoodie-conf "hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING" \
  --hoodie-conf "hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd" \
  --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-ddTHH:mm:ssZ,yyyy-MM-ddTHH:mm:ss.SSSZ" \
  --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex=" \
  --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.timezone=" \
  --hoodie-conf "hoodie.upsert.shuffle.parallelism=2" \
  --hoodie-conf "hoodie.insert.shuffle.parallelism=2" \
  --hoodie-conf "hoodie.delete.shuffle.parallelism=2" \
  --hoodie-conf "hoodie.bulkinsert.shuffle.parallelism=2" \
  --hoodie-conf "hoodie.embed.timeline.server=true" \
  --hoodie-conf "hoodie.filesystem.view.type=EMBEDDED_KV_STORE" \
  --hoodie-conf "hoodie.deltastreamer.schemaprovider.registry.url=http://xxx/subjects/xxx-value/versions/latest" \
  --hoodie-conf "bootstrap.servers=xxx" \
  --hoodie-conf "auto.offset.reset=earliest" \
  --hoodie-conf "group.id=hudi_group" \
  --hoodie-conf "schema.registry.url=http://xxx" \
  --hoodie-conf "hoodie.datasource.write.recordkey.field=id" \
  --hoodie-conf "hoodie.datasource.write.partitionpath.field=date:TIMESTAMP" \
  --hoodie-conf "hoodie.deltastreamer.source.kafka.topic=xxx"
```
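For anyone reading along: the `hoodie.deltastreamer.keygen.timebased.*` settings above declare that the `date` field arrives as an ISO-style string (one of two input formats) and is rewritten into a `yyyy/MM/dd` partition path. A minimal Python sketch of that mapping follows; this is only an illustration of the expected transformation, not Hudi's actual `TimestampBasedKeyGenerator` code, and the `to_partition_path` helper is a hypothetical name:

```python
from datetime import datetime

def to_partition_path(value: str) -> str:
    """Sketch: map a date string to a partition path, mirroring the
    input.dateformat / output.dateformat settings in the command above."""
    # Python equivalents of yyyy-MM-ddTHH:mm:ssZ and yyyy-MM-ddTHH:mm:ss.SSSZ
    input_formats = ["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S.%f%z"]
    for fmt in input_formats:
        try:
            dt = datetime.strptime(value, fmt)
            return dt.strftime("%Y/%m/%d")  # output format yyyy/MM/dd
        except ValueError:
            continue
    raise ValueError(f"unparseable timestamp: {value!r}")

print(to_partition_path("2021-05-26T18:33:28+0000"))  # 2021/05/26
```

If the incoming `date` values don't match either declared input format, the key generator would fail to parse them, which is one thing worth checking when partitioning misbehaves.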
