PavelPetukhov edited a comment on issue #2959:
URL: https://github.com/apache/hudi/issues/2959#issuecomment-848885930


   @n3nash 
   
   The .hoodie directory structure is as follows:
   hdfs dfs -ls /user/hdfs/raw_data/public/ml_training_data/foo/.hoodie
   Found 7 items
   drwxr-xr-x   - hdfs hadoop          0 2021-05-26 18:33 /path_to_location/foo/.hoodie/.aux
   drwxr-xr-x   - hdfs hadoop          0 2021-05-26 18:33 /path_to_location/foo/.hoodie/.temp
   drwxr-xr-x   - hdfs hadoop          0 2021-05-26 18:33 /path_to_location/foo/.hoodie/20210526183328.deltacommit
   -rw-r--r--   3 hdfs hadoop        518 2021-05-26 18:33 /path_to_location/foo/.hoodie/20210526183328.deltacommit.inflight
   -rw-r--r--   3 hdfs hadoop          0 2021-05-26 18:33 /path_to_location/foo/.hoodie/20210526183328.deltacommit.requested
   drwxr-xr-x   - hdfs hadoop          0 2021-05-26 18:33 /path_to_location/foo/.hoodie/archived
   -rw-r--r--   3 hdfs hadoop        391 2021-05-26 18:33 /path_to_location/foo/.hoodie/hoodie.properties
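
   For anyone reading the listing, here is a minimal sketch of how the timeline file names can be split into instant, action and state. It assumes names of the form <instant>.<action>[.<state>], where a missing state suffix means the instant is completed. The helper name classify_instant is hypothetical and not part of Hudi. It is only meant for the instant files, not for entries such as hoodie.properties or .aux.

```shell
# Hypothetical helper (not part of Hudi): split a timeline file name
# into "<instant> <action> <state>".
# Assumption: names look like <instant>.<action>[.<state>], and a name
# without a state suffix denotes a completed instant.
classify_instant() {
  name=$(basename "$1")      # drop the directory part
  instant=${name%%.*}        # text before the first dot
  rest=${name#*.}            # text after the first dot
  action=${rest%%.*}         # first component of the remainder
  case "$rest" in
    *.*) state=${rest#*.} ;; # explicit suffix, e.g. inflight or requested
    *)   state=completed ;;  # no suffix
  esac
  printf '%s %s %s\n' "$instant" "$action" "$state"
}

# Example call on one of the entries above
classify_instant /path_to_location/foo/.hoodie/20210526183328.deltacommit.inflight
```

   Applied to the three 20210526183328 entries above, this yields the states completed, inflight and requested.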
   
   
   Also, I have removed everything unrelated, so the request looks like this:
   
   /usr/local/spark/bin/spark-submit --conf "spark.yarn.submit.waitAppCompletion=false" \
   --conf "spark.dynamicAllocation.minExecutors=1" \
   --conf "spark.dynamicAllocation.maxExecutors=10" \
   --conf "spark.dynamicAllocation.enabled=true" \
   --conf "spark.dynamicAllocation.shuffleTracking.enabled=true" \
   --conf "spark.shuffle.service.enabled=true" \
   --conf "spark.eventLog.enabled=true" \
   --conf "spark.eventLog.dir=hdfs://xxx/eventLogging" \
   --conf "spark.executor.memoryOverhead=384" \
   --conf "spark.driver.memoryOverhead=384" \
   --conf "spark.driver.extraJavaOptions=-DsparkAappName=xxx -DlogIndex=GOLANG_JSON -DappName=data-lake-extractors-streamer -DlogFacility=stdout" \
   --packages org.apache.spark:spark-avro_2.12:2.4.7 \
   --master yarn \
   --deploy-mode cluster \
   --name xxx \
   --driver-memory 2G \
   --executor-memory 2G \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
   hdfs://xxx/user/hudi/hudi-utilities-bundle_2.12-0.8.0.jar \
   --op UPSERT \
   --table-type MERGE_ON_READ \
   --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
   --source-ordering-field __null_ts_ms \
   --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
   --target-base-path /user/hdfs/raw_data/public/xxx/yyy \
   --target-table xxx \
   --hoodie-conf "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator" \
   --hoodie-conf "hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING" \
   --hoodie-conf "hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd" \
   --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-ddTHH:mm:ssZ,yyyy-MM-ddTHH:mm:ss.SSSZ" \
   --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex=" \
   --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.timezone=" \
   --hoodie-conf "hoodie.upsert.shuffle.parallelism=2" \
   --hoodie-conf "hoodie.insert.shuffle.parallelism=2" \
   --hoodie-conf "hoodie.delete.shuffle.parallelism=2" \
   --hoodie-conf "hoodie.bulkinsert.shuffle.parallelism=2" \
   --hoodie-conf "hoodie.embed.timeline.server=true" \
   --hoodie-conf "hoodie.filesystem.view.type=EMBEDDED_KV_STORE" \
   --hoodie-conf "hoodie.deltastreamer.schemaprovider.registry.url=http://xxx/subjects/xxx-value/versions/latest" \
   --hoodie-conf "bootstrap.servers=xxx" \
   --hoodie-conf "auto.offset.reset=earliest" \
   --hoodie-conf "group.id=hudi_group" \
   --hoodie-conf "schema.registry.url=http://xxx" \
   --hoodie-conf "hoodie.datasource.write.recordkey.field=id" \
   --hoodie-conf "hoodie.datasource.write.partitionpath.field=date:TIMESTAMP" \
   --hoodie-conf "hoodie.deltastreamer.source.kafka.topic=xxx" \
   



