nandurj commented on issue #1586:
URL: https://github.com/apache/hudi/issues/1586#issuecomment-659633790
I am working with Hudi 0.5.2 on EMR 5.30 and running the job with
DeltaStreamer. This is the spark-submit command I use:
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--jars /usr/lib/spark/external/lib/spark-avro_2.11-2.4.5-amzn-0.jar \
--master yarn --deploy-mode client \
--executor-memory 10G --executor-cores 4 \
file:///usr/lib/hudi/hudi-utilities-bundle_2.11-0.5.2-incubating.jar \
--table-type COPY_ON_WRITE \
--source-ordering-field TIMESTAMP \
--continuous \
--enable-hive-sync \
--min-sync-interval-seconds 60 \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
--target-base-path s3://mybucket/CoWex --target-table table_test \
--payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--hoodie-conf hoodie.datasource.write.recordkey.field="Field1, Field2, Field3" \
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
--hoodie-conf hoodie.datasource.write.partitionpath.field="Field1" \
--hoodie-conf hoodie.datasource.hive_sync.database=testdb \
--hoodie-conf hoodie.datasource.hive_sync.table=test_table \
--hoodie-conf hoodie.datasource.hive_sync.partition_fields="datefield" \
--hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
--hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://mybucket/input
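
In a command this long, a trailing backslash with no space in front of it silently joins two arguments. One possible way to avoid that, sketched here assuming bash is the shell, is to collect the settings in an array and expand them into `--hoodie-conf` pairs. The variable names `hoodie_confs` and `conf_args` are illustrative and not part of Hudi. The values are copied from the command above.

```shell
#!/usr/bin/env bash
# Sketch: keep each Hudi setting as one array element, so no
# line-continuation backslashes are needed between settings.
hoodie_confs=(
  "hoodie.datasource.write.recordkey.field=Field1, Field2, Field3"
  "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator"
  "hoodie.datasource.write.partitionpath.field=Field1"
  "hoodie.datasource.hive_sync.database=testdb"
  "hoodie.datasource.hive_sync.table=test_table"
  "hoodie.datasource.hive_sync.partition_fields=datefield"
  "hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor"
  "hoodie.deltastreamer.source.dfs.root=s3://mybucket/input"
)

# Expand every setting into a "--hoodie-conf <key=value>" pair.
conf_args=()
for conf in "${hoodie_confs[@]}"; do
  conf_args+=(--hoodie-conf "$conf")
done

# Print one argument per line to check the result before submitting.
printf '%s\n' "${conf_args[@]}"
```

To use the result, `"${conf_args[@]}"` would be appended after the remaining DeltaStreamer options in the spark-submit call in place of the individual `--hoodie-conf` lines.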