PavelPetukhov opened a new issue #2959:
URL: https://github.com/apache/hudi/issues/2959


   While working with Hudi 0.7.0 we were able to store data from Kafka topics to HDFS.
   After migrating to 0.8.0 we discovered strange behavior: spark-submit finishes with status SUCCEEDED, but no data is actually stored in HDFS.
   Only the .hoodie folder is created in the target location, containing files such as .aux, .temp, *.deltacommit.inflight, *.deltacommit.requested, hoodie.properties, and archived.
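
   A quick way to tell whether any deltacommit actually completed is to look at the timeline files under .hoodie: a finished write leaves a plain `<instant>.deltacommit` file, while only `.inflight`/`.requested` files mean the commit never completed. A minimal sketch of that check (assuming the .hoodie listing is available locally; the instant timestamp below is made up):

   ```python
   import os
   import tempfile

   def timeline_status(hoodie_dir):
       """Split deltacommit timeline files into completed vs pending."""
       completed, pending = [], []
       for name in os.listdir(hoodie_dir):
           if name.endswith(".deltacommit"):
               completed.append(name)          # write fully finished
           elif name.endswith((".deltacommit.inflight",
                               ".deltacommit.requested")):
               pending.append(name)            # write started but never completed
       return completed, pending

   # Simulate the state described above: only inflight/requested files exist
   with tempfile.TemporaryDirectory() as d:
       for f in ("20210527120000.deltacommit.inflight",
                 "20210527120000.deltacommit.requested",
                 "hoodie.properties"):
           open(os.path.join(d, f), "w").close()
       completed, pending = timeline_status(d)
       print(len(completed), len(pending))  # 0 2
   ```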
   
   The spark-submit invocation looks like this (only Hudi-related configuration attached; we can send the full command if necessary).
   Please note that 0.7.0 worked with the same config (data was stored as expected); the only changes were:
     hudi-utilities-bundle_2.11:0.7.0 -> hudi-utilities-bundle_2.12:0.8.0
     spark-avro_2.11:2.4.7 -> spark-avro_2.12:2.4.7
     hoodie-utilities.jar taken from hudi-0.8.0-utilities-2.12.jar instead of hudi-0.7.0-utilities-2.11
   
   
   /usr/local/spark/bin/spark-submit --conf "spark.yarn.submit.waitAppCompletion=false" \
   --packages org.apache.hudi:hudi-utilities-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:2.4.7 \
   --master yarn \
   --deploy-mode cluster \
   --name xxx \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
   /app/hoodie-utilities.jar \
   --op BULK_INSERT \
   --table-type MERGE_ON_READ \
   --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
   --source-ordering-field __null_ts_ms \
   --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
   --enable-hive-sync \
   --target-base-path xxx \
   --target-table xxx \
   --hoodie-conf "hoodie.datasource.hive_sync.enable=true" \
   --hoodie-conf "hoodie.datasource.hive_sync.table=foo" \
   --hoodie-conf "hoodie.datasource.hive_sync.partition_fields=date:TIMESTAMP" \
   --hoodie-conf "hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor" \
   --hoodie-conf "hoodie.datasource.hive_sync.jdbcurl=" \
   --hoodie-conf "hoodie.upsert.shuffle.parallelism=2" \
   --hoodie-conf "hoodie.insert.shuffle.parallelism=2" \
   --hoodie-conf "hoodie.delete.shuffle.parallelism=2" \
   --hoodie-conf "hoodie.bulkinsert.shuffle.parallelism=2" \
   --hoodie-conf "hoodie.embed.timeline.server=true" \
   --hoodie-conf "hoodie.filesystem.view.type=EMBEDDED_KV_STORE" \
   --hoodie-conf "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator" \
   --hoodie-conf "hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING" \
   --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ" \
   --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex=" \
   --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.timezone=" \
   --hoodie-conf "hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd" \
   --hoodie-conf "hoodie.deltastreamer.schemaprovider.registry.url=xxx" \
   --hoodie-conf "xxx" \
   --hoodie-conf "auto.offset.reset=earliest" \
   --hoodie-conf "group.id=hudi_group" \
   --hoodie-conf "schema.registry.url=xxx" \
   --hoodie-conf "hoodie.parquet.small.file.limit=0" \
   --hoodie-conf "hoodie.clustering.inline=true" \
   --hoodie-conf "hoodie.clustering.inline.max.commits=4" \
   --hoodie-conf "hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824" \
   --hoodie-conf "hoodie.clustering.plan.strategy.small.file.limit=629145600" \
   --hoodie-conf "hoodie.datasource.write.recordkey.field=id" \
   --hoodie-conf "hoodie.datasource.write.partitionpath.field=date:TIMESTAMP" \
   --hoodie-conf "hoodie.deltastreamer.source.kafka.topic=xxx"
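
   For reference, the timestamp key-generator settings above ask Hudi to parse the partition field with one of the listed input formats and re-emit it as yyyy/MM/dd. An illustrative Python sketch of that mapping (the Java SimpleDateFormat patterns are translated to strptime/strftime directives as an assumption; the sample timestamps are made up):

   ```python
   from datetime import datetime

   # Java patterns from the config, translated to Python (assumption):
   #   yyyy-MM-dd'T'HH:mm:ssZ      -> %Y-%m-%dT%H:%M:%S%z
   #   yyyy-MM-dd'T'HH:mm:ss.SSSZ  -> %Y-%m-%dT%H:%M:%S.%f%z
   INPUT_FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S.%f%z"]
   OUTPUT_FORMAT = "%Y/%m/%d"   # yyyy/MM/dd -> slash-encoded day partitions

   def partition_path(ts_string):
       """Try each input format in turn, then re-emit as the partition path."""
       for fmt in INPUT_FORMATS:
           try:
               return datetime.strptime(ts_string, fmt).strftime(OUTPUT_FORMAT)
           except ValueError:
               continue
       raise ValueError(f"unparseable timestamp: {ts_string!r}")

   print(partition_path("2021-05-27T10:15:30+0000"))      # 2021/05/27
   print(partition_path("2021-05-27T10:15:30.123+0000"))  # 2021/05/27
   ```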
   
   * Hudi version : 0.8.0
   
   * Spark version : 2.4.7
   
   * Storage (HDFS/S3/GCS..) : hdfs
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
