PavelPetukhov opened a new issue #2959:
URL: https://github.com/apache/hudi/issues/2959
While working with Hudi 0.7.0 we were able to store data from Kafka topics
to HDFS.
We tried to migrate to 0.8.0, but we've discovered strange behavior:
spark-submit finishes with status SUCCEEDED, yet no data is actually stored
in HDFS.
Only the .hoodie folder is created in the desired location, containing files
like .aux, .temp, deltacommit.inflight, deltacommit.requested,
hoodie.properties, archived
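A quick way to confirm that writes never complete is to list the timeline and look for deltacommit instants stuck in requested/inflight state with no completed counterpart. A minimal sketch that simulates locally the listing we see (the instant timestamp 20210601120000 and the demo_table path are made up for illustration; on the real cluster this would be `hdfs dfs -ls /path/to/table/.hoodie`):

```shell
# Recreate locally the .hoodie layout we observe after the 0.8.0 run
# (illustrative file names; real instants carry the commit timestamp).
mkdir -p demo_table/.hoodie/.aux demo_table/.hoodie/.temp demo_table/.hoodie/archived
touch demo_table/.hoodie/hoodie.properties
touch demo_table/.hoodie/20210601120000.deltacommit.requested
touch demo_table/.hoodie/20210601120000.deltacommit.inflight

# A healthy MERGE_ON_READ write also leaves a completed instant, e.g.
# 20210601120000.deltacommit - its absence means the commit never finished.
if ls demo_table/.hoodie/*.deltacommit >/dev/null 2>&1; then
  echo "completed deltacommit found"
else
  echo "no completed deltacommit - write never finished"
fi
```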
The spark-submit looks like this (only Hudi-related configuration is
attached; we can send the full command if necessary):
(
Please note that 0.7.0 with the same config worked (data was stored as
expected); the only changes were:
hudi-utilities-bundle_2.11:0.7.0 changed to hudi-utilities-bundle_2.12:0.8.0
spark-avro_2.11:2.4.7 changed to spark-avro_2.12:2.4.7
hoodie-utilities.jar is now hudi-0.8.0-utilities-2.12.jar instead of
hudi-0.7.0-utilities-2.11.jar
)
/usr/local/spark/bin/spark-submit \
--conf "spark.yarn.submit.waitAppCompletion=false" \
--packages org.apache.hudi:hudi-utilities-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:2.4.7 \
--master yarn \
--deploy-mode cluster \
--name xxx \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
/app/hoodie-utilities.jar \
--op BULK_INSERT \
--table-type MERGE_ON_READ \
--source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
--source-ordering-field __null_ts_ms \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--enable-hive-sync \
--target-base-path xxx \
--target-table xxx \
--hoodie-conf "hoodie.datasource.hive_sync.enable=true" \
--hoodie-conf "hoodie.datasource.hive_sync.table=foo" \
--hoodie-conf "hoodie.datasource.hive_sync.partition_fields=date:TIMESTAMP" \
--hoodie-conf "hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor" \
--hoodie-conf "hoodie.datasource.hive_sync.jdbcurl=" \
--hoodie-conf "hoodie.upsert.shuffle.parallelism=2" \
--hoodie-conf "hoodie.insert.shuffle.parallelism=2" \
--hoodie-conf "hoodie.delete.shuffle.parallelism=2" \
--hoodie-conf "hoodie.bulkinsert.shuffle.parallelism=2" \
--hoodie-conf "hoodie.embed.timeline.server=true" \
--hoodie-conf "hoodie.filesystem.view.type=EMBEDDED_KV_STORE" \
--hoodie-conf "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator" \
--hoodie-conf "hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING" \
--hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ" \
--hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex=" \
--hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.timezone=" \
--hoodie-conf "hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd" \
--hoodie-conf "hoodie.deltastreamer.schemaprovider.registry.url=xxx" \
--hoodie-conf "xxx" \
--hoodie-conf "auto.offset.reset=earliest" \
--hoodie-conf "group.id=hudi_group" \
--hoodie-conf "schema.registry.url=xxx" \
--hoodie-conf "hoodie.parquet.small.file.limit=0" \
--hoodie-conf "hoodie.clustering.inline=true" \
--hoodie-conf "hoodie.clustering.inline.max.commits=4" \
--hoodie-conf "hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824" \
--hoodie-conf "hoodie.clustering.plan.strategy.small.file.limit=629145600" \
--hoodie-conf "hoodie.datasource.write.recordkey.field=id" \
--hoodie-conf "hoodie.datasource.write.partitionpath.field=date:TIMESTAMP" \
--hoodie-conf "hoodie.deltastreamer.source.kafka.topic=xxx"
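As a sanity check on the key generator settings above: incoming records carry a `date` value such as 2021-05-31T12:00:00+0000 (the example value is made up), and with DATE_STRING input and the configured output format we expect partition paths like 2021/05/31. A rough Python illustration of the intended conversion (this is not Hudi code, just the mapping we expect TimestampBasedKeyGenerator to perform):

```python
from datetime import datetime

# The two input formats from hoodie.deltastreamer.keygen.timebased.input.dateformat,
# translated from Java SimpleDateFormat syntax to strptime directives.
INPUT_FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S.%f%z"]

def partition_path(ts: str) -> str:
    """Mimic the intended DATE_STRING -> yyyy/MM/dd partition-path mapping."""
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(ts, fmt).strftime("%Y/%m/%d")
        except ValueError:
            continue
    raise ValueError(f"unparseable timestamp: {ts}")

print(partition_path("2021-05-31T12:00:00+0000"))  # 2021/05/31
```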
* Hudi version : 0.8.0
* Spark version : 2.4.7
* Storage (HDFS/S3/GCS..) : hdfs
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]