SubashRanganathan opened a new issue, #6304:
URL: https://github.com/apache/hudi/issues/6304
Issue: Hudi MultiTable Deltastreamer not updating the Glue catalog when a new column is added on the source.
Hudi Version: 0.11
Description:
The Hudi MultiTableDeltaStreamer job does not update the AWS Glue Catalog when a new column is added to the files ingested by DMS. The following property is set in the Hudi configuration:
"hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool"
The Hudi table.properties are as follows:
hoodie.datasource.write.recordkey.field=col_name
hoodie.datasource.write.precombine.field=ts
hoodie.datasource.write.partitionpath.field=""
hoodie.deltastreamer.ingestion.targetBasePath=s3://hudi_s3_bucket_name/hudi/hudi_table_name/
hoodie.deltastreamer.source.dfs.root=s3://s3_dms_landing_bucket_name/source_folder
hoodie.avro.schema.validate=false
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.auto_create_database=true
hoodie.datasource.hive_sync.database=database_name
hoodie.datasource.hive_sync.table=table_name
hoodie.datasource.hive_sync.partition_fields=""
hoodie.schema.on.read.enable=true
hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
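These Java-style .properties files are plain key=value pairs. As a quick sanity check (a minimal stdlib-Python sketch, not part of the reported setup; the sample keys are copied from the configuration above), the file can be parsed to confirm the Glue sync tool class is actually being picked up:

```python
def parse_properties(text: str) -> dict:
    """Parse Java-style key=value properties, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

sample = """
hoodie.datasource.hive_sync.enable=true
hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
"""

props = parse_properties(sample)
print(props["hoodie.meta.sync.client.tool.class"])
# -> org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
```

A mistyped key (for example a stray `:` instead of `=`, or a trailing space in the value) silently changes what the job reads, so dumping the parsed map is a cheap first debugging step.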
The common config properties are as follows:
hoodie.deltastreamer.ingestion.tablesToBeIngested=table_name
hoodie.deltastreamer.ingestion.default.res_bestand.configFile=file:///home/hadoop/hudi/config-table_name.properties
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.auto_create_database=true
hoodie.schema.on.read.enable=true
hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
The DeltaStreamer spark-submit command:
spark-submit --jars "/usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hudi/hudi-spark-bundle_2.12-0.11.1.jar" \
--conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
--conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
--conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
--conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
--conf spark.sql.parquet.mergeSchema=true \
--class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer \
--master yarn --deploy-mode client \
/home/hadoop/hudi/hudi-utilities-bundle_2.12-0.11.1.jar \
--table-type COPY_ON_WRITE \
--props "file:///home/hadoop/hudi/config-source.properties" \
--config-folder "file:///home/hadoop/hudi/" \
--base-path-prefix "s3://hudi_s3_bucket_name/hudi/" \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--source-ordering-field event_ts \
--
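To illustrate the schema change the meta sync is expected to propagate, here is a hedged stdlib-Python sketch (the record and field names, including "new_col", are hypothetical stand-ins, not from the actual DMS source) that diffs an old and a new Avro-style schema to find the added column:

```python
import json

# Hypothetical before/after record schemas; "new_col" stands in for the
# column DMS added upstream that never reaches the Glue catalog.
old_schema = json.loads("""
{"type": "record", "name": "tbl", "fields": [
  {"name": "col_name", "type": "string"},
  {"name": "ts", "type": "long"}]}
""")
new_schema = json.loads("""
{"type": "record", "name": "tbl", "fields": [
  {"name": "col_name", "type": "string"},
  {"name": "ts", "type": "long"},
  {"name": "new_col", "type": ["null", "string"], "default": null}]}
""")

def field_names(schema: dict) -> set:
    """Collect the top-level field names of an Avro record schema."""
    return {f["name"] for f in schema["fields"]}

added = sorted(field_names(new_schema) - field_names(old_schema))
print(added)  # -> ['new_col']
```

After a successful sync these are the columns the Glue table should gain; comparing the latest commit schema against the Glue table's column list in this way is one way to confirm that the sync is silently skipping the schema update.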