SubashRanganathan opened a new issue, #6304:
URL: https://github.com/apache/hudi/issues/6304
Issue: Hudi MultiTable Deltastreamer not updating the Glue catalog when a new column is added on the source.
Hudi Version: 0.11
Description:
The Hudi MultiTableDeltaStreamer job does not update the AWS Glue Catalog when a new column is added to the files ingested by DMS. The following property is set in the Hudi configuration:
"hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool"
The Hudi table.properties are as follows:
hoodie.datasource.write.recordkey.field=col_name
hoodie.datasource.write.precombine.field=ts
hoodie.datasource.write.partitionpath.field=""
hoodie.deltastreamer.ingestion.targetBasePath=s3://hudi_s3_bucket_name/hudi/hudi_table_name/
hoodie.deltastreamer.source.dfs.root=s3://s3_dms_landing_bucket_name/source_folder
hoodie.avro.schema.validate=false
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.auto_create_database=true
hoodie.datasource.hive_sync.database=database_name
hoodie.datasource.hive_sync.table=table_name
hoodie.datasource.hive_sync.partition_fields=""
hoodie.schema.on.read.enable=true
hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
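These Java-style .properties files are plain key=value pairs. As a quick sanity check (a minimal stdlib-Python sketch, not part of the reported setup; the sample keys are copied from the configuration above), the file can be parsed to confirm the Glue sync tool class is actually being picked up:

```python
def parse_properties(text: str) -> dict:
    """Parse Java-style key=value properties, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

sample = """
hoodie.datasource.hive_sync.enable=true
hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
"""

props = parse_properties(sample)
print(props["hoodie.meta.sync.client.tool.class"])
# -> org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
```

A mistyped key (for example a stray `:` instead of `=`, or a trailing space in the value) silently changes what the job reads, so dumping the parsed map is a cheap first debugging step.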
The common config properties are as follows:
hoodie.deltastreamer.ingestion.tablesToBeIngested=table_name
hoodie.deltastreamer.ingestion.default.res_bestand.configFile=file:///home/hadoop/hudi/config-table_name.properties
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.auto_create_database=true
hoodie.schema.on.read.enable=true
hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
The DeltaStreamer spark-submit command:
spark-submit --jars "/usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hudi/hudi-spark-bundle_2.12-0.11.1.jar" \
--conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
--conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
--conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
--conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
--conf spark.sql.parquet.mergeSchema=true \
--class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer \
--master yarn --deploy-mode client \
/home/hadoop/hudi/hudi-utilities-bundle_2.12-0.11.1.jar \
--table-type COPY_ON_WRITE \
--props "file:///home/hadoop/hudi/config-source.properties" \
--config-folder "file:///home/hadoop/hudi/" \
--base-path-prefix "s3://hudi_s3_bucket_name/hudi/" \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--source-ordering-field event_ts \
--
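To illustrate the schema change the meta sync is expected to propagate, here is a hedged stdlib-Python sketch (the record and field names, including "new_col", are hypothetical stand-ins, not from the actual DMS source) that diffs an old and a new Avro-style schema to find the added column:

```python
import json

# Hypothetical before/after record schemas; "new_col" stands in for the
# column DMS added upstream that never reaches the Glue catalog.
old_schema = json.loads("""
{"type": "record", "name": "tbl", "fields": [
  {"name": "col_name", "type": "string"},
  {"name": "ts", "type": "long"}]}
""")
new_schema = json.loads("""
{"type": "record", "name": "tbl", "fields": [
  {"name": "col_name", "type": "string"},
  {"name": "ts", "type": "long"},
  {"name": "new_col", "type": ["null", "string"], "default": null}]}
""")

def field_names(schema: dict) -> set:
    """Collect the top-level field names of an Avro record schema."""
    return {f["name"] for f in schema["fields"]}

added = sorted(field_names(new_schema) - field_names(old_schema))
print(added)  # -> ['new_col']
```

After a successful sync these are the columns the Glue table should gain; comparing the latest commit schema against the Glue table's column list in this way is one way to confirm that the sync is silently skipping the schema update.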