rohit-m-99 opened a new issue, #6335: URL: https://github.com/apache/hudi/issues/6335
**Describe the problem you faced**

Currently using the deltastreamer to ingest from one S3 bucket to another. In Hudi 0.10 I would use the upsert operation in the deltastreamer; when a new column was added to the schema, the target table would reflect it. However, in Hudi 0.11.1 using the insert operation, schema changes are not reflected in the target table, specifically the addition of nullable columns.

**To Reproduce**

Steps to reproduce the behavior:

1. Start the deltastreamer using the script below
2. Add a new nullable column to the source schema
3. Query the target table for the new column

```
spark-submit \
  --jars /opt/spark/jars/hudi-utilities-bundle.jar,/opt/spark/jars/hadoop-aws.jar,/opt/spark/jars/aws-java-sdk.jar \
  --master spark://spark-master:7077 \
  --total-executor-cores 20 \
  --executor-memory 4g \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /opt/spark/jars/hudi-utilities-bundle.jar \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-table per_tick_stats \
  --table-type COPY_ON_WRITE \
  --min-sync-interval-seconds 30 \
  --source-limit 250000000 \
  --continuous \
  --source-ordering-field $3 \
  --target-base-path $2 \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=$1 \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
  --hoodie-conf hoodie.datasource.write.recordkey.field=$4 \
  --hoodie-conf hoodie.datasource.write.precombine.field=$3 \
  --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=$5 \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=$6 \
  --hoodie-conf hoodie.clustering.inline=true \
  --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=100000000 \
  --hoodie-conf hoodie.clustering.inline.max.commits=4 \
  --hoodie-conf hoodie.metadata.enable=true \
  --hoodie-conf hoodie.metadata.index.column.stats.enable=true \
  --op INSERT
```

```
./deltastreamer.sh \
  s3a://simian-example-prod-output/stats/ingesting \
  s3a://simian-example-prod-output/stats/querying \
  STATOVYGIYLUMVSF6YLU \
  STATONUW25LMMF2GS33OL5ZHK3S7NFSA____,STATONUW2X3UNFWWK___ \
  STATONUW25LMMF2GS33OL5ZHK3S7NFSA____,STATMJQXIY3IL5ZHK3S7NFSA____
```

**Expected behavior**

The new nullable column should be present in the target table.

**Environment Description**

* Hudi version : 0.11.1
* Spark version : 3.1.2
* Hive version : 3.2.0
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : yes

**Additional context**

Initially used upsert but was unable to continue using it because of this issue: https://github.com/apache/hudi/issues/6015

**Stacktrace**

```
Add the stacktrace of the error.
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
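To make the expected behavior concrete, here is a minimal, hypothetical sketch (plain Python, not Hudi's actual implementation) of the schema reconciliation the reporter expects on insert: a nullable column present in the incoming batch but missing from the target table should be appended to the target schema, while a missing non-nullable column should be rejected. The column names and the `reconcile_schema` helper below are illustrative assumptions.

```python
def reconcile_schema(target_fields, incoming_fields):
    """Append incoming fields missing from the target, but only if nullable.

    Both arguments map column name -> {"type": ..., "nullable": ...}.
    """
    merged = dict(target_fields)
    for name, spec in incoming_fields.items():
        if name not in merged:
            if not spec.get("nullable", False):
                # Adding a required column to existing rows is not backfillable
                raise ValueError(f"cannot add non-nullable column {name!r}")
            merged[name] = spec
    return merged

# Target table schema before the change, plus an incoming batch that
# carries one newly added nullable column (names are hypothetical).
target = {"tick_id": {"type": "long", "nullable": False}}
incoming = {
    "tick_id": {"type": "long", "nullable": False},
    "latency_ms": {"type": "double", "nullable": True},  # newly added column
}

evolved = reconcile_schema(target, incoming)
# The evolved schema now includes "latency_ms", which is what the reporter
# expects the target table to reflect after the next commit.
```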
