Virmaline commented on issue #6278:
URL: https://github.com/apache/hudi/issues/6278#issuecomment-1353791597

   @alexeykudinkin
   
   Hey Alexey, 
   
   I'm also still getting the same error after updating to 0.12.1.
   
   Hudi: 0.12.1-amzn-0-SNAPSHOT
   Spark: 3.3.0
   EMR: 6.9.0
   
   ```
   spark-submit \
     --master yarn \
     --deploy-mode cluster \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --conf spark.sql.parquet.datetimeRebaseModeInRead=CORRECTED \
     --conf spark.sql.parquet.datetimeRebaseModeInWrite=CORRECTED \
     --conf spark.sql.avro.datetimeRebaseModeInRead=CORRECTED \
     --conf spark.sql.avro.datetimeRebaseModeInWrite=CORRECTED \
     --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
     --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
     --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
     --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     /usr/lib/hudi/hudi-utilities-bundle.jar \
     --table-type COPY_ON_WRITE \
     --source-ordering-field replicadmstimestamp \
     --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
     --target-base-path s3://bucket/folder/folder/table \
     --target-table table \
     --payload-class org.apache.hudi.common.model.AWSDmsAvroPayload \
     --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator \
     --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
     --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM \
     --hoodie-conf "hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd HH:mm:ss.SSSSSS" \
     --hoodie-conf hoodie.datasource.write.recordkey.field=_id \
     --hoodie-conf hoodie.datasource.write.partitionpath.field=replicadmstimestamp \
     --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://bucket/folder/folder/table
   ```

   (Note: I've split the Spark properties into one `--conf` flag each here; `spark-submit` expects a single `key=value` per `--conf`, not a comma-separated list.)
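
   For context on the key generator settings: the input/output dateformat pair should bucket each record into a monthly partition. A quick Python sanity check of that mapping (the sample timestamp is made up; Java's `yyyy-MM-dd HH:mm:ss.SSSSSS` pattern roughly corresponds to Python's `%Y-%m-%d %H:%M:%S.%f`):

   ```python
   from datetime import datetime

   # Hypothetical value for the replicadmstimestamp column
   raw = "2022-05-17 08:31:22.123456"

   # input.dateformat: yyyy-MM-dd HH:mm:ss.SSSSSS
   parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")

   # output.dateformat: yyyy-MM -> one partition per month
   partition_path = parsed.strftime("%Y-%m")
   print(partition_path)  # 2022-05
   ```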
   
   I've tried just about every combination of the datetimeRebaseMode settings I can think of, and the result is always the same.
   
   Stack trace attached. Is there any possible workaround for this? I currently have a separate process that rewrites the timestamp columns before ingestion, which works, but it adds a lot of overhead to the pipeline.
   
   
[stacktrace.txt](https://github.com/apache/hudi/files/10241150/stacktrace.txt)
   
   

