[GitHub] [hudi] chrischnweiss opened a new issue #4946: [SUPPORT] TimestampBasedKeyGenerator failed to parse date_string column

GitBox Thu, 03 Mar 2022 08:30:43 -0800


chrischnweiss opened a new issue #4946:
URL: https://github.com/apache/hudi/issues/4946



   **Describe the problem you faced**
   
   Hey guys, 
   actually I am trying to use DeltaStreamer along with a CustomKeyGenerator to 
use ComplexKeyGenerator and TimeBasedKeyGenerator together. Now I am struggling 
with DateFormat Exception and I can´t figure out why. Maybe you can help me?
   
   These are my config properties for the partitoning:
   
   ```
   hoodie.datasource.write.recordkey.field=id_1,id_2
   hoodie.datasource.write.partitionpath.field=timestamp_col:timestamp
   hoodie.datasource.write.hive_style_partitioning=true
   
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
   hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING
   hoodie.deltastreamer.keygen.timebased.input.dateformat="yyyy-MM-dd 
HH:mm:ss.SSS Z"
   hoodie.deltastreamer.keygen.timebased.output.dateformat="yyyy/MM/dd"
   ```
   
   **Expected behavior**
   
   I want to parse my date_string as input to partition the target dataset.
   My timestamp_col looks like this: `2022-01-13 16:57:05.659 +01:00`
   
   **Environment Description**
   
   Hudi version :
   0.10.0
   
   Spark version :
   3.1.1
   
   Hive version :
   3.1.3000
   
   Hadoop version :
   3.1.1
   
   Storage (HDFS/S3/GCS..) :
   HDFS
   
   Running on Docker? (yes/no) :
   no
   
   
   **Stacktrace**
   
   Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost 
task 0.3 in stage 2.0 (TID 11) (t-worker.node.asdasd.de executor 1): 
org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input 
partition field :2022-01-13 16:57:05.659 +01:00
           at 
org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:135)
           at 
org.apache.hudi.keygen.CustomAvroKeyGenerator.getPartitionPath(CustomAvroKeyGenerator.java:89)
           at 
org.apache.hudi.keygen.CustomKeyGenerator.getPartitionPath(CustomKeyGenerator.java:68)
           at 
org.apache.hudi.keygen.BaseKeyGenerator.getKey(BaseKeyGenerator.java:62)
           at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$d62e16$1(DeltaSync.java:453)
           at 
org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
           at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
           at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:222)
           at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
           at 
org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440)
           at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
           at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
           at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
           at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
           at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
           at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
           at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
           at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
           at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
           at org.apache.spark.scheduler.Task.run(Task.scala:131)
           at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
           at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
           at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.IllegalArgumentException: Invalid format: "2022-01-13 
16:57:05.659 +01:00"
           at 
org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:945)
           at 
org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:202)
           at 
org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133)
           ... 30 more
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] chrischnweiss opened a new issue #4946: [SUPPORT] TimestampBasedKeyGenerator failed to parse date_string column

Reply via email to