[jira] [Comment Edited] (HUDI-4430) Incorrect type casting while reading HUDI table created with CustomKeyGenerator and unixtimestamp paritioning field

Rajesh Mahindra (Jira) Thu, 21 Jul 2022 23:31:12 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569833#comment-17569833
 ]


Rajesh Mahindra edited comment on HUDI-4430 at 7/22/22 6:30 AM:
----------------------------------------------------------------

Looks like your input column is of type string, did you try with the following 
config:

 
|{{hoodie.deltastreamer.keygen.timebased.timestamp.type="DATE_STRING"}}|

 

and 

 
|{{hoodie.deltastreamer.keygen.timebased.input.dateformat="yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ,yyyyMMdd"}}|

depending on the actual format


was (Author: rmahindra):
Looks like your input column is of type string, did you try with the following 
config:

 
|{{hoodie.deltastreamer.keygen.timebased.timestamp.type="DATE_STRING"}}|

 

 

> Incorrect type casting while reading HUDI table created with 
> CustomKeyGenerator and unixtimestamp paritioning field
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-4430
>                 URL: https://issues.apache.org/jira/browse/HUDI-4430
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Volodymyr Burenin
>            Priority: Critical
>
> Hi,
> I have discovered an issue that doesn't play nicely with the custom key 
> generatosr, basically anything that is not TimestampBasedKeyGenerator or 
> TimestampBasedAvroKeyGenerator.
> {{While trying to read a table that was created with these parameters(the 
> rest don't matter):}}
> {code:java}
> hoodie.datasource.write.recordkey.field=query_id,event_type
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
> hoodie.datasource.write.partitionpath.field=create_time_epoch_seconds:timestamp
> hoodie.deltastreamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd{code}
> {color:#172b4d}I get and error that looks like:{color}
> {code:java}
> 22/07/20 20:32:48 DEBUG Spark32HoodieParquetFileFormat: Appending 
> StructType(StructField(create_time_epoch_seconds,LongType,true)) [2022/07/13]
> 22/07/20 20:32:48 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
> java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
> be cast to java.lang.Long
>     at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
>     at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong(rows.scala:42)
>     at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong$(rows.scala:42)
>     at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getLong(rows.scala:195)
>     at 
> org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:66)
>     at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:245)
>  {code}
> Apparently the issue is in _partitionSchemaFromProperties function in file: 
> hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala
> that checks for a class type it uses StructType of String for.
> Once it is any non Timestamp based known class it basically uses whatever 
> type it is and then fails to retrieve the value for.
> I have a proposal here which we probably need: Give a user a way to force a 
> string type if needed and add ability to add a prefixed column that contains 
> a processed partition value. It could be done as two separate features.
> This problem is critical for me, so I have to change Hoodie source code on my 
> end temporary to make it work.
> Here is how I roughly changed the referenced function:
>  
> {code:java}
> /**
>  * Get the partition schema from the hoodie.properties.
>  */
> private lazy val _partitionSchemaFromProperties: StructType = {
>   val tableConfig = metaClient.getTableConfig
>   val partitionColumns = tableConfig.getPartitionFields
>   if (partitionColumns.isPresent) {
>     val partitionFields = partitionColumns.get().map(column => 
> StructField("_hoodie_"+column, StringType))
>     StructType(partitionFields)
>   } else {
>     // If the partition columns have not stored in hoodie.properties(the 
> table that was
>     // created earlier), we trait it as a non-partitioned table.
>     logWarning("No partition columns available from hoodie.properties." +
>       " Partition pruning will not work")
>     new StructType()
>   }
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (HUDI-4430) Incorrect type casting while reading HUDI table created with CustomKeyGenerator and unixtimestamp paritioning field

Reply via email to