forest455 opened a new issue #4200:
URL: https://github.com/apache/hudi/issues/4200
HoodieDeltaStreamer config (excerpt):
hoodie.datasource.write.recordkey.field=seq_no
hoodie.datasource.write.partitionpath.field=tran_date
hoodie.datasource.write.precombine.field=
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.datasource.hive_sync.database=ods
hoodie.datasource.hive_sync.table=hudi_ods_tran
hoodie.datasource.hive_sync.partition_fields=tran_date_str
hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=days
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM-dd
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
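For context, here is a minimal sketch (my own illustration, not Hudi's actual TimestampBasedKeyGenerator code) of what the SCALAR/days settings above describe: a numeric day count since the Unix epoch in the partition field is rendered with the configured `output.dateformat` to build the partition path.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Sketch only: with timestamp.type=SCALAR, time.unit=days, and
// output.dateformat=yyyy-MM-dd, a scalar day count since 1970-01-01
// becomes a formatted date string used as the partition path.
object ScalarDaysDemo {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")

  def partitionPath(daysSinceEpoch: Long): String =
    LocalDate.ofEpochDay(daysSinceEpoch).format(fmt)

  def main(args: Array[String]): Unit = {
    // 18631 days after the epoch is 2021-01-04, the value seen in the error below.
    println(partitionPath(18631L))
  }
}
```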
After DeltaStreamer finished, the Hudi table metadata was synced to Hive. As the excerpt below shows, the partition field `tran_date_str` is correct:
CREATE EXTERNAL TABLE `hudi_ods_tran`(
...
`_hoodie_is_deleted` boolean)
PARTITIONED BY (
`tran_date_str` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES
...
But when I try to query this table with Spark SQL, using code like the following:
val spark = SparkSession.builder
  .config(jssc.getConf)
  .config("spark.sql.catalogImplementation", "hive")
  .enableHiveSupport()
  .getOrCreate()
spark.sql("select count(*) from ods.hudi_ods_tran").show()
I get an error like this:
...
Exception in thread "main" org.sparkproject.guava.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Failed to cast value `2021-01-04` to `IntegerType` for partition column `tran_date`
    at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
    at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
    at org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getCachedPlan(SessionCatalog.scala:155)
    at org.apache.spark.sql.execution.datasources.FindDataSourceTable.org$apache$spark$sql$execution$datasources$FindDataSourceTable$$readDataSourceTable(DataSourceStrategy.scala:249)
    at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:288)
    at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:278)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
    at org.apache.spark.sql.c
...
Caused by: java.lang.RuntimeException: Failed to cast value `2021-01-04` to `IntegerType` for partition column `tran_date`
    at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitionColumn(PartitioningUtils.scala:313)
    at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartition(PartitioningUtils.scala:251)
    at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.parsePartition(Spark3ParsePartitionUtil.scala:39)
    at org.apache.hudi.HoodieFileIndex.$anonfun$getAllQueryPartitionPaths$3(HoodieFileIndex.scala:486)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.map(TraversableLike.scala:238)
I have checked other tables without a partition field, and they work.
I would really appreciate it if anyone can help solve this problem.
My environment is Spark 3.1.2, Hadoop 3.3, Hive 3.1.2.
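For what it's worth, the failure mode in the trace can be shown in isolation: partition parsing tries to cast each directory value to the column's schema type, and a yyyy-MM-dd string is not a valid integer. A simplified stand-in for that mechanism (my own sketch, not Spark's actual PartitioningUtils code):

```scala
// Sketch of the partition-value cast that fails in the trace above:
// the schema says tran_date is IntegerType, but the partition
// directory holds the formatted date string.
object PartitionCastDemo {
  def castToInt(raw: String): Either[String, Int] =
    scala.util.Try(raw.toInt).toOption
      .toRight(s"Failed to cast value `$raw` to `IntegerType`")

  def main(args: Array[String]): Unit = {
    println(castToInt("18631"))      // a scalar day count casts fine
    println(castToInt("2021-01-04")) // a formatted date string does not
  }
}
```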
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]