rufferjr opened a new issue #1923:
URL: https://github.com/apache/hudi/issues/1923


   **Describe the problem you faced**
   
   (Possible related issue: https://github.com/apache/hudi/issues/1790) When 
creating a new table with a decimal-typed partition column, Hudi fails to sync 
with the Hive metastore. The container log stacktrace for this issue can be 
found in [this 
gist](https://gist.github.com/rufferjr/e6d2955eb3eb1edb321e10c9a91e021c). This 
was tested with a hudi-spark-bundle built from commit 539621bd.
   
   **To Reproduce**
   
   I was able to reproduce this error by running the following code:
   
   ```
   import org.apache.spark.sql.{Row, SaveMode}
   import org.apache.spark.sql.functions.expr
   import org.apache.spark.sql.types.{Decimal, DecimalType, IntegerType, StringType, StructField, StructType}

   val hudiOptions =
       Map[String, String](
         "hoodie.table.name" -> "test_table",
         "hoodie.datasource.write.table.name" -> "test_table",
         "hoodie.consistency.check.enabled" -> "true",
         "hoodie.compact.inline.max.delta.commits" -> "12",
         "hoodie.compact.inline" -> "true",
         "hoodie.clean.automatic" -> "true",
         "hoodie.cleaner.commits.retained" -> "1",
         "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
         "hoodie.datasource.write.recordkey.field" -> "pk",
         "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
         "hoodie.datasource.write.partitionpath.field" -> "_partition_col",
         "hoodie.datasource.write.precombine.field" -> "sort_k",
         "hoodie.bulkinsert.shuffle.parallelism" -> "1800",
         "hoodie.parquet.max.file.size" -> String.valueOf(500 * 1024 * 1024), // 500 MB
         "hoodie.datasource.hive_sync.enable" -> "true",
         "hoodie.datasource.hive_sync.database" -> "test_vault",
         "hoodie.datasource.hive_sync.table" -> "test_table",
         "hoodie.datasource.hive_sync.partition_fields" -> "_partition_col",
         "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
         "hoodie.datasource.hive_sync.jdbcurl" -> s"jdbc:hive2://${hiveServer2URI}:10000"
       )

   spark.sql("CREATE DATABASE IF NOT EXISTS test_vault")
   spark.sql("USE test_vault")
   spark.sql("DROP TABLE IF EXISTS test_table_ro")
   spark.sql("DROP TABLE IF EXISTS test_table_rt")

   val data = Seq(Row("A", Decimal(70646037), 2))
   val schema = List(
     StructField("pk", StringType, true),
     StructField("partition_col", DecimalType(3, 0), true),
     StructField("sort_k", IntegerType, true))
   var df = spark.createDataFrame(sc.parallelize(data), StructType(schema))

   df.withColumn("_partition_col", expr("MOD(FLOOR(partition_col / 20000), 100)"))
     .write.format("org.apache.hudi")
     .option("hoodie.datasource.write.operation", "upsert")
     .options(hudiOptions)
     .mode(SaveMode.Overwrite)
     .save("%s/%s/%s".format("s3://hudi-test-bucket/data", "test_vault", "test_table"))
   ```
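   For context on the repro above, the partition expression maps the sample decimal value to a small integer bucket. A minimal standalone sketch of the same arithmetic in plain Scala (no Spark; `partitionFor` is a hypothetical helper, not part of the repro):

   ```scala
   // Standalone sketch of the partition expression used in the repro:
   // MOD(FLOOR(partition_col / 20000), 100), applied to the sample value 70646037.
   // Integer division floors for positive inputs, so Long arithmetic suffices here.
   def partitionFor(value: Long): Long = (value / 20000) % 100

   val p = partitionFor(70646037L)
   println(p)  // 70646037 / 20000 = 3532 (floored), 3532 % 100 = 32
   ```

   So the single sample row lands in partition path `_partition_col=32`, which is the decimal-typed value that Hive sync then trips over.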
   
   **Expected behavior**
   
   I would expect Hudi to handle DecimalType partition columns and sync them to 
Hive, rather than failing as above.
   
   **Environment Description**
   
   * Hudi version : 0.6.0-SNAPSHOT
   
   * Spark version : 2.11
   
   * Hive version : 2.3.6
   
   * Hadoop version : 2.8.5
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   **Stacktrace**
   
   The full stacktrace is in this 
[gist](https://gist.github.com/rufferjr/e6d2955eb3eb1edb321e10c9a91e021c).
   
   

