Lokesh Jain created HUDI-8311:
---------------------------------
Summary: Support multi part partition format with hive
Key: HUDI-8311
URL: https://issues.apache.org/jira/browse/HUDI-8311
Project: Apache Hudi
Issue Type: Bug
Reporter: Lokesh Jain
Currently, a multi-part partition format such as YYYY/MM/DD fails when syncing
with Hive. This Jira aims to add a fix so that such a format is supported.
Steps to reproduce: The table created below uses a custom key generator that
combines the simple and timestamp key generators. The timestamp key generator
produces output in the format YYYY/MM/DD.
{code:java}
import org.apache.hudi.HoodieSparkUtils
import org.apache.hudi.common.config.TypedProperties
import org.apache.hudi.common.util.StringUtils
import org.apache.hudi.exception.HoodieException
import org.apache.hudi.functional.TestSparkSqlWithCustomKeyGenerator._
import org.apache.hudi.testutils.HoodieClientTestUtils.createMetaClient
import org.apache.hudi.util.SparkKeyGenUtils
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat
import org.junit.jupiter.api.Assertions.{assertEquals, assertFalse, assertTrue}
import org.slf4j.LoggerFactory
val df = spark.sql(
  s"""SELECT 1 as id, 'a1' as name, 1.6 as price, 1704121827 as ts, 'cat1' as segment
     | UNION
     | SELECT 2 as id, 'a2' as name, 10.8 as price, 1704121827 as ts, 'cat1' as segment
     | UNION
     | SELECT 3 as id, 'a3' as name, 30.0 as price, 1706800227 as ts, 'cat1' as segment
     | UNION
     | SELECT 4 as id, 'a4' as name, 103.4 as price, 1701443427 as ts, 'cat2' as segment
     | UNION
     | SELECT 5 as id, 'a5' as name, 1999.0 as price, 1704121827 as ts, 'cat2' as segment
     | UNION
     | SELECT 6 as id, 'a6' as name, 80.0 as price, 1704121827 as ts, 'cat3' as segment
     |""".stripMargin)
df.write.format("hudi")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.CustomAvroKeyGenerator")
  .option("hoodie.datasource.write.partitionpath.field", "segment:simple,ts:timestamp")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "name")
  .option("hoodie.table.name", "hudi_table_2")
  .option("hoodie.insert.shuffle.parallelism", "1")
  .option("hoodie.upsert.shuffle.parallelism", "1")
  .option("hoodie.bulkinsert.shuffle.parallelism", "1")
  .option("hoodie.keygen.timebased.timestamp.type", "SCALAR")
  .option("hoodie.keygen.timebased.output.dateformat", "yyyy/MM/DD")
  .option("hoodie.keygen.timebased.timestamp.scalar.time.unit", "seconds")
  .mode(SaveMode.Overwrite)
  .save("/user/hive/warehouse/hudi_table_2")
// Sync with Hive (run the following from a shell, not from Spark)
/var/hoodie/ws/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
--jdbc-url jdbc:hive2://hiveserver:10000 \
--user hive \
--pass hive \
--partitioned-by segment,ts \
--base-path /user/hive/warehouse/hudi_table_2 \
--database default \
--table hudi_table_2 \
--partition-value-extractor org.apache.hudi.hive.MultiPartKeysValueExtractor
{code}
The Hive sync then fails with the following exception:
{code:java}
2024-10-06 14:33:44,200 INFO [main] hive.metastore (HiveMetaStoreClient.java:close(564)) - Closed a connection to metastore, current connections: 0
Exception in thread "main" org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing hudi_table_2
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:180)
    at org.apache.hudi.hive.HiveSyncTool.main(HiveSyncTool.java:547)
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: failed to sync the table hudi_table_2_ro
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:272)
    at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:203)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:177)
    ... 1 more
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table hudi_table_2_ro
    at org.apache.hudi.hive.HiveSyncTool.syncAllPartitions(HiveSyncTool.java:474)
    at org.apache.hudi.hive.HiveSyncTool.validateAndSyncPartitions(HiveSyncTool.java:321)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:261)
    ... 3 more
Caused by: java.lang.IllegalArgumentException: Partition key parts [segment, ts] does not match with partition values [cat1, 2024, 01, 01]. Check partition strategy.
    at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42)
    at org.apache.hudi.hive.ddl.QueryBasedDDLExecutor.getPartitionClause(QueryBasedDDLExecutor.java:191)
    at org.apache.hudi.hive.ddl.QueryBasedDDLExecutor.constructAddPartitions(QueryBasedDDLExecutor.java:164)
    at org.apache.hudi.hive.ddl.QueryBasedDDLExecutor.addPartitionsToTable(QueryBasedDDLExecutor.java:124)
    at org.apache.hudi.hive.HoodieHiveSyncClient.addPartitionsToTable(HoodieHiveSyncClient.java:118)
    at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:516)
    at org.apache.hudi.hive.HiveSyncTool.syncAllPartitions(HiveSyncTool.java:470)
    ... 5 more
{code}
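The mismatch in the last exception can be reproduced in isolation. The sketch
below is only an illustration and does not use Hudi's code: the object
PartitionMismatchSketch and the helpers splitAll and splitWithLimit are
hypothetical names. It assumes the extractor splits the partition path on every
"/", which is consistent with the four values reported in the exception. The
limited split is one possible way to line up values with keys when the
timestamp field comes last, not necessarily the fix that will be adopted here.
{code:scala}
// Hypothetical illustration; this is not Hudi's implementation.
object PartitionMismatchSketch {
  // Partition columns passed to the sync tool via --partitioned-by
  val partitionKeys: Seq[String] = Seq("segment", "ts")

  // Partition path matching the values shown in the exception
  val partitionPath: String = "cat1/2024/01/01"

  // Assumption: every "/"-separated path segment becomes one partition value
  def splitAll(path: String): Seq[String] = path.split("/").toSeq

  // Assumed alternative: split into at most numKeys parts, so the last key
  // keeps the remainder of the path, including its slashes
  def splitWithLimit(path: String, numKeys: Int): Seq[String] =
    path.split("/", numKeys).toSeq

  def main(args: Array[String]): Unit = {
    val all = splitAll(partitionPath)
    // Four values for two keys: the condition the validation rejects
    println(s"keys=${partitionKeys.size}, values=${all.size}: ${all.mkString("[", ", ", "]")}")

    val limited = splitWithLimit(partitionPath, partitionKeys.size)
    // Two values for two keys
    println(s"keys=${partitionKeys.size}, values=${limited.size}: ${limited.mkString("[", ", ", "]")}")
  }
}
{code}
With the limited split, segment would map to cat1 and ts to 2024/01/01. This
only works when the multi-part field is the last partition field; a general
fix would need to know how many path segments each field produces.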