xqy179 opened a new issue #2234:
URL: https://github.com/apache/hudi/issues/2234
**Describe the problem you faced**
I write a Hudi table with the Spark datasource API. The table contains the
following three fields, "year,month,day", and I use them as the partition
keys. The partition value extractor is
"org.apache.hudi.hive.MultiPartKeysValueExtractor", and
HIVE_SYNC_ENABLED_OPT_KEY is set to "true". After the data is written to
storage [hdfs/s3], Hudi syncs the table partitions to the Hive table
automatically. The problem happens during this Hive sync step. By the
way, my Hudi version is 0.6.1.
**Additional context**
After tracing the code, I think this is a bug in the following
**'syncPartitions'** code. For example, suppose the Hive/Hudi table already
contains partition 'year=2020/month=01/day=05'. When you write a new partition
'year=2020/month=05/day=01' to the table, it throws an error. In the
**getPartitionEvents** method, the partition values are sorted before they are
joined into a lookup key, so 'year=2020/month=01/day=05' is turned into
"01, 05, 2020" and 'year=2020/month=05/day=01' is turned into "01, 05, 2020"
too. The new partition 'year=2020/month=05/day=01' is therefore treated as an
update event although it is actually a new partition, and the following step
updates the table partitions using a statement such as **"ALTER TABLE `orders`
PARTITION (`year`='2020',`month`='11',`day`='01') SET LOCATION ..."**.
Updating a partition that does not exist throws the error.
I think I can fix the error by commenting out the following two lines without
affecting other features. Can I?
```
List<PartitionEvent> getPartitionEvents()
...
//Collections.sort(hivePartitionValues);
...
//Collections.sort(storagePartitionValues);
...
```
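The collision described above can be reproduced outside Hudi. The following is a minimal sketch that uses only the JDK: `SortedKeyCollision` and `sortedKey` are hypothetical names chosen for illustration, and the method only imitates the sort-then-join step of `getPartitionEvents`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedKeyCollision {
    // Hypothetical helper: imitates the sort-then-join key construction.
    static String sortedKey(List<String> partitionValues) {
        List<String> copy = new ArrayList<>(partitionValues);
        Collections.sort(copy); // lexicographic sort discards the year/month/day order
        return String.join(", ", copy);
    }

    public static void main(String[] args) {
        // Values of year=2020/month=01/day=05 and year=2020/month=05/day=01
        String existing = sortedKey(Arrays.asList("2020", "01", "05"));
        String incoming = sortedKey(Arrays.asList("2020", "05", "01"));
        System.out.println(existing);                  // 01, 05, 2020
        System.out.println(incoming);                  // 01, 05, 2020
        System.out.println(existing.equals(incoming)); // true
    }
}
```

Both partitions map to the same key, so the lookup cannot tell them apart.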
```
/**
 * Iterate over the storage partitions and find if there are any new partitions that need to be added or updated.
 * Generate a list of PartitionEvent based on the changes required.
 */
List<PartitionEvent> getPartitionEvents(List<Partition> tablePartitions, List<String> partitionStoragePartitions) {
  Map<String, String> paths = new HashMap<>();
  for (Partition tablePartition : tablePartitions) {
    List<String> hivePartitionValues = tablePartition.getValues();
    Collections.sort(hivePartitionValues); // **Maybe there is a bug here!**
    String fullTablePartitionPath =
        Path.getPathWithoutSchemeAndAuthority(new Path(tablePartition.getSd().getLocation())).toUri().getPath();
    paths.put(String.join(", ", hivePartitionValues), fullTablePartitionPath);
  }
  List<PartitionEvent> events = new ArrayList<>();
  for (String storagePartition : partitionStoragePartitions) {
    Path storagePartitionPath = FSUtils.getPartitionPath(syncConfig.basePath, storagePartition);
    String fullStoragePartitionPath =
        Path.getPathWithoutSchemeAndAuthority(storagePartitionPath).toUri().getPath();
    // Check if the partition values or if hdfs path is the same
    List<String> storagePartitionValues = partitionValueExtractor.extractPartitionValuesInPath(storagePartition);
    Collections.sort(storagePartitionValues); // **Maybe there is a bug here!**
    if (!storagePartitionValues.isEmpty()) {
      String storageValue = String.join(", ", storagePartitionValues);
      if (!paths.containsKey(storageValue)) {
        events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
      } else if (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
        events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
      }
    }
  }
  return events;
}
```
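For comparison, here is a minimal sketch of the same add/update classification with the key built in partition-column order, that is, without the sort. It does not use Hudi or Hive classes: `OrderedPartitionDiff`, `orderedKey`, `diff`, `EventType`, and the `/tbl/...` paths are all hypothetical stand-ins. It only illustrates the idea and does not show that removing the sort is safe for every partition value extractor.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OrderedPartitionDiff {
    enum EventType { ADD, UPDATE }

    // Hypothetical helper: join the values in partition-column order, with no sort.
    static String orderedKey(List<String> partitionValues) {
        return String.join(", ", partitionValues);
    }

    // Both maps go from partition values to the partition's path.
    static Map<String, EventType> diff(Map<List<String>, String> hivePartitions,
                                       Map<List<String>, String> storagePartitions) {
        Map<String, String> paths = new HashMap<>();
        for (Map.Entry<List<String>, String> e : hivePartitions.entrySet()) {
            paths.put(orderedKey(e.getKey()), e.getValue());
        }
        Map<String, EventType> events = new LinkedHashMap<>();
        for (Map.Entry<List<String>, String> e : storagePartitions.entrySet()) {
            String key = orderedKey(e.getKey());
            if (!paths.containsKey(key)) {
                events.put(key, EventType.ADD);      // unknown to Hive: add it
            } else if (!paths.get(key).equals(e.getValue())) {
                events.put(key, EventType.UPDATE);   // known, but location changed
            }
        }
        return events;
    }

    public static void main(String[] args) {
        Map<List<String>, String> hive = new LinkedHashMap<>();
        hive.put(Arrays.asList("2020", "01", "05"), "/tbl/year=2020/month=01/day=05");

        Map<List<String>, String> storage = new LinkedHashMap<>();
        storage.put(Arrays.asList("2020", "01", "05"), "/tbl/year=2020/month=01/day=05");
        storage.put(Arrays.asList("2020", "05", "01"), "/tbl/year=2020/month=05/day=01");

        System.out.println(diff(hive, storage)); // {2020, 05, 01=ADD}
    }
}
```

With ordered keys, 'year=2020/month=05/day=01' is classified as an add event and the existing partition produces no event.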
**Stacktrace**
```
20/11/05 17:56:17 ERROR HiveSyncTool: Got runtime exception when hive syncing
org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table xtable
    at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:187)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:126)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:87)
    at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:228)
    at org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:278)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:183)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
    at com.pupu.bigdata.wrangling.utils.HudiUtils$.write_hudi_table(HudiUtils.scala:89)
    at com.pupu.bigdata.wrangling.ods.StageToOdsHudi$.writeDataToHudiTbl(StageToOdsHudi.scala:356)
    at com.pupu.bigdata.wrangling.ods.StageToOdsHudi$.execute_cur_hour_etl(StageToOdsHudi.scala:264)
    at com.pupu.bigdata.wrangling.ods.StageToOdsHudi$$anonfun$main$1.apply(StageToOdsHudi.scala:96)
    at com.pupu.bigdata.wrangling.ods.StageToOdsHudi$$anonfun$main$1.apply(StageToOdsHudi.scala:84)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at com.pupu.bigdata.wrangling.ods.StageToOdsHudi$.main(StageToOdsHudi.scala:84)
    at com.pupu.bigdata.wrangling.ods.StageToOdsHudi.main(StageToOdsHudi.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed in executing SQL ALTER TABLE ` xtable` PARTITION (`year`='2020',`month`='11',`day`='01') SET LOCATION 's3://*/year=2020/month=11/day=01'
    at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:488)
    at org.apache.hudi.hive.HoodieHiveClient.updatePartitionsToTable(HoodieHiveClient.java:160)
    at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:185)
    ... 41 more
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10006]: Partition not found {year=2020, month=11, day=01}
    at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
    at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
    at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)
    at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:486)
    ... 43 more
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10006]: Partition not found {year=2020, month=11, day=01}
    at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
    at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:206)
    at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:290)
    at org.apache.hive.service.cli.operation.Operation.run(Operation.java:320)
    at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:530)
    at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:517)
    at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
    at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
    at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
    at com.sun.proxy.$Proxy37.executeStatementAsync(Unknown Source)
    at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:310)
    at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:530)
    at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1437)
    at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1422)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
    at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.parse.SemanticException: Partition not found {year=2020, month=11, day=01}
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.getPartition(BaseSemanticAnalyzer.java:1736)
    at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.addInputsOutputsAlterTable(DDLSemanticAnalyzer.java:1515)
    at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.addInputsOutputsAlterTable(DDLSemanticAnalyzer.java:1479)
    at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeAlterTableLocation(DDLSemanticAnalyzer.java:1567)
    at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:303)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:512)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
    at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1295)
    at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:204)
    ... 26 more
```