subash-metica commented on issue #7057:
URL: https://github.com/apache/hudi/issues/7057#issuecomment-1878374530
Hi,
I am facing this issue again; it happens for random instants, and I cannot see a common pattern.
Hudi version: 0.13.1
The error stack trace:
```
org.apache.hudi.exception.HoodieIOException: Could not read commit details from s3://<dataset-location>/.hoodie/20240103002413315.replacecommit.requested
    at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:824) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:310) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.common.util.ClusteringUtils.getRequestedReplaceMetadata(ClusteringUtils.java:93) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.common.util.ClusteringUtils.getClusteringPlan(ClusteringUtils.java:109) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.client.BaseHoodieTableServiceClient.lambda$getInflightTimelineExcludeCompactionAndClustering$7(BaseHoodieTableServiceClient.java:595) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178) ~[?:?]
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
    at org.apache.hudi.common.table.timeline.HoodieDefaultTimeline.<init>(HoodieDefaultTimeline.java:58) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.common.table.timeline.HoodieDefaultTimeline.filter(HoodieDefaultTimeline.java:236) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.client.BaseHoodieTableServiceClient.getInflightTimelineExcludeCompactionAndClustering(BaseHoodieTableServiceClient.java:593) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.client.BaseHoodieTableServiceClient.getInstantsToRollback(BaseHoodieTableServiceClient.java:737) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:706) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:844) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:156) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:843) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:836) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:371) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:151) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:554) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:107) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:554) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:530) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:97) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:84) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:82) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:856) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:387) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:360) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
Caused by: java.io.FileNotFoundException: No such file or directory 's3:<dataset-location>/.hoodie/20240103002413315.replacecommit.requested'
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:556) ~[emrfs-hadoop-assembly-2.58.0.jar:?]
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:988) ~[emrfs-hadoop-assembly-2.58.0.jar:?]
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:980) ~[emrfs-hadoop-assembly-2.58.0.jar:?]
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:983) ~[hadoop-client-api-3.3.3-amzn-5.jar:?]
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:197) ~[emrfs-hadoop-assembly-2.58.0.jar:?]
    at org.apache.hudi.common.fs.HoodieWrapperFileSystem.open(HoodieWrapperFileSystem.java:476) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:821) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    ... 94 more
```
Looking at `timeline show incomplete` in the Hudi CLI, this is the view:
<img width="674" alt="Screenshot 2024-01-05 at 14 41 40"
src="https://github.com/apache/hudi/assets/142811976/21099107-4570-408a-b5aa-958efcf2f31d">
This instant happened randomly; nothing changed on our side. We have a job that writes data to a Hudi table every hour. Suddenly, in one execution, the replacecommit.inflight and replacecommit.requested files had different timestamps (I am not sure how that could happen).
As a workaround, I renamed the replacecommit.requested file so its timestamp matches the .inflight file, then reran the job for that partition to fix the data. That worked fine, but unfortunately I do not know how to reproduce the issue.
Could this be a race condition? Or is there an alternative fix for this issue?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]