subash-metica commented on issue #7057:
URL: https://github.com/apache/hudi/issues/7057#issuecomment-1878374530

   Hi,
   I am facing this issue again. It happens at random, with no common pattern that I could see.
   
   Hudi version: 0.13.1
   
   The error stack trace:
   
   ```
   org.apache.hudi.exception.HoodieIOException: Could not read commit details from s3://<dataset-location>/.hoodie/20240103002413315.replacecommit.requested
        at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:824) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:310) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at org.apache.hudi.common.util.ClusteringUtils.getRequestedReplaceMetadata(ClusteringUtils.java:93) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at org.apache.hudi.common.util.ClusteringUtils.getClusteringPlan(ClusteringUtils.java:109) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at org.apache.hudi.client.BaseHoodieTableServiceClient.lambda$getInflightTimelineExcludeCompactionAndClustering$7(BaseHoodieTableServiceClient.java:595) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178) ~[?:?]
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
        at org.apache.hudi.common.table.timeline.HoodieDefaultTimeline.<init>(HoodieDefaultTimeline.java:58) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at org.apache.hudi.common.table.timeline.HoodieDefaultTimeline.filter(HoodieDefaultTimeline.java:236) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at org.apache.hudi.client.BaseHoodieTableServiceClient.getInflightTimelineExcludeCompactionAndClustering(BaseHoodieTableServiceClient.java:593) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at org.apache.hudi.client.BaseHoodieTableServiceClient.getInstantsToRollback(BaseHoodieTableServiceClient.java:737) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:706) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:844) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:156) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:843) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:836) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:371) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:151) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:554) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:107) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:554) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:530) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:97) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:84) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:82) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:856) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:387) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:360) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]

   Caused by: java.io.FileNotFoundException: No such file or directory 's3:<dataset-location>/.hoodie/20240103002413315.replacecommit.requested'
        at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:556) ~[emrfs-hadoop-assembly-2.58.0.jar:?]
        at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:988) ~[emrfs-hadoop-assembly-2.58.0.jar:?]
        at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:980) ~[emrfs-hadoop-assembly-2.58.0.jar:?]
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:983) ~[hadoop-client-api-3.3.3-amzn-5.jar:?]
        at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:197) ~[emrfs-hadoop-assembly-2.58.0.jar:?]
        at org.apache.hudi.common.fs.HoodieWrapperFileSystem.open(HoodieWrapperFileSystem.java:476) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:821) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
        ... 94 more
   ```
   
   While looking at `timeline show incomplete`, below is the view:
   <img width="674" alt="Screenshot 2024-01-05 at 14 41 40" src="https://github.com/apache/hudi/assets/142811976/21099107-4570-408a-b5aa-958efcf2f31d">
   
   
   This instant appeared randomly; nothing changed on our side. We have a job that writes the data as a Hudi table periodically, every hour. Suddenly, in one execution, the replacecommit.inflight and replacecommit.requested files ended up with different timestamps (I am not sure how that could happen).
   As a workaround, I renamed the replacecommit.requested file so its timestamp matched the .inflight file, then reran the job for that partition to fix the data. That worked fine, but unfortunately I am not sure how to reproduce the problem.
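   The rename workaround can be sketched as a small shell script. Everything below is hypothetical: the directory is a local stand-in for `s3://<dataset-location>/.hoodie` (against S3 the real fix would use `aws s3 mv` or `hadoop fs -mv` instead of plain `mv`), and the `.inflight` instant time is made up, since the real one is visible only in the CLI screenshot.

   ```shell
   # Local stand-in for the .hoodie timeline directory on S3 (hypothetical).
   HOODIE_DIR=$(mktemp -d)
   REQUESTED_TS=20240103002413315   # instant time on the orphaned .requested file (from the stack trace)
   INFLIGHT_TS=20240103002413999    # hypothetical instant time on the .inflight file

   # Simulate the mismatched pair left behind by the failed clustering attempt.
   touch "${HOODIE_DIR}/${REQUESTED_TS}.replacecommit.requested"
   touch "${HOODIE_DIR}/${INFLIGHT_TS}.replacecommit.inflight"

   # The workaround: rename the .requested file so its instant time matches
   # the .inflight file, giving the timeline a consistent requested/inflight pair.
   mv "${HOODIE_DIR}/${REQUESTED_TS}.replacecommit.requested" \
      "${HOODIE_DIR}/${INFLIGHT_TS}.replacecommit.requested"

   ls "${HOODIE_DIR}"
   ```

   After the rename, the timeline contains `20240103002413999.replacecommit.requested` and `20240103002413999.replacecommit.inflight` with matching instant times, so the rollback path no longer looks for the missing file.
   
   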
   
   Could this be a race condition? Or is there any alternative way to handle this issue?
   

