subash-metica commented on issue #7057:
URL: https://github.com/apache/hudi/issues/7057#issuecomment-1878374530
Hi,
I am facing this issue again; it happens for random instants, and I cannot see a common pattern.
Hudi version: 0.13.1
The error stack trace:
```
org.apache.hudi.exception.HoodieIOException: Could not read commit details from s3://<dataset-location>/.hoodie/20240103002413315.replacecommit.requested
    at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:824) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:310) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.common.util.ClusteringUtils.getRequestedReplaceMetadata(ClusteringUtils.java:93) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.common.util.ClusteringUtils.getClusteringPlan(ClusteringUtils.java:109) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.client.BaseHoodieTableServiceClient.lambda$getInflightTimelineExcludeCompactionAndClustering$7(BaseHoodieTableServiceClient.java:595) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178) ~[?:?]
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
    at org.apache.hudi.common.table.timeline.HoodieDefaultTimeline.<init>(HoodieDefaultTimeline.java:58) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.common.table.timeline.HoodieDefaultTimeline.filter(HoodieDefaultTimeline.java:236) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.client.BaseHoodieTableServiceClient.getInflightTimelineExcludeCompactionAndClustering(BaseHoodieTableServiceClient.java:593) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.client.BaseHoodieTableServiceClient.getInstantsToRollback(BaseHoodieTableServiceClient.java:737) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:706) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.client.BaseHoodieWriteClient.lambda$startCommitWithTime$97cdbdca$1(BaseHoodieWriteClient.java:844) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:156) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:843) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.client.BaseHoodieWriteClient.startCommitWithTime(BaseHoodieWriteClient.java:836) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:371) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:151) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:554) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:107) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:554) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:530) ~[spark-catalyst_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:97) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:84) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:82) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:856) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:387) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:360) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239) ~[spark-sql_2.12-3.4.1-amzn-0.jar:3.4.1-amzn-0]
Caused by: java.io.FileNotFoundException: No such file or directory 's3:<dataset-location>/.hoodie/20240103002413315.replacecommit.requested'
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:556) ~[emrfs-hadoop-assembly-2.58.0.jar:?]
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:988) ~[emrfs-hadoop-assembly-2.58.0.jar:?]
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:980) ~[emrfs-hadoop-assembly-2.58.0.jar:?]
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:983) ~[hadoop-client-api-3.3.3-amzn-5.jar:?]
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:197) ~[emrfs-hadoop-assembly-2.58.0.jar:?]
    at org.apache.hudi.common.fs.HoodieWrapperFileSystem.open(HoodieWrapperFileSystem.java:476) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:821) ~[hudi-spark3-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1]
    ... 94 more
```
Looking at `timeline show incomplete` in the Hudi CLI, this is the view:
<img width="674" alt="Screenshot 2024-01-05 at 14 41 40"
src="https://github.com/apache/hudi/assets/142811976/21099107-4570-408a-b5aa-958efcf2f31d">
This instant happened randomly; nothing changed on our side. We have a job that writes data to a Hudi table every hour. Suddenly, in one execution, the replacecommit.inflight and replacecommit.requested files had different timestamps (I am not sure how that could happen).
As a workaround, I renamed the replacecommit.requested file so its timestamp matches the .inflight file, then reran the job for that partition to fix the data. That worked fine, but unfortunately I do not know how to reproduce the issue.
Could this be a race condition? Or is there an alternative fix for this issue?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]