ssdong commented on pull request #2784:
URL: https://github.com/apache/hudi/pull/2784#issuecomment-820928674
@satishkotha @lw309637554 Just to share some updates: this PR fixed the
following two issues during archival:
1. `Positive number of partitions required`
2. `java.util.NoSuchElementException: No value present in Option`
However, the aforementioned
```
// Initialize with new Hoodie timeline.
init(metaClient, getTimeline());
```
does cause a third issue during archival, `java.io.FileNotFoundException: File file:/Users/susu.dong/Dev/clustering-insert-overwrite-test/.hoodie/20210415220131.replacecommit does not exist`, if we turn _off_ `hoodie.clean.automatic`, the automatic-cleaner option, which is `true` by default.
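For context, this is roughly the kind of write that hits it. A minimal sketch only, assuming the Hudi Spark bundle is on the classpath; the table name, field names, path, and the aggressive archival/cleaner bounds are illustrative, not the exact setup behind the stacktrace below:
```
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("archival-repro")
  .master("local[2]")
  .getOrCreate()

val basePath = "/tmp/clustering-insert-overwrite-test" // illustrative path

// Several insert_overwrite commits so replacecommits accumulate and archival kicks in.
(1 to 6).foreach { i =>
  spark.range(0, 100)
    .selectExpr("id as key", s"$i as ts", "id % 4 as part")
    .write.format("hudi")
    .option("hoodie.table.name", "archival_repro")
    .option("hoodie.datasource.write.recordkey.field", "key")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.partitionpath.field", "part")
    .option("hoodie.datasource.write.operation", "insert_overwrite")
    .option("hoodie.clean.automatic", "false")      // cleaner off, as described above
    .option("hoodie.cleaner.commits.retained", "1") // illustrative: keep archival bounds valid
    .option("hoodie.keep.min.commits", "2")         // illustrative: archive aggressively
    .option("hoodie.keep.max.commits", "3")
    .mode(SaveMode.Append)
    .save(basePath)
}
```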
Turning off the cleaner leaves the internally maintained timeline out of sync with the commit files actually on disk: archival removes the commit files, while the `init` call still references the instants being removed/archived. That stale reference propagates down to the `readDataFromPath` method call, which ultimately throws the exception.
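As a thought on the direction, one option is to make the read path tolerate an instant file that the archiver has already moved away, instead of failing the whole file-system view reset. Below is only a sketch of that idea, not Hudi's actual `readDataFromPath`; the method name and the `Option`-based contract are my own assumptions:
```
import java.io.ByteArrayOutputStream

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Sketch only: read instant details, but treat a file that archival has
// already removed as "no details" rather than as a fatal error.
def readInstantDetailsIfPresent(fs: FileSystem, detailsPath: Path): Option[Array[Byte]] = {
  if (!fs.exists(detailsPath)) {
    None // the archiver may have just moved this instant out of .hoodie
  } else {
    val in  = fs.open(detailsPath)
    val out = new ByteArrayOutputStream()
    try {
      IOUtils.copyBytes(in, out, 4096, false) // buffered copy, keep stream open
      Some(out.toByteArray)
    } finally {
      in.close()
    }
  }
}
```
There is still a race between the existence check and the open, so catching `FileNotFoundException` around the read would be the more robust variant; the sketch just illustrates the intent.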
Full stacktrace:
```
org.apache.hudi.exception.HoodieIOException: Could not read commit details from /Users/susu.dong/Dev/clustering-insert-overwrite-test/.hoodie/20210415220131.replacecommit
	at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:561)
	at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:225)
	at org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$resetFileGroupsReplaced$8(AbstractTableFileSystemView.java:217)
	at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:271)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
	at org.apache.hudi.common.table.view.AbstractTableFileSystemView.resetFileGroupsReplaced(AbstractTableFileSystemView.java:228)
	at org.apache.hudi.common.table.view.AbstractTableFileSystemView.init(AbstractTableFileSystemView.java:106)
	at org.apache.hudi.common.table.view.HoodieTableFileSystemView.init(HoodieTableFileSystemView.java:106)
	at org.apache.hudi.common.table.view.AbstractTableFileSystemView.reset(AbstractTableFileSystemView.java:248)
	at org.apache.hudi.common.table.view.HoodieTableFileSystemView.close(HoodieTableFileSystemView.java:353)
	at java.base/java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4772)
	at org.apache.hudi.common.table.view.FileSystemViewManager.close(FileSystemViewManager.java:118)
	at org.apache.hudi.timeline.service.TimelineService.close(TimelineService.java:207)
	at org.apache.hudi.client.embedded.EmbeddedTimelineService.stop(EmbeddedTimelineService.java:119)
	at org.apache.hudi.client.AbstractHoodieClient.stopEmbeddedServerView(AbstractHoodieClient.java:94)
	at org.apache.hudi.client.AbstractHoodieClient.close(AbstractHoodieClient.java:86)
	at org.apache.hudi.client.AbstractHoodieWriteClient.close(AbstractHoodieWriteClient.java:1047)
	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:505)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:225)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:161)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
```
We don't have to fix it in this PR; however, we do need to find a better
solution that does not load the full active timeline, since doing so breaks a
fundamental assumption. Is it possible for one of you to reach out to @bvaradar
to get his attention on this? Thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]