Jason-liujc opened a new issue, #11684:
URL: https://github.com/apache/hudi/issues/11684
**Describe the problem you faced**
When running a Hudi incremental query over a past time range whose underlying
files the cleaner has already deleted, the job fails with a `PATH_NOT_FOUND` error.
**To Reproduce**
Steps to reproduce the behavior:
1. Async clean an existing Hudi table using the Keep File Versions policy. The
cleaner cleans commits within the time range `t1` to `t4`.
2. Run a Spark job that reads all updates between `t2` and `t3`,
where `t1 < t2 < t3 < t4`.
3. The job fails with this error:
```
Caused by: org.apache.spark.sql.AnalysisException: [PATH_NOT_FOUND] Path does not exist: s3://profit-sphere-cdo-datastore-data-prod/version=1/tenant=Amazon/database=ComponentOutputs/table=WarehouseDealsLiquidationRevenue/DwRegionPartition=NA/ShipDayPartition=2024-06-25/958d5ceb-85cd-4068-843b-c2eeaa23de2a-0_1089-24-38711_20240629163803936.parquet.
at org.apache.spark.sql.errors.QueryCompilationErrors$.dataPathNotExistError(QueryCompilationErrors.scala:1424)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:757)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:754)
at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:393)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
```
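For reference, the setup in steps 1 and 2 can be sketched roughly as the following option sets. The option keys are the usual Hudi cleaner and incremental-read configs as I understand them for 0.14; the helper names, the retained-version count, and the placeholder instants are illustrative assumptions, not values from the failing job.

```python
def cleaner_write_options(versions_retained: int = 1) -> dict:
    """Writer-side options for async cleaning with the Keep File Versions policy.

    `versions_retained` is an assumed value for illustration only.
    """
    return {
        "hoodie.clean.async": "true",
        "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
        "hoodie.cleaner.fileversions.retained": str(versions_retained),
    }


def incremental_read_options(begin_instant: str, end_instant: str) -> dict:
    """Reader-side options for an incremental query over the given instant range."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
        "hoodie.datasource.read.end.instanttime": end_instant,
    }


# Usage with an existing SparkSession `spark` (not executed here);
# `t2`, `t3`, and `base_path` are placeholders:
# df = (spark.read.format("hudi")
#       .options(**incremental_read_options(t2, t3))
#       .load(base_path))
```

With the cleaner having removed file versions written between `t2` and `t3`, the `load` call is where the `PATH_NOT_FOUND` above surfaces.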
**Expected behavior**
Hudi should throw a clearer error, such as `Cannot perform incremental query
between time range t2 and t3. Some files might have been deleted by the cleaner`.
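A minimal sketch of the kind of check I have in mind, written in plain Python rather than against Hudi internals. Every name here (`IncrementalRangeUnavailableError`, `check_incremental_files`, the `path_exists` callback) is hypothetical; the idea is only that the files referenced by commits in the requested range are verified before being handed to Spark.

```python
from typing import Callable, Iterable


class IncrementalRangeUnavailableError(Exception):
    """Hypothetical error type: the requested range references deleted files."""


def check_incremental_files(
    begin_instant: str,
    end_instant: str,
    required_paths: Iterable[str],
    path_exists: Callable[[str], bool],
) -> None:
    """Raise a descriptive error if any file needed by the range is missing.

    `required_paths` stands in for the files listed in the commit metadata of
    the requested range; `path_exists` stands in for a storage existence check.
    """
    missing = [p for p in required_paths if not path_exists(p)]
    if missing:
        raise IncrementalRangeUnavailableError(
            f"Cannot perform incremental query between time range "
            f"{begin_instant} and {end_instant}. Some files might have been "
            f"deleted by the cleaner ({len(missing)} missing, e.g. {missing[0]})."
        )
```

The check passes silently when every file is present, so it would only change behavior in the failure case described here.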
**Environment Description**
* Hudi version : 0.14.0
* Spark version : 3.4
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No
**Additional context**
Running on AWS EMR 6.15.
It's understandable that the incremental query fails in this scenario; I just
wish the error were clearer. I discussed this with @xushiyan offline and he
suggested filing an issue.
**Stacktrace**
```
Caused by: org.apache.spark.sql.AnalysisException: [PATH_NOT_FOUND] Path does not exist: s3://profit-sphere-cdo-datastore-data-prod/version=1/tenant=Amazon/database=ComponentOutputs/table=WarehouseDealsLiquidationRevenue/DwRegionPartition=NA/ShipDayPartition=2024-06-25/958d5ceb-85cd-4068-843b-c2eeaa23de2a-0_1089-24-38711_20240629163803936.parquet.
at org.apache.spark.sql.errors.QueryCompilationErrors$.dataPathNotExistError(QueryCompilationErrors.scala:1424)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:757)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:754)
at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:393)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
```