[I] [SUPPORT] PATH_NOT_FOUND Error when Running Hudi Incremental Query [hudi]

via GitHub Wed, 24 Jul 2024 17:13:41 -0700


Jason-liujc opened a new issue, #11684:
URL: https://github.com/apache/hudi/issues/11684


   
   **Describe the problem you faced**
   
   When running a Hudi incremental query over a past time range whose underlying 
data files the cleaner has already deleted, the job fails with a `PATH_NOT_FOUND` error.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Run async cleaning on an existing Hudi table using the Keep File Versions 
criteria. The cleaner cleans commits within the time range `t1` to `t4`.
   2. Run a Spark job that reads all updates between `t2` and `t3`, 
where `t1 < t2 < t3 < t4`.
   3. The job fails with this error:
   
   ```
   Caused by: org.apache.spark.sql.AnalysisException: [PATH_NOT_FOUND] Path does not exist: s3://profit-sphere-cdo-datastore-data-prod/version=1/tenant=Amazon/database=ComponentOutputs/table=WarehouseDealsLiquidationRevenue/DwRegionPartition=NA/ShipDayPartition=2024-06-25/958d5ceb-85cd-4068-843b-c2eeaa23de2a-0_1089-24-38711_20240629163803936.parquet.
        at org.apache.spark.sql.errors.QueryCompilationErrors$.dataPathNotExistError(QueryCompilationErrors.scala:1424)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:757)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:754)
        at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:393)
        at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
   ```
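
For reference, the read in step 2 can be sketched roughly as follows. This is a minimal illustration, not the job from this report: the option keys are Hudi's Spark datasource read options as understood for 0.14, while the helper name `incremental_read_options` and the table path in the usage comment are assumptions made for the example. The optional fallback key is believed to make Hudi fall back to a full table scan when files are missing; treat it as an assumption and verify it against the configuration reference for your version.

```python
from typing import Dict


def incremental_read_options(begin_instant: str, end_instant: str,
                             fallback_to_full_scan: bool = False) -> Dict[str, str]:
    """Build Spark datasource options for a Hudi incremental read.

    Hypothetical helper for illustration; only the option keys come from Hudi.
    """
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
        "hoodie.datasource.read.end.instanttime": end_instant,
    }
    if fallback_to_full_scan:
        # Assumed key: verify against the Hudi configuration docs before relying on it.
        options["hoodie.datasource.read.incr.fallback.fulltablescan.enable"] = "true"
    return options


# Usage with an existing SparkSession `spark` (not executed here):
# df = (spark.read.format("hudi")
#       .options(**incremental_read_options(t2, t3))
#       .load("s3://<bucket>/<table-path>"))
```

With the default options, a read like this is where the `PATH_NOT_FOUND` error surfaces once the cleaner has removed files referenced by commits in the requested range.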
   
   **Expected behavior**
   
   Hudi should throw a clearer error, such as `Cannot perform incremental query 
between time range t2 and t3. Some files might have been deleted by the cleaner`.
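
One way to picture the requested behavior is a pre-flight check that runs before the file paths are handed to Spark. The sketch below is purely illustrative and is not how Hudi implements anything: `IncrementalQueryRangeError`, `check_incremental_files`, and the `exists` callback are hypothetical names, and the list of paths is assumed to have been resolved already from the commits in the requested range.

```python
from typing import Callable, Iterable


class IncrementalQueryRangeError(Exception):
    """Hypothetical error type for this sketch; not part of Hudi."""


def check_incremental_files(begin_instant: str, end_instant: str,
                            paths: Iterable[str],
                            exists: Callable[[str], bool]) -> None:
    """Raise a descriptive error if any file needed by the query is gone."""
    missing = [p for p in paths if not exists(p)]
    if missing:
        raise IncrementalQueryRangeError(
            f"Cannot perform incremental query between time range "
            f"{begin_instant} and {end_instant}. Some files might have been "
            f"deleted by the cleaner ({len(missing)} missing, e.g. {missing[0]})"
        )
```

The `exists` callback keeps the check independent of the storage layer, so it can be backed by an S3 or HDFS existence call in a real job and by a simple set lookup in tests.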
   
   **Environment Description**
   
   * Hudi version : 0.14.0
   
   * Spark version : 3.4
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   **Additional context**
   
   Running on AWS EMR 6.15.
   
   It's understandable that the incremental query fails in this scenario; I just 
wish the error were clearer. I discussed this with @xushiyan offline and he 
suggested filing an issue.
   
   **Stacktrace**
   
   ```
   Caused by: org.apache.spark.sql.AnalysisException: [PATH_NOT_FOUND] Path does not exist: s3://profit-sphere-cdo-datastore-data-prod/version=1/tenant=Amazon/database=ComponentOutputs/table=WarehouseDealsLiquidationRevenue/DwRegionPartition=NA/ShipDayPartition=2024-06-25/958d5ceb-85cd-4068-843b-c2eeaa23de2a-0_1089-24-38711_20240629163803936.parquet.
        at org.apache.spark.sql.errors.QueryCompilationErrors$.dataPathNotExistError(QueryCompilationErrors.scala:1424)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:757)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:754)
        at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:393)
        at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
        at scala.util.Success.$anonfun$map$1(Try.scala:255)
        at scala.util.Success.map(Try.scala:213)
        at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
        at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
   ```
   
   


