[GitHub] [hudi] t0il3ts0ap opened a new issue #2934: [SUPPORT] Parquet file does not exist when trying to read hudi table incrementally

GitBox Mon, 10 May 2021 06:59:58 -0700


t0il3ts0ap opened a new issue #2934:
URL: https://github.com/apache/hudi/issues/2934



   **Describe the problem you faced**
   
   My aim is to read an existing Hudi table (COW) using DeltaStreamer, apply some transformations, and write the result to another (fresh) table. I am using DeltaStreamer so that checkpointing can be automated.
   
   Relevant Hudi configs used for DeltaStreamer:
   ```
   --hoodie-conf hoodie.parquet.compression.codec=snappy 
   --table-type COPY_ON_WRITE 
   --source-class org.apache.hudi.utilities.sources.HoodieIncrSource 
   --hoodie-conf hoodie.deltastreamer.source.hoodieincr.path=s3://poc-bucket/raw-data/customer_service/credit_analysis_data 
   --hoodie-conf hoodie.deltastreamer.source.hoodieincr.partition.extractor.class=org.apache.hudi.hive.NonPartitionedExtractor 
   --hoodie-conf hoodie.deltastreamer.source.hoodieincr.partition.fields='' 
   --hoodie-conf hoodie.deltastreamer.source.hoodieincr.num_instants=1 
   --enable-sync 
   --checkpoint 0
   ```
   
   
   The first run of DeltaStreamer failed with:
   ```
   Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: s3://poc-bucket/raw-data/customer_service/credit_analysis_data/default/c67b4ac1-4597-4896-81c5-dc70b2f62892-1_0-23-13659_20210508062123.parquet;
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:764)
        at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
        at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
        at scala.collection.immutable.List.flatMap(List.scala:355)
        at org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:751)
        at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:580)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:405)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
        at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:755)
        at org.apache.hudi.IncrementalRelation.buildScan(IncrementalRelation.scala:151)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:313)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
        at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:69)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
        at scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:162)
        at scala.collection.TraversableOnce.$anonfun$foldLeft$1$adapted(TraversableOnce.scala:162)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
   ``` 
   
   The original table is a couple of months old. At any given moment, I find commits spanning the last 3 days in its .hoodie directory. 
   Surprisingly, the parquet files referenced in most of those commits do not exist.
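
To see which commit wrote the missing file, the file name from the error can be split into its parts. This is a minimal sketch, assuming Hudi's usual base file naming of `<fileId>_<writeToken>_<instantTime>.parquet`; the helper `parse_base_file_name` and the `BaseFileName` type are hypothetical names for illustration, not part of any Hudi API.

```python
from datetime import datetime
from typing import NamedTuple


class BaseFileName(NamedTuple):
    """Parts of a Hudi base file name (hypothetical helper type)."""
    file_id: str
    write_token: str
    instant_time: datetime


def parse_base_file_name(path: str) -> BaseFileName:
    """Split a base file path into file ID, write token, and commit instant.

    Assumes the name has the form <fileId>_<writeToken>_<instantTime>.parquet,
    with the instant time formatted as yyyyMMddHHmmss.
    """
    # Drop the trailing ';' that Spark appends in the error message,
    # then keep only the last path component.
    name = path.rstrip(";").rsplit("/", 1)[-1]
    if not name.endswith(".parquet"):
        raise ValueError(f"not a parquet base file: {name}")
    stem = name[: -len(".parquet")]
    parts = stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected base file name layout: {name}")
    file_id, write_token, instant = parts
    return BaseFileName(
        file_id, write_token, datetime.strptime(instant, "%Y%m%d%H%M%S")
    )


missing = parse_base_file_name(
    "s3://poc-bucket/raw-data/customer_service/credit_analysis_data/default/"
    "c67b4ac1-4597-4896-81c5-dc70b2f62892-1_0-23-13659_20210508062123.parquet;"
)
print(missing.instant_time)  # 2021-05-08 06:21:23
```

The parsed instant can then be compared against the commit files present under .hoodie to check whether that commit is still on the timeline.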
   
   I get the same error when running a Hudi incremental query in spark-shell.
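
For reference, an incremental read of this kind can be sketched roughly as below. This is a PySpark sketch, not the exact spark-shell (Scala) session from the report. The function names are hypothetical, and the begin instant of `"0"` is an assumption that mirrors the `--checkpoint 0` setting above. It also assumes the Hudi Spark bundle is on the classpath.

```python
from typing import Dict


def incremental_read_options(begin_instant: str) -> Dict[str, str]:
    """Build Hudi datasource options for an incremental query.

    begin_instant is a commit time such as "20210508062123";
    "0" asks for everything from the start of the timeline.
    """
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


def read_incremental(spark, base_path: str, begin_instant: str = "0"):
    """Load rows committed after begin_instant from the table at base_path.

    `spark` is an existing SparkSession; it is passed in so that this
    sketch does not have to create one.
    """
    return (
        spark.read.format("hudi")
        .options(**incremental_read_options(begin_instant))
        .load(base_path)
    )
```

Calling `read_incremental(spark, "s3://poc-bucket/raw-data/customer_service/credit_analysis_data")` would be the analogue of the failing query, under the assumptions stated above.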
   
   * Hudi version : 0.7.0
   
   * Spark version : 3.0.2 with Scala 2.12
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   


