t0il3ts0ap opened a new issue #2934:
URL: https://github.com/apache/hudi/issues/2934
**Describe the problem you faced**
My aim is to read an existing Hudi table (COW) using deltastreamer, apply some
transformations, and write the result to another (fresh) table. I am using
deltastreamer so that checkpointing can be automated.
Relevant Hudi configs used for deltastreamer:
```
--hoodie-conf hoodie.parquet.compression.codec=snappy
--table-type COPY_ON_WRITE
--source-class org.apache.hudi.utilities.sources.HoodieIncrSource
--hoodie-conf hoodie.deltastreamer.source.hoodieincr.path=s3://poc-bucket/raw-data/customer_service/credit_analysis_data
--hoodie-conf hoodie.deltastreamer.source.hoodieincr.partition.extractor.class=org.apache.hudi.hive.NonPartitionedExtractor
--hoodie-conf hoodie.deltastreamer.source.hoodieincr.partition.fields=''
--hoodie-conf hoodie.deltastreamer.source.hoodieincr.num_instants=1
--enable-sync
--checkpoint 0
```
The first run of deltastreamer failed with:
```
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: s3://poc-bucket/raw-data/customer_service/credit_analysis_data/default/c67b4ac1-4597-4896-81c5-dc70b2f62892-1_0-23-13659_20210508062123.parquet;
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:764)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:751)
at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:580)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:405)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:755)
at org.apache.hudi.IncrementalRelation.buildScan(IncrementalRelation.scala:151)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:313)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:69)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
at scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:162)
at scala.collection.TraversableOnce.$anonfun$foldLeft$1$adapted(TraversableOnce.scala:162)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
```
The original table is a couple of months old. At any given moment, its
.hoodie directory contains commits spanning the last 3 days.
Surprisingly, the parquet files referenced in most of these commits do not exist.
I get the same error when running a Hudi incremental query in
spark-shell.
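
A minimal sketch of how the missing-file observation above could be checked, in case it helps reproduce it. It assumes each `.commit` file under `.hoodie` is JSON with a `partitionToWriteStats` map whose entries carry a table-relative `path`; that layout is my understanding of the commit metadata, not something verified against 0.7.0 here. The function names are hypothetical, and the listing of files actually present (for example from an S3 listing) is passed in as a plain set rather than fetched.

```python
import json
from typing import Dict, List, Set


def referenced_paths(commit_json: str) -> List[str]:
    """Return the table-relative data-file paths listed in one commit's metadata."""
    metadata = json.loads(commit_json)
    paths = []
    # Assumed layout: {"partitionToWriteStats": {partition: [{"path": ...}, ...]}}
    for write_stats in metadata.get("partitionToWriteStats", {}).values():
        for stat in write_stats:
            path = stat.get("path")
            if path:
                paths.append(path)
    return paths


def missing_files(commits: Dict[str, str], existing: Set[str]) -> Dict[str, List[str]]:
    """Map each commit instant to the referenced paths that are absent from `existing`."""
    report = {}
    for instant, commit_json in sorted(commits.items()):
        missing = [p for p in referenced_paths(commit_json) if p not in existing]
        if missing:
            report[instant] = missing
    return report


if __name__ == "__main__":
    # Sample built from the file name in the stack trace; the JSON body is illustrative.
    sample_commit = json.dumps({
        "partitionToWriteStats": {
            "default": [
                {"path": "default/c67b4ac1-4597-4896-81c5-dc70b2f62892-1_0-23-13659_20210508062123.parquet"}
            ]
        }
    })
    # An empty set stands in for "nothing with that name is present under the table path".
    for instant, paths in missing_files({"20210508062123": sample_commit}, set()).items():
        print(instant, paths)
```

In practice `commits` would be filled from the contents of the `*.commit` files and `existing` from a recursive listing of the table path, both made relative to the table root so the strings compare directly.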
* Hudi version : 0.7.0
* Spark version : 3.0.2 with scala 2.12
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no