remeajayi2022 opened a new issue, #10519:
URL: https://github.com/apache/hudi/issues/10519

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   Yes
   
   - Join the mailing list to engage in conversations and get faster support at 
[email protected].
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   During a Spark Structured Streaming read from a Hudi table, the Spark job
fails while trying to access cleaned parquet files on S3. To address this, I
set `hoodie.datasource.read.incr.fallback.fulltablescan.enable` to true. With
this configuration, the Spark job runs for ~20 minutes and then fails again with a
`FileNotFoundException`. This job is designed to read from a
Hudi Table A and write to a Hudi Table B.
   
   **To Reproduce**
   1. Create a Deltastreamer job that reads from a Kafka topic and writes to 
Table A with the following configurations:
   
   ```
            --Cleaning Policy
           "hoodie.cleaner.policy":"KEEP_LATEST_COMMITS" #default
           "hoodie.cleaner.commits.retained":"10" #default
           
           --Archiving Policy
           "hoodie.keep.min.commits":"100"
           "hoodie.keep.max.commits":"130"
   ```
   
   2. Create a PySpark Structured Streaming job that reads from Table A with
the following configurations:
     
   ```
          "hoodie.enable.data.skipping": "true"
          "hoodie.datasource.read.incr.fallback.fulltablescan.enable": "true"

          "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
          "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
          "hoodie.metadata.enable": "true"

          "hoodie.datasource.query.type": "incremental"
          "hoodie.datasource.read.begin.instanttime": "20231106000000"
   ```
   
     
   
   3. Let the job in step 2 fall behind (e.g., by pausing it) long enough that it
needs commits whose files have already been cleaned. This ensures the `fulltablescan`
configuration is exercised to address the `PathNotFound` error.
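   The writer configs in step 1 can be checked against the usual Hudi consistency constraint (the cleaner retention window must be smaller than the archival minimum, so archived commits never reference files the cleaner still guards). A minimal sketch of that check, assuming the values above; the helper name is illustrative, not part of Hudi:

   ```python
   # Hypothetical sanity check, not part of Hudi itself: with KEEP_LATEST_COMMITS,
   # the cleaner window must end before the archival window begins, i.e.
   # cleaner.commits.retained < keep.min.commits <= keep.max.commits.
   def retention_is_consistent(cfg: dict) -> bool:
       retained = int(cfg["hoodie.cleaner.commits.retained"])
       keep_min = int(cfg["hoodie.keep.min.commits"])
       keep_max = int(cfg["hoodie.keep.max.commits"])
       return retained < keep_min <= keep_max

   writer_cfg = {
       "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
       "hoodie.cleaner.commits.retained": "10",
       "hoodie.keep.min.commits": "100",
       "hoodie.keep.max.commits": "130",
   }
   print(retention_is_consistent(writer_cfg))  # True
   ```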
   
   **Expected behavior**
   The Pyspark job should successfully read from Table A and write to Table B. 
There should be parquet files present in Table B.
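   For reference, a minimal sketch of how the step-2 reader options would be applied; the `apply_options` helper, the commented-out table path, and the `spark` session are illustrative only, not the actual job code:

   ```python
   # Reader options from step 2, gathered into one dict.
   hudi_read_options = {
       "hoodie.enable.data.skipping": "true",
       "hoodie.datasource.read.incr.fallback.fulltablescan.enable": "true",
       "hoodie.metadata.enable": "true",
       "hoodie.datasource.query.type": "incremental",
       "hoodie.datasource.read.begin.instanttime": "20231106000000",
   }

   def apply_options(reader, options):
       # Chain .option(key, value) calls onto a DataFrameReader/DataStreamReader.
       for key, value in options.items():
           reader = reader.option(key, value)
       return reader

   # Usage (requires a live SparkSession; shown for illustration only):
   # df = apply_options(spark.readStream.format("hudi"), hudi_read_options) \
   #          .load("s3a://bucket_name/table_a_prefix/public/table_name")
   ```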
   
   **Environment Description**
   
   * Hudi version : 0.14.0
   
   * Spark version : 3.1.1
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   **Additional context**
   There is no data written by the job while it's doing a full table scan for 
those 20 minutes before failure. Is this the expected behaviour?
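   My understanding of the fallback, as logged by `IncrementalRelation` above, in a simplified sketch (this is an illustration of the decision, not Hudi's actual code): when the requested begin instant is older than the earliest instant on the active timeline, it is treated as archived, and with the fallback flag enabled the reader does a full table scan instead of an incremental read.

   ```python
   def falls_back_to_full_scan(begin_instant: str, active_instants: list,
                               fallback_enabled: bool) -> bool:
       # Hudi instant times are lexicographically ordered timestamps
       # (yyyyMMddHHmmss...), so string comparison orders them correctly.
       start_archived = begin_instant < min(active_instants)
       return fallback_enabled and start_archived

   active = ["20240115120000", "20240116120000", "20240117035712"]
   print(falls_back_to_full_scan("20231106000000", active, True))  # True
   ```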
   
   Please note that the Hudi JAR I'm using is the stable 0.14.0 release built 
with the changes to `HoodieAsyncService.java`, `WriteStatus.java` and 
`TimestampBasedAvroKeyGenerator.java` shown 
[here](https://github.com/apache/hudi/compare/master...sydneyhoran:hudi:master).
   
   **Stacktrace**
   
   ```
   24/01/17 03:57:07 INFO org.apache.hudi.IncrementalRelation: Inferring schema..
   24/01/17 03:57:14 INFO org.apache.hudi.IncrementalRelation: Falling back to 
full table scan as startInstantArchived: true, endInstantArchived: false
   24/01/17 03:57:14 INFO org.apache.hudi.DataSourceUtils: Getting table path..
   24/01/17 03:57:14 INFO org.apache.hudi.common.util.TablePathUtils: Getting 
table path from path : s3a://bucket_name/table_a_prefix_901_ia/public/table_name
   24/01/17 03:57:14 INFO org.apache.hudi.DefaultSource: Obtained hudi table 
path: s3a://bucket_name/table_a_prefix_901_ia/public/table_name
   24/01/17 03:57:14 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading HoodieTableMetaClient from 
s3a://bucket_name/table_a_prefix_901_ia/public/table_name
   24/01/17 03:57:15 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name/table_a_prefix_901_ia/public/table_name/.hoodie/hoodie.properties
   24/01/17 03:57:15 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) 
from s3a://bucket_name/table_a_prefix_901_ia/public/table_name
   24/01/17 03:57:15 INFO org.apache.hudi.DefaultSource: Is bootstrapped table 
=> false, tableType is: COPY_ON_WRITE, queryType is: snapshot
   24/01/17 03:57:15 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
upto : Option{val=[==>20240117035712711__clean__REQUESTED__20240117035716000]}
   24/01/17 03:57:15 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name/table_prefix_901_ia/public/table_name/.hoodie/hoodie.properties
   24/01/17 03:57:15 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
upto : Option{val=[==>20240117035712711__clean__REQUESTED__20240117035716000]}
   24/01/17 03:57:15 INFO org.apache.hudi.BaseHoodieTableFileIndex: Refresh 
table production-table_a_prefix_ia.public.table_name, spent: 445 ms
   24/01/17 03:57:43 INFO org.apache.hudi.HoodieFileIndex: No partition 
predicates provided, listing full table (816 partitions)
   24/01/17 03:58:42 INFO 
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Took 0 ms to 
read  0 instants, 0 replaced file groups
   24/01/17 03:58:42 INFO org.apache.hudi.common.util.ClusteringUtils: Found 0 
files in pending clustering operations
   24/01/17 03:58:42 INFO org.apache.hudi.HoodieFileIndex: Total file slices: 
816; candidate file slices after data skipping: 816; skipping percentage 0.0
   24/01/17 03:58:42 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading HoodieTableMetaClient from 
s3a://bucket_name/table_a_prefix_901_in/public/table_name
   24/01/17 03:58:42 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name/table_a_prefix_901_in/public/table_name/.hoodie/hoodie.properties
   24/01/17 03:58:43 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) 
from s3a://bucket_name/table_a_prefix_901_in/public/table_name
   24/01/17 03:58:43 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name/table_a_prefix_901_in/public/table_name/.hoodie/hoodie.properties
   24/01/17 03:58:43 INFO 
org.apache.hudi.common.table.view.FileSystemViewManager: Creating View Manager 
with storage type :MEMORY
   24/01/17 03:58:43 INFO 
org.apache.hudi.common.table.view.FileSystemViewManager: Creating in-memory 
based Table View
   24/01/17 03:58:43 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
upto : Option{val=[==>20240117035839093__clean__INFLIGHT__20240117035843000]}
   24/01/17 03:58:43 INFO org.apache.hudi.IncrementalRelation: Inferring 
schema..
   24/01/17 03:58:49 INFO org.apache.hudi.IncrementalRelation: Falling back to 
full table scan as startInstantArchived: true, endInstantArchived: false
   24/01/17 03:58:49 INFO org.apache.hudi.DataSourceUtils: Getting table path..
   24/01/17 03:58:49 INFO org.apache.hudi.common.util.TablePathUtils: Getting 
table path from path : s3a://bucket_name/table_a_prefix_901_in/public/table_name
   24/01/17 03:58:49 INFO org.apache.hudi.DefaultSource: Obtained hudi table 
path: s3a://bucket_name/table_a_prefix_901_in/public/table_name
   24/01/17 03:58:49 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading HoodieTableMetaClient from 
s3a://bucket_name/table_a_prefix_901_in/public/table_name
   24/01/17 03:58:49 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name/table_a_prefix_901_in/public/table_name/.hoodie/hoodie.properties
   24/01/17 03:58:50 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) 
from s3a://bucket_name/table_a_prefix_901_in/public/table_name
   24/01/17 03:58:50 INFO org.apache.hudi.DefaultSource: Is bootstrapped table 
=> false, tableType is: COPY_ON_WRITE, queryType is: snapshot
   24/01/17 03:58:50 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
upto : Option{val=[==>20240117035848705__commit__REQUESTED__20240117035850000]}
   24/01/17 03:58:50 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name/table_a_prefix_901_in/public/table_name/.hoodie/hoodie.properties
   24/01/17 03:58:50 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
upto : Option{val=[==>20240117035848705__commit__REQUESTED__20240117035850000]}
   24/01/17 03:58:50 INFO org.apache.hudi.BaseHoodieTableFileIndex: Refresh 
table production-table_a_prefix_in_a.public.table_name, spent: 489 ms
   24/01/17 03:59:22 INFO org.apache.hudi.HoodieFileIndex: No partition 
predicates provided, listing full table (985 partitions)
   24/01/17 04:00:37 INFO 
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Took 0 ms to 
read  0 instants, 0 replaced file groups
   24/01/17 04:00:37 INFO org.apache.hudi.common.util.ClusteringUtils: Found 0 
files in pending clustering operations
   24/01/17 04:00:37 INFO org.apache.hudi.HoodieFileIndex: Total file slices: 
985; candidate file slices after data skipping: 985; skipping percentage 0.0
   24/01/17 04:00:37 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading HoodieTableMetaClient from 
s3a://bucket_name/table_a_prefix_901_md/public/table_name
   24/01/17 04:00:38 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name/table_a_prefix_901_md/public/table_name/.hoodie/hoodie.properties
   24/01/17 04:00:38 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) 
from s3a://bucket_name/table_a_prefix_901_md/public/table_name
   24/01/17 04:00:38 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name/table_a_prefix_901_md/public/table_name/.hoodie/hoodie.properties
   24/01/17 04:00:38 INFO 
org.apache.hudi.common.table.view.FileSystemViewManager: Creating View Manager 
with storage type :MEMORY
   24/01/17 04:00:38 INFO 
org.apache.hudi.common.table.view.FileSystemViewManager: Creating in-memory 
based Table View
   24/01/17 04:00:38 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
upto : Option{val=[==>20240117040019208__commit__INFLIGHT__20240117040034000]}
   24/01/17 04:00:38 INFO org.apache.hudi.IncrementalRelation: Inferring 
schema..
   24/01/17 04:00:45 INFO org.apache.hudi.IncrementalRelation: Falling back to 
full table scan as startInstantArchived: true, endInstantArchived: false
   24/01/17 04:00:45 INFO org.apache.hudi.DataSourceUtils: Getting table path..
   24/01/17 04:00:45 INFO org.apache.hudi.common.util.TablePathUtils: Getting 
table path from path : s3a://bucket_name/table_a_prefix_901_md/public/table_name
   24/01/17 04:00:46 INFO org.apache.hudi.DefaultSource: Obtained hudi table 
path: s3a://bucket_name/table_a_prefix_901_md/public/table_name
   24/01/17 04:00:46 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading HoodieTableMetaClient from 
s3a://bucket_name/table_a_prefix_901_md/public/table_name
   24/01/17 04:00:46 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name/table_a_prefix_901_md/public/table_name/.hoodie/hoodie.properties
   24/01/17 04:00:46 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) 
from s3a://bucket_name/table_a_prefix_901_md/public/table_name
   24/01/17 04:00:46 INFO org.apache.hudi.DefaultSource: Is bootstrapped table 
=> false, tableType is: COPY_ON_WRITE, queryType is: snapshot
   24/01/17 04:00:46 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
upto : Option{val=[20240117040041145__clean__COMPLETED__20240117040046000]}
   24/01/17 04:00:46 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name/table_a_prefix_901_md/public/table_name/.hoodie/hoodie.properties
   24/01/17 04:00:47 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
upto : Option{val=[20240117040041145__clean__COMPLETED__20240117040046000]}
   24/01/17 04:00:47 INFO org.apache.hudi.BaseHoodieTableFileIndex: Refresh 
table production-table_a_prefix_md_a.public.table_name, spent: 432 ms
   24/01/17 04:01:02 INFO org.apache.hudi.HoodieFileIndex: No partition 
predicates provided, listing full table (426 partitions)
   24/01/17 04:01:33 INFO 
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Took 0 ms to 
read  0 instants, 0 replaced file groups
   24/01/17 04:01:33 INFO org.apache.hudi.common.util.ClusteringUtils: Found 0 
files in pending clustering operations
   24/01/17 04:01:33 INFO org.apache.hudi.HoodieFileIndex: Total file slices: 
426; candidate file slices after data skipping: 426; skipping percentage 0.0
   24/01/17 04:01:34 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
upto : Option{val=[20240117034200131__rollback__COMPLETED__20240117034204000]}
   24/01/17 04:01:34 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading HoodieTableMetaClient from s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:34 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
   24/01/17 04:01:34 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) 
from s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:34 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading HoodieTableMetaClient from s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:34 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
   24/01/17 04:01:34 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) 
from s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:35 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
upto : Option{val=[20240117034200131__rollback__COMPLETED__20240117034204000]}
   24/01/17 04:01:35 WARN org.apache.hudi.config.HoodieWriteConfig: Embedded 
timeline server is disabled, fallback to use direct marker type for spark
   24/01/17 04:01:35 INFO org.apache.hudi.client.BaseHoodieClient: Embedded 
Timeline Server is disabled. Not starting timeline service
   24/01/17 04:01:35 INFO org.apache.hudi.client.BaseHoodieClient: Embedded 
Timeline Server is disabled. Not starting timeline service
   24/01/17 04:01:35 INFO org.apache.hudi.HoodieSparkSqlWriter$: 
Config.inlineCompactionEnabled ? false
   24/01/17 04:01:35 INFO org.apache.hudi.HoodieSparkSqlWriter$: 
Config.asyncClusteringEnabled ? false
   24/01/17 04:01:35 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading HoodieTableMetaClient from s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:35 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
   24/01/17 04:01:35 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) 
from s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:35 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading Active commit timeline for s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:35 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
upto : Option{val=[20240117034200131__rollback__COMPLETED__20240117034204000]}
   24/01/17 04:01:35 INFO org.apache.hudi.common.util.CleanerUtils: Cleaned 
failed attempts if any
   24/01/17 04:01:35 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading HoodieTableMetaClient from s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:35 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
   24/01/17 04:01:35 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) 
from s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:35 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading Active commit timeline for s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:36 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
upto : Option{val=[20240117034200131__rollback__COMPLETED__20240117034204000]}
   24/01/17 04:01:36 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
   24/01/17 04:01:36 INFO 
org.apache.hudi.common.table.view.FileSystemViewManager: Creating View Manager 
with storage type :MEMORY
   24/01/17 04:01:36 INFO 
org.apache.hudi.common.table.view.FileSystemViewManager: Creating in-memory 
based Table View
   24/01/17 04:01:36 INFO org.apache.hudi.client.BaseHoodieWriteClient: Begin 
rollback of instant 20240117034159138
   24/01/17 04:01:36 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading HoodieTableMetaClient from s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:36 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
   24/01/17 04:01:36 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) 
from s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:36 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading Active commit timeline for s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:36 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
upto : Option{val=[20240117034200131__rollback__COMPLETED__20240117034204000]}
   24/01/17 04:01:36 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
   24/01/17 04:01:36 INFO 
org.apache.hudi.common.table.view.FileSystemViewManager: Creating View Manager 
with storage type :MEMORY
   24/01/17 04:01:36 INFO 
org.apache.hudi.common.table.view.FileSystemViewManager: Creating in-memory 
based Table View
   24/01/17 04:01:36 INFO org.apache.hudi.client.BaseHoodieWriteClient: 
Scheduling Rollback at instant time : 20240117040136218 (exists in active 
timeline: true), with rollback plan: false
   24/01/17 04:01:37 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
upto : 
Option{val=[==>20240117040136218__rollback__REQUESTED__20240117040138000]}
   24/01/17 04:01:37 INFO 
org.apache.hudi.table.action.rollback.BaseRollbackPlanActionExecutor: 
Requesting Rollback with instant time 
[==>20240117040136218__rollback__REQUESTED]
   24/01/17 04:01:37 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
upto : 
Option{val=[==>20240117040136218__rollback__REQUESTED__20240117040138000]}
   24/01/17 04:01:37 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Checking for file 
exists 
?s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/20240117040136218.rollback.requested
   24/01/17 04:01:38 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Create new file for 
toInstant 
?s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/20240117040136218.rollback.inflight
   24/01/17 04:01:38 INFO 
org.apache.hudi.table.action.rollback.CopyOnWriteRollbackActionExecutor: 
Time(in ms) taken to finish rollback 0
   24/01/17 04:01:38 INFO 
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor: Rolled back 
inflight instant 20240117034159138
   24/01/17 04:01:38 INFO 
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor: Index rolled 
back for commits [==>20240117034159138__commit__REQUESTED__20240117034205000]
   24/01/17 04:01:38 INFO org.apache.hudi.table.HoodieTable: Deleting metadata 
table because it is disabled in writer.
   24/01/17 04:01:38 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading HoodieTableMetaClient from s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:38 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
   24/01/17 04:01:38 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) 
from s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:39 INFO org.apache.hudi.common.table.HoodieTableConfig: MDT 
s3a://bucket_name_b/table_b_prefixtable_name partition FILES has been disabled
   24/01/17 04:01:40 INFO 
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor: Deleting 
instant=[==>20240117034159138__commit__REQUESTED__20240117034205000]
   24/01/17 04:01:40 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Deleting instant 
[==>20240117034159138__commit__REQUESTED__20240117034205000]
   24/01/17 04:01:40 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Removed instant 
[==>20240117034159138__commit__REQUESTED__20240117034205000]
   24/01/17 04:01:40 INFO 
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor: Deleted 
pending commit [==>20240117034159138__commit__REQUESTED__20240117034205000]
   24/01/17 04:01:40 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Checking for file 
exists 
?s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/20240117040136218.rollback.inflight
   24/01/17 04:01:40 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Create new file for 
toInstant 
?s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/20240117040136218.rollback
   24/01/17 04:01:40 INFO 
org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor: Rollback of 
Commits [20240117034159138] is complete
   24/01/17 04:01:40 INFO org.apache.hudi.client.BaseHoodieWriteClient: 
Generate a new instant time: 20240117040134974 action: commit
   24/01/17 04:01:40 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Creating a new 
instant [==>20240117040134974__commit__REQUESTED]
   24/01/17 04:01:41 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading HoodieTableMetaClient from s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:41 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
   24/01/17 04:01:41 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) 
from s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:41 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading Active commit timeline for s3a://bucket_name_b/table_b_prefixtable_name/
   24/01/17 04:01:41 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
upto : Option{val=[20240117040136218__rollback__COMPLETED__20240117040141000]}
   24/01/17 04:01:41 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
   24/01/17 04:01:41 INFO 
org.apache.hudi.common.table.view.FileSystemViewManager: Creating View Manager 
with storage type :MEMORY
   24/01/17 04:01:41 INFO 
org.apache.hudi.common.table.view.FileSystemViewManager: Creating in-memory 
based Table View
   24/01/17 04:01:41 INFO org.apache.hudi.async.AsyncCleanerService: The 
HoodieWriteClient is not configured to auto & async clean. Async clean service 
will not start.
   24/01/17 04:01:41 INFO org.apache.hudi.async.AsyncArchiveService: The 
HoodieWriteClient is not configured to auto & async archive. Async archive 
service will not start.
   24/01/17 04:04:56 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 
12.0 in stage 45.0 (TID 14742) (10.253.32.5 executor 2): 
java.io.FileNotFoundException: No such file or directory: 
s3a://bucket_name/table_a_prefix_901_az/public/table_name/2024/01/17/33aa56e7-b70e-4793-8274-7a928480e600-0_0-11014-10644_20240117035105942.parquet
   It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
        at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
        at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
        at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:505)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:508)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   
   24/01/17 04:05:30 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 
12.1 in stage 45.0 (TID 14745) (10.253.32.4 executor 1): 
java.io.FileNotFoundException: No such file or directory: 
s3a://bucket_name/table_a_prefix_901_az/public/table_name/2024/01/17/33aa56e7-b70e-4793-8274-7a928480e600-0_0-11014-10644_20240117035105942.parquet
   It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
        at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
        at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
        at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
        at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:505)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:508)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
