remeajayi2022 opened a new issue, #10519: URL: https://github.com/apache/hudi/issues/10519
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? Yes
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

During a Spark structured streaming read from a Hudi table, the Spark job fails while trying to access cleaned parquet files on S3. To address this, I set `hoodie.datasource.read.incr.fallback.fulltablescan.enable` to true. After setting this configuration, the Spark job runs for ~20 minutes and then fails again with a FileNotFoundException. The job is designed to read from a Hudi Table A and write to a Hudi Table B.

**To Reproduce**

1. Create a Deltastreamer job that reads from a Kafka topic and writes to Table A with the following configurations:

```
--Cleaning Policy
"hoodie.cleaner.policy": "KEEP_LATEST_COMMITS" #default
"hoodie.cleaner.commits.retained": "10" #default
--Archiving Policy
"hoodie.keep.min.commits": "100"
"hoodie.keep.max.commits": "130"
```

2. Create a PySpark structured streaming job that reads from Table A with the following configurations:

```
"hoodie.enable.data.skipping": "true"
"hoodie.datasource.read.incr.fallback.fulltablescan.enable": "true"
"spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
"spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
"hoodie.metadata.enable": "true"
"hoodie.datasource.query.type": "incremental"
"hoodie.datasource.read.begin.instanttime": "20231106000000"
```

3. Let the job in step 2 fall behind (e.g. by pausing it) long enough to reach commits whose files have been cleaned. This is to confirm that the `fulltablescan` configuration addresses the `PathNotFound` error.

**Expected behavior**

The PySpark job should successfully read from Table A and write to Table B. There should be parquet files present in Table B.

**Environment Description**

* Hudi version : 0.14.0
* Spark version : 3.1.1
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No

**Additional context**

No data is written by the job while it is doing the full table scan for those ~20 minutes before the failure. Is this the expected behaviour? Please note that the Hudi JAR I'm using is the stable 0.14.0 release built with the changes to `HoodieAsyncService.java`, `WriteStatus.java` and `TimestampBasedAvroKeyGenerator.java` shown [here](https://github.com/apache/hudi/compare/master...sydneyhoran:hudi:master).

**Stacktrace**

```
24/01/17 03:57:07 INFO org.apache.hudi.IncrementalRelation: Inferring schema..
24/01/17 03:57:14 INFO org.apache.hudi.IncrementalRelation: Falling back to full table scan as startInstantArchived: true, endInstantArchived: false
24/01/17 03:57:14 INFO org.apache.hudi.DataSourceUtils: Getting table path..
24/01/17 03:57:14 INFO org.apache.hudi.common.util.TablePathUtils: Getting table path from path : s3a://bucket_name/table_a_prefix_901_ia/public/table_name
24/01/17 03:57:14 INFO org.apache.hudi.DefaultSource: Obtained hudi table path: s3a://bucket_name/table_a_prefix_901_ia/public/table_name
24/01/17 03:57:14 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://bucket_name/table_a_prefix_901_ia/public/table_name
24/01/17 03:57:15 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name/table_a_prefix_901_ia/public/table_name/.hoodie/hoodie.properties
24/01/17 03:57:15 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://bucket_name/table_a_prefix_901_ia/public/table_name
24/01/17 03:57:15 INFO org.apache.hudi.DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
24/01/17 03:57:15 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[==>20240117035712711__clean__REQUESTED__20240117035716000]}
24/01/17 03:57:15 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name/table_prefix_901_ia/public/table_name/.hoodie/hoodie.properties
24/01/17 03:57:15 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[==>20240117035712711__clean__REQUESTED__20240117035716000]}
24/01/17 03:57:15 INFO org.apache.hudi.BaseHoodieTableFileIndex: Refresh table production-table_a_prefix_ia.public.table_name, spent: 445 ms
24/01/17 03:57:43 INFO org.apache.hudi.HoodieFileIndex: No partition predicates provided, listing full table (816 partitions)
24/01/17 03:58:42 INFO org.apache.hudi.common.table.view.AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
24/01/17 03:58:42 INFO org.apache.hudi.common.util.ClusteringUtils: Found 0 files in pending clustering operations
24/01/17 03:58:42 INFO org.apache.hudi.HoodieFileIndex: Total file slices: 816; candidate file slices after data skipping: 816; skipping percentage 0.0
24/01/17 03:58:42 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://bucket_name/table_a_prefix_901_in/public/table_name
24/01/17 03:58:42 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name/table_a_prefix_901_in/public/table_name/.hoodie/hoodie.properties
24/01/17 03:58:43 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://bucket_name/table_a_prefix_901_in/public/table_name
24/01/17 03:58:43 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name/table_a_prefix_901_in/public/table_name/.hoodie/hoodie.properties
24/01/17 03:58:43 INFO org.apache.hudi.common.table.view.FileSystemViewManager: Creating View Manager with storage type :MEMORY
24/01/17 03:58:43 INFO org.apache.hudi.common.table.view.FileSystemViewManager: Creating in-memory based Table View
24/01/17 03:58:43 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[==>20240117035839093__clean__INFLIGHT__20240117035843000]}
24/01/17 03:58:43 INFO org.apache.hudi.IncrementalRelation: Inferring schema..
24/01/17 03:58:49 INFO org.apache.hudi.IncrementalRelation: Falling back to full table scan as startInstantArchived: true, endInstantArchived: false
24/01/17 03:58:49 INFO org.apache.hudi.DataSourceUtils: Getting table path..
24/01/17 03:58:49 INFO org.apache.hudi.common.util.TablePathUtils: Getting table path from path : s3a://bucket_name/table_a_prefix_901_in/public/table_name
24/01/17 03:58:49 INFO org.apache.hudi.DefaultSource: Obtained hudi table path: s3a://bucket_name/table_a_prefix_901_in/public/table_name
24/01/17 03:58:49 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://bucket_name/table_a_prefix_901_in/public/table_name
24/01/17 03:58:49 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name/table_a_prefix_901_in/public/table_name/.hoodie/hoodie.properties
24/01/17 03:58:50 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://bucket_name/table_a_prefix_901_in/public/table_name
24/01/17 03:58:50 INFO org.apache.hudi.DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
24/01/17 03:58:50 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[==>20240117035848705__commit__REQUESTED__20240117035850000]}
24/01/17 03:58:50 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name/table_a_prefix_901_in/public/table_name/.hoodie/hoodie.properties
24/01/17 03:58:50 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[==>20240117035848705__commit__REQUESTED__20240117035850000]}
24/01/17 03:58:50 INFO org.apache.hudi.BaseHoodieTableFileIndex: Refresh table production-table_a_prefix_in_a.public.table_name, spent: 489 ms
24/01/17 03:59:22 INFO org.apache.hudi.HoodieFileIndex: No partition predicates provided, listing full table (985 partitions)
24/01/17 04:00:37 INFO org.apache.hudi.common.table.view.AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
24/01/17 04:00:37 INFO org.apache.hudi.common.util.ClusteringUtils: Found 0 files in pending clustering operations
24/01/17 04:00:37 INFO org.apache.hudi.HoodieFileIndex: Total file slices: 985; candidate file slices after data skipping: 985; skipping percentage 0.0
24/01/17 04:00:37 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://bucket_name/table_a_prefix_901_md/public/table_name
24/01/17 04:00:38 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name/table_a_prefix_901_md/public/table_name/.hoodie/hoodie.properties
24/01/17 04:00:38 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://bucket_name/table_a_prefix_901_md/public/table_name
24/01/17 04:00:38 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name/table_a_prefix_901_md/public/table_name/.hoodie/hoodie.properties
24/01/17 04:00:38 INFO org.apache.hudi.common.table.view.FileSystemViewManager: Creating View Manager with storage type :MEMORY
24/01/17 04:00:38 INFO org.apache.hudi.common.table.view.FileSystemViewManager: Creating in-memory based Table View
24/01/17 04:00:38 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[==>20240117040019208__commit__INFLIGHT__20240117040034000]}
24/01/17 04:00:38 INFO org.apache.hudi.IncrementalRelation: Inferring schema..
24/01/17 04:00:45 INFO org.apache.hudi.IncrementalRelation: Falling back to full table scan as startInstantArchived: true, endInstantArchived: false
24/01/17 04:00:45 INFO org.apache.hudi.DataSourceUtils: Getting table path..
24/01/17 04:00:45 INFO org.apache.hudi.common.util.TablePathUtils: Getting table path from path : s3a://bucket_name/table_a_prefix_901_md/public/table_name
24/01/17 04:00:46 INFO org.apache.hudi.DefaultSource: Obtained hudi table path: s3a://bucket_name/table_a_prefix_901_md/public/table_name
24/01/17 04:00:46 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://bucket_name/table_a_prefix_901_md/public/table_name
24/01/17 04:00:46 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name/table_a_prefix_901_md/public/table_name/.hoodie/hoodie.properties
24/01/17 04:00:46 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://bucket_name/table_a_prefix_901_md/public/table_name
24/01/17 04:00:46 INFO org.apache.hudi.DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
24/01/17 04:00:46 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[20240117040041145__clean__COMPLETED__20240117040046000]}
24/01/17 04:00:46 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name/table_a_prefix_901_md/public/table_name/.hoodie/hoodie.properties
24/01/17 04:00:47 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[20240117040041145__clean__COMPLETED__20240117040046000]}
24/01/17 04:00:47 INFO org.apache.hudi.BaseHoodieTableFileIndex: Refresh table production-table_a_prefix_md_a.public.table_name, spent: 432 ms
24/01/17 04:01:02 INFO org.apache.hudi.HoodieFileIndex: No partition predicates provided, listing full table (426 partitions)
24/01/17 04:01:33 INFO org.apache.hudi.common.table.view.AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
24/01/17 04:01:33 INFO org.apache.hudi.common.util.ClusteringUtils: Found 0 files in pending clustering operations
24/01/17 04:01:33 INFO org.apache.hudi.HoodieFileIndex: Total file slices: 426; candidate file slices after data skipping: 426; skipping percentage 0.0
24/01/17 04:01:34 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[20240117034200131__rollback__COMPLETED__20240117034204000]}
24/01/17 04:01:34 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:34 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
24/01/17 04:01:34 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:34 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:34 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
24/01/17 04:01:34 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:35 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[20240117034200131__rollback__COMPLETED__20240117034204000]}
24/01/17 04:01:35 WARN org.apache.hudi.config.HoodieWriteConfig: Embedded timeline server is disabled, fallback to use direct marker type for spark
24/01/17 04:01:35 INFO org.apache.hudi.client.BaseHoodieClient: Embedded Timeline Server is disabled. Not starting timeline service
24/01/17 04:01:35 INFO org.apache.hudi.client.BaseHoodieClient: Embedded Timeline Server is disabled. Not starting timeline service
24/01/17 04:01:35 INFO org.apache.hudi.HoodieSparkSqlWriter$: Config.inlineCompactionEnabled ? false
24/01/17 04:01:35 INFO org.apache.hudi.HoodieSparkSqlWriter$: Config.asyncClusteringEnabled ? false
24/01/17 04:01:35 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:35 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
24/01/17 04:01:35 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:35 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading Active commit timeline for s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:35 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[20240117034200131__rollback__COMPLETED__20240117034204000]}
24/01/17 04:01:35 INFO org.apache.hudi.common.util.CleanerUtils: Cleaned failed attempts if any
24/01/17 04:01:35 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:35 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
24/01/17 04:01:35 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:35 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading Active commit timeline for s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:36 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[20240117034200131__rollback__COMPLETED__20240117034204000]}
24/01/17 04:01:36 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
24/01/17 04:01:36 INFO org.apache.hudi.common.table.view.FileSystemViewManager: Creating View Manager with storage type :MEMORY
24/01/17 04:01:36 INFO org.apache.hudi.common.table.view.FileSystemViewManager: Creating in-memory based Table View
24/01/17 04:01:36 INFO org.apache.hudi.client.BaseHoodieWriteClient: Begin rollback of instant 20240117034159138
24/01/17 04:01:36 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:36 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
24/01/17 04:01:36 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:36 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading Active commit timeline for s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:36 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[20240117034200131__rollback__COMPLETED__20240117034204000]}
24/01/17 04:01:36 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
24/01/17 04:01:36 INFO org.apache.hudi.common.table.view.FileSystemViewManager: Creating View Manager with storage type :MEMORY
24/01/17 04:01:36 INFO org.apache.hudi.common.table.view.FileSystemViewManager: Creating in-memory based Table View
24/01/17 04:01:36 INFO org.apache.hudi.client.BaseHoodieWriteClient: Scheduling Rollback at instant time : 20240117040136218 (exists in active timeline: true), with rollback plan: false
24/01/17 04:01:37 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[==>20240117040136218__rollback__REQUESTED__20240117040138000]}
24/01/17 04:01:37 INFO org.apache.hudi.table.action.rollback.BaseRollbackPlanActionExecutor: Requesting Rollback with instant time [==>20240117040136218__rollback__REQUESTED]
24/01/17 04:01:37 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[==>20240117040136218__rollback__REQUESTED__20240117040138000]}
24/01/17 04:01:37 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Checking for file exists ?s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/20240117040136218.rollback.requested
24/01/17 04:01:38 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Create new file for toInstant ?s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/20240117040136218.rollback.inflight
24/01/17 04:01:38 INFO org.apache.hudi.table.action.rollback.CopyOnWriteRollbackActionExecutor: Time(in ms) taken to finish rollback 0
24/01/17 04:01:38 INFO org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor: Rolled back inflight instant 20240117034159138
24/01/17 04:01:38 INFO org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor: Index rolled back for commits [==>20240117034159138__commit__REQUESTED__20240117034205000]
24/01/17 04:01:38 INFO org.apache.hudi.table.HoodieTable: Deleting metadata table because it is disabled in writer.
24/01/17 04:01:38 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:38 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
24/01/17 04:01:38 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:39 INFO org.apache.hudi.common.table.HoodieTableConfig: MDT s3a://bucket_name_b/table_b_prefixtable_name partition FILES has been disabled
24/01/17 04:01:40 INFO org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor: Deleting instant=[==>20240117034159138__commit__REQUESTED__20240117034205000]
24/01/17 04:01:40 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Deleting instant [==>20240117034159138__commit__REQUESTED__20240117034205000]
24/01/17 04:01:40 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Removed instant [==>20240117034159138__commit__REQUESTED__20240117034205000]
24/01/17 04:01:40 INFO org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor: Deleted pending commit [==>20240117034159138__commit__REQUESTED__20240117034205000]
24/01/17 04:01:40 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Checking for file exists ?s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/20240117040136218.rollback.inflight
24/01/17 04:01:40 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Create new file for toInstant ?s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/20240117040136218.rollback
24/01/17 04:01:40 INFO org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor: Rollback of Commits [20240117034159138] is complete
24/01/17 04:01:40 INFO org.apache.hudi.client.BaseHoodieWriteClient: Generate a new instant time: 20240117040134974 action: commit
24/01/17 04:01:40 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Creating a new instant [==>20240117040134974__commit__REQUESTED]
24/01/17 04:01:41 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:41 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
24/01/17 04:01:41 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:41 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading Active commit timeline for s3a://bucket_name_b/table_b_prefixtable_name/
24/01/17 04:01:41 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[20240117040136218__rollback__COMPLETED__20240117040141000]}
24/01/17 04:01:41 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from s3a://bucket_name_b/table_b_prefixtable_name/.hoodie/hoodie.properties
24/01/17 04:01:41 INFO org.apache.hudi.common.table.view.FileSystemViewManager: Creating View Manager with storage type :MEMORY
24/01/17 04:01:41 INFO org.apache.hudi.common.table.view.FileSystemViewManager: Creating in-memory based Table View
24/01/17 04:01:41 INFO org.apache.hudi.async.AsyncCleanerService: The HoodieWriteClient is not configured to auto & async clean. Async clean service will not start.
24/01/17 04:01:41 INFO org.apache.hudi.async.AsyncArchiveService: The HoodieWriteClient is not configured to auto & async archive. Async archive service will not start.
24/01/17 04:04:56 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 12.0 in stage 45.0 (TID 14742) (10.253.32.5 executor 2): java.io.FileNotFoundException: No such file or directory: s3a://bucket_name/table_a_prefix_901_az/public/table_name/2024/01/17/33aa56e7-b70e-4793-8274-7a928480e600-0_0-11014-10644_20240117035105942.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:505)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:508)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
24/01/17 04:05:30 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 12.1 in stage 45.0 (TID 14745) (10.253.32.4 executor 1): java.io.FileNotFoundException: No such file or directory: s3a://bucket_name/table_a_prefix_901_az/public/table_name/2024/01/17/33aa56e7-b70e-4793-8274-7a928480e600-0_0-11014-10644_20240117035105942.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:505)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:508)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
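As an addendum, the Hudi reader options from step 2 can be collected into a plain options dict. This is a minimal sketch, not the failing job itself: the `spark.readStream` usage in the comment and the table path are illustrative placeholders, and the S3A settings are omitted since those belong in the Spark/Hadoop conf rather than the Hudi read options.

```python
# Reader options mirroring step 2 above (values copied from the issue).
hudi_read_options = {
    "hoodie.enable.data.skipping": "true",
    "hoodie.datasource.read.incr.fallback.fulltablescan.enable": "true",
    "hoodie.metadata.enable": "true",
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20231106000000",
}

# Illustrative usage (requires a SparkSession with the Hudi bundle on the
# classpath; the load path below is a placeholder, not the real table):
#   df = (spark.readStream.format("hudi")
#         .options(**hudi_read_options)
#         .load("s3a://<bucket>/<table_path>"))
```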

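For context on why the reader hits cleaned files: with `hoodie.cleaner.commits.retained` set to 10, the cleaner deletes file versions referenced only by commits older than the last 10 retained ones, so a streaming reader whose begin instant has fallen behind that boundary requests parquet files that no longer exist. A toy simulation of that boundary check (the instant times and helper function are invented for illustration and are not Hudi APIs):

```python
# Toy model of KEEP_LATEST_COMMITS cleaning; not Hudi code, just the idea.
CLEANER_COMMITS_RETAINED = 10  # hoodie.cleaner.commits.retained from step 1

def earliest_retained_commit(completed_commits, retained=CLEANER_COMMITS_RETAINED):
    """Instants older than this may have had their file versions cleaned."""
    if len(completed_commits) <= retained:
        return completed_commits[0]
    return completed_commits[-retained]

# 30 synthetic commit instants, oldest first.
commits = [f"202401170000{i:02d}" for i in range(30)]

boundary = earliest_retained_commit(commits)  # 10th-newest commit instant
reader_begin_instant = commits[5]             # the reader fell far behind

# A begin instant older than the boundary risks a FileNotFoundException on
# cleaned parquet files, which is why the full-table-scan fallback exists.
reads_cleaned_files = reader_begin_instant < boundary  # True
```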