danielcweeks commented on code in PR #15239:
URL: https://github.com/apache/iceberg/pull/15239#discussion_r2891737058
##########
docs/docs/spark-configuration.md:
##########
@@ -220,6 +220,7 @@ spark.read
| stream-from-timestamp | (none) | A timestamp in milliseconds to stream from; if before the oldest known ancestor snapshot, the oldest will be used |
| streaming-max-files-per-micro-batch | INT_MAX | Maximum number of files per microbatch |
| streaming-max-rows-per-micro-batch | INT_MAX | "Soft maximum" number of rows per microbatch; always includes all rows in next unprocessed file, excludes additional files if their inclusion would exceed the soft max limit |
+| streaming-checkpoint-use-hadoop | false | Use Hadoop FileSystem for streaming checkpoint operations instead of the table's FileIO implementation |
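The "soft maximum" row rule in the table above can be sketched in a few lines. This is a minimal illustration of the described behavior only, not Iceberg's actual planner code; the function name `plan_micro_batch` and its arguments are hypothetical:

```python
def plan_micro_batch(file_row_counts, max_rows):
    """Pick files for one micro-batch under a soft row limit.

    The next unprocessed file is always included in full (so a single
    large file can exceed max_rows); further files are excluded once
    adding them would push the batch past the soft limit.
    """
    batch, total = [], 0
    for rows in file_row_counts:
        # Skip additional files that would exceed the soft max,
        # but never skip the first (next unprocessed) file.
        if batch and total + rows > max_rows:
            break
        batch.append(rows)
        total += rows
    return batch
```

For example, with files of 100 rows and a soft max of 70, the single 100-row file is still taken whole; with files of 30/30/30 rows, only the first two fit under the limit.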
Review Comment:
If this is purely a Spark feature (and it looks like it is), then I would agree that we should just remove any Iceberg/FileIO code path that would complicate how we think about pathing that's outside the bounds of the Iceberg table and integration.
So if we're going directly through the checkpoint file manager and using the Hadoop FileSystem natively, that seems like the right "Spark" way to do it. This also avoids adding more Spark-specific configs to work around the interaction between Iceberg/Spark.
+1 (assuming I understood that correctly).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]