danielcweeks commented on code in PR #15239:
URL: https://github.com/apache/iceberg/pull/15239#discussion_r2891737058
##########
docs/docs/spark-configuration.md:
##########
@@ -220,6 +220,7 @@ spark.read
| stream-from-timestamp | (none) | A timestamp in milliseconds to stream from; if before the oldest known ancestor snapshot, the oldest will be used |
| streaming-max-files-per-micro-batch | INT_MAX | Maximum number of files per microbatch |
| streaming-max-rows-per-micro-batch | INT_MAX | "Soft maximum" number of rows per microbatch; always includes all rows in next unprocessed file, excludes additional files if their inclusion would exceed the soft max limit |
+| streaming-checkpoint-use-hadoop | false | Use Hadoop FileSystem for streaming checkpoint operations instead of the table's FileIO implementation |
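The "soft maximum" row rule in the table above can be sketched in a few lines. This is a minimal illustration of the described behavior only, not Iceberg's actual planner code; the function name `plan_micro_batch` and its arguments are hypothetical:

```python
def plan_micro_batch(file_row_counts, max_rows):
    """Pick files for one micro-batch under a soft row limit.

    The next unprocessed file is always included in full (so a single
    large file can exceed max_rows); further files are excluded once
    adding them would push the batch past the soft limit.
    """
    batch, total = [], 0
    for rows in file_row_counts:
        # Skip additional files that would exceed the soft max,
        # but never skip the first (next unprocessed) file.
        if batch and total + rows > max_rows:
            break
        batch.append(rows)
        total += rows
    return batch
```

For example, with files of 100 rows and a soft max of 70, the single 100-row file is still taken whole; with files of 30/30/30 rows, only the first two fit under the limit.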
Review Comment:
If this is purely a Spark feature (and it looks like it is), then I would agree that we should just remove any Iceberg/FileIO code path that would complicate how we think about pathing that's outside the bounds of the Iceberg table and integration.
So if we're going directly through the checkpoint file manager and using the Hadoop FileSystem natively, that seems like the right "Spark" way to do it. This also avoids adding more Spark-specific configs to work around the interaction between Iceberg/Spark.
+1 (assuming I understood that correctly).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]