colinre opened a new pull request, #16571: URL: https://github.com/apache/iceberg/pull/16571

### Problem

Spark Structured Streaming row-based micro-batch planning was effectively capped at `Integer.MAX_VALUE` rows. This made very large initial backfills impractical because streams over multi-trillion-row tables could require thousands of micro-batches before reaching the live tail.

### Root Cause

`streaming-max-rows-per-micro-batch` was parsed and stored as an `int`, and planner defaults initialized the effective row limit to `Integer.MAX_VALUE` even when no row limit was configured.

### Change

Parse and propagate the streaming row soft limit as `long`, use `Long.MAX_VALUE` as the unconfigured row-limit sentinel, and preserve complete-file soft-limit behavior. File-count rate limiting is unchanged.

This is a Codex change; I'm generally unfamiliar with this codebase.

### Tests

Added coverage for long-valued option parsing, unconfigured multi-trillion-row planning, explicit long-valued soft limits, planner default unpacking, and existing small row-limit behavior. The structured streaming planner tests cover both sync and async planning through existing parameterization.

### Compatibility

Existing option names, offsets, checkpoint compatibility, file-count limits, and soft-limit semantics are unchanged. Existing values at or below `Integer.MAX_VALUE` keep their behavior.
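
For readers unfamiliar with the issue, the `int` versus `long` limit described under Root Cause and Change can be illustrated with a small standalone sketch. This is not the code in the PR: the class `RowLimitSketch` and the methods `parseMaxRows` and `filesInBatch` are hypothetical names. The soft-limit rule used here (whole files only, and a batch always admits at least one file) is an assumption made for illustration and may differ from the planner's exact rule.

```java
import java.util.List;

public class RowLimitSketch {
    // Sentinel meaning "no row limit configured"; the PR describes using
    // Long.MAX_VALUE for this purpose instead of Integer.MAX_VALUE.
    public static final long UNLIMITED_ROWS = Long.MAX_VALUE;

    // Hypothetical parser: an absent option yields the sentinel, otherwise
    // the value is read as a long so limits above Integer.MAX_VALUE are legal.
    public static long parseMaxRows(String raw) {
        return raw == null ? UNLIMITED_ROWS : Long.parseLong(raw.trim());
    }

    // Assumed soft-limit rule (illustration only): files are never split,
    // and the first file is always admitted so the stream makes progress.
    // Returns how many leading files fit in one micro-batch.
    public static int filesInBatch(List<Long> fileRowCounts, long maxRows) {
        long rows = 0L;
        int files = 0;
        for (long count : fileRowCounts) {
            // Written as a subtraction so the comparison cannot overflow
            // when maxRows is Long.MAX_VALUE.
            if (files > 0 && count > maxRows - rows) {
                break;
            }
            rows += count;
            files++;
        }
        return files;
    }

    public static void main(String[] args) {
        String configured = "3000000000"; // larger than Integer.MAX_VALUE
        try {
            Integer.parseInt(configured);
        } catch (NumberFormatException e) {
            System.out.println("int parsing rejects " + configured);
        }
        System.out.println("long parsing accepts " + parseMaxRows(configured));

        List<Long> files = List.of(2_000_000_000L, 2_000_000_000L, 2_000_000_000L);
        System.out.println(filesInBatch(files, 5_000_000_000L));
        System.out.println(filesInBatch(files, UNLIMITED_ROWS));
    }
}
```

With an `int` limit, the three files in the example could never share a batch, since any single one nearly exhausts `Integer.MAX_VALUE`. A `long` limit lets the caller choose a batch size that matches the table.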
