stevenzwu edited a comment on issue #1383: URL: https://github.com/apache/iceberg/issues/1383#issuecomment-707225772
I was mainly discussing this in the context of the FLIP-27 source. Regardless of how we implement the enumeration, there are two pieces of information that the enumerator needs to track and checkpoint:

1. the last snapshot for which enumeration/planning is done
2. the pending/unprocessed splits from previous discoveries/plannings

I was mainly concerned about the state size of the latter. That is what I meant by throttling the eagerness of split planning. I was thinking about using `TableScan.useSnapshot(long snapshotId)` so that we can control how many snapshots' worth of splits we plan into state.

Here are some additional benefits of enumerating splits snapshot by snapshot.

* We can track and assign splits snapshot by snapshot, in the same order as they were committed.
* We can publish metrics like the number of pending snapshots, lag (current time - oldest timestamp from an uncompleted snapshot), etc.

@openinx note that this is not keyed state, where state is distributed among parallel tasks. Here, 8 GB of operator state can be problematic as enumerator state. I vaguely remember that RocksDB can't handle a list larger than 1 GB, and the bigger the list, the slower it gets. Also, if we do `planTasks` (vs. `planFiles`), the number of splits can be a few times bigger. I can definitely buy the point of starting with something simple and optimizing it later. It will be an internal change to the enumerator, so it has no user impact.

@JingsongLi Yeah, the key thing is how the coordinator/enumerator controls how the splits are generated. I was saying that we may need some control/throttling there to avoid eagerly enumerating all pending snapshots, so that the checkpointed split list stays manageable/capped. I thought the idea of `TableScan.appendsBetween` was to run `planFiles` or `planTasks` between the last planned snapshot and the latest table snapshot. That is what I referred to earlier as eager discovery/planning of all unconsumed splits.
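To make the throttling idea concrete, here is a minimal sketch of an enumerator that plans snapshot by snapshot and caps how many planned snapshots it holds at once. Everything in it is an assumption for illustration: `ThrottledSnapshotEnumerator`, `Snapshot`, `maxPlannedSnapshots`, and the `planner` function are hypothetical names, not Iceberg or Flink APIs, and splits are plain strings. In a real source the planner would presumably wrap something like `TableScan.useSnapshot(snapshotId)` followed by `planFiles` or `planTasks`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.LongFunction;

/** Hypothetical sketch only; none of these types exist in Iceberg or Flink. */
class ThrottledSnapshotEnumerator {
    /** Minimal stand-in for a committed table snapshot. */
    static final class Snapshot {
        final long id;
        final long timestampMillis;
        Snapshot(long id, long timestampMillis) {
            this.id = id;
            this.timestampMillis = timestampMillis;
        }
    }

    /** Splits of one snapshot, kept together so they are assigned in commit order. */
    private static final class PlannedSnapshot {
        final Snapshot snapshot;
        final Deque<String> pendingSplits;
        PlannedSnapshot(Snapshot snapshot, List<String> splits) {
            this.snapshot = snapshot;
            this.pendingSplits = new ArrayDeque<>(splits);
        }
    }

    private final int maxPlannedSnapshots;              // cap on planned snapshots held in state
    private final LongFunction<List<String>> planner;   // assumed: plans the splits of one snapshot
    // The two pieces of state that would be checkpointed:
    private Long lastPlannedSnapshotId = null;
    private final Deque<PlannedSnapshot> planned = new ArrayDeque<>();

    ThrottledSnapshotEnumerator(int maxPlannedSnapshots, LongFunction<List<String>> planner) {
        this.maxPlannedSnapshots = maxPlannedSnapshots;
        this.planner = planner;
    }

    /**
     * Plans not-yet-planned snapshots, oldest first, until the cap is reached.
     * Assumes the list is in commit order and still contains the last planned snapshot.
     * Returns the number of snapshots planned by this call.
     */
    int discover(List<Snapshot> committedInOrder) {
        boolean pastLastPlanned = (lastPlannedSnapshotId == null);
        int newlyPlanned = 0;
        for (Snapshot s : committedInOrder) {
            if (!pastLastPlanned) {
                // Skip everything up to and including the last planned snapshot.
                pastLastPlanned = (s.id == lastPlannedSnapshotId);
                continue;
            }
            if (planned.size() >= maxPlannedSnapshots) {
                break; // throttle: leave the remaining snapshots for a later discovery
            }
            planned.addLast(new PlannedSnapshot(s, planner.apply(s.id)));
            lastPlannedSnapshotId = s.id;
            newlyPlanned++;
        }
        return newlyPlanned;
    }

    /** Returns the next split from the oldest planned snapshot, or null if none is pending. */
    String nextSplit() {
        while (!planned.isEmpty()) {
            PlannedSnapshot head = planned.peekFirst();
            String split = head.pendingSplits.pollFirst();
            if (split == null) {
                planned.removeFirst(); // snapshot produced no splits
                continue;
            }
            if (head.pendingSplits.isEmpty()) {
                planned.removeFirst(); // all splits of this snapshot have been handed out
            }
            return split;
        }
        return null;
    }

    int pendingSnapshotCount() {
        return planned.size();
    }

    Long lastPlannedSnapshotId() {
        return lastPlannedSnapshotId;
    }

    /** Lag relative to the oldest planned snapshot with unassigned splits; 0 if none. */
    long lagMillis(long nowMillis) {
        return planned.isEmpty() ? 0L : nowMillis - planned.peekFirst().snapshot.timestampMillis;
    }
}
```

With a cap of 2 and three committed snapshots, the first `discover` call plans only the two oldest; the third is picked up by a later call once the first has been drained. The sketch measures lag only over planned snapshots and treats a snapshot as done once its splits are handed out, which is a simplification of "uncompleted".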
