mr-brobot opened a new pull request, #8104: URL: https://github.com/apache/iceberg/pull/8104
I noticed this in a previous PR and thought I'd try to optimize it. This PR is just a suggestion. Curious for feedback!

### Background

In `project_table`, multiple threads convert an iterable of `FileScanTask` to an iterable of `pa.Table`. When a limit is supplied, synchronization is required because each thread needs to know how many records have been read across all completed futures. Once the limit has been reached, all active and future threads short-circuit and return `None`. We currently wait for all futures to complete even after reaching the desired number of rows.

### Proposal

1. Instead of coordinating on a single mutable integer, use an append-only row count log. [List appends are thread-safe thanks to the GIL](https://docs.python.org/3/faq/library.html#what-kinds-of-global-value-mutation-are-thread-safe), and the global row count can be represented as the sum of the row count log.
2. Keep track of the number of acquired rows and return the result immediately instead of waiting for the remaining futures to return nothing.

The `row_count_log` approach requires some additional memory to store the individual row counts and some additional compute to sum over them, but my assumption is that data files typically produce large row counts while limits are typically relatively small. A rough sketch of this approach is included below.

In Python >= 3.9, we can discard the `row_count_log` entirely and simply set `cancel_futures` in the call to [`Executor.shutdown`](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.shutdown).
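Below is a minimal sketch of what the row count log and early return could look like. It is illustrative only and not the actual `project_table` implementation: `read_task` is a hypothetical callable standing in for whatever turns a `FileScanTask` into a `pa.Table`, and schema handling, empty-scan handling, and error handling are omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional

import pyarrow as pa


def project_table_sketch(
    tasks: Iterable,                     # FileScanTask objects in the real code
    read_task: Callable[..., pa.Table],  # hypothetical: reads one task into a pa.Table
    limit: Optional[int] = None,
) -> pa.Table:
    row_count_log: List[int] = []  # append-only; list.append is thread-safe under the GIL

    def task_to_table(task) -> Optional[pa.Table]:
        # Short-circuit: if rows read so far already cover the limit, skip the read.
        if limit is not None and sum(row_count_log) >= limit:
            return None
        table = read_task(task)
        row_count_log.append(len(table))
        return table

    executor = ThreadPoolExecutor()
    try:
        futures = [executor.submit(task_to_table, task) for task in tasks]
        tables: List[pa.Table] = []
        acquired_rows = 0
        for future in futures:
            table = future.result()
            if table is None:
                continue
            tables.append(table)
            acquired_rows += len(table)
            if limit is not None and acquired_rows >= limit:
                break  # enough rows acquired: return without draining the remaining futures
    finally:
        # Don't block on outstanding futures; the short-circuit above keeps them cheap.
        # On Python >= 3.9 this could instead be
        # executor.shutdown(wait=False, cancel_futures=True), dropping row_count_log.
        executor.shutdown(wait=False)

    result = pa.concat_tables(tables)
    return result.slice(0, limit) if limit is not None else result
```

Threads that have already started a read still finish it, but any task that checks the log after the limit is reached returns without doing work, so the background cost stays small.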

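For the Python >= 3.9 variant, here is a tiny standalone example of the `cancel_futures` behavior; `slow_read` is just a stand-in for reading a data file:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_read(i: int) -> int:
    time.sleep(0.5)  # stand-in for reading a data file
    return i


executor = ThreadPoolExecutor(max_workers=2)
futures = [executor.submit(slow_read, i) for i in range(100)]
first = futures[0].result()  # suppose this alone satisfies the limit
executor.shutdown(wait=False, cancel_futures=True)  # Python >= 3.9 only
cancelled = sum(f.cancelled() for f in futures)
print(f"first result: {first}, cancelled futures: {cancelled}")
```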