ahmedabu98 commented on issue #32746: URL: https://github.com/apache/beam/issues/32746#issuecomment-2407576605
I believe my pipeline was suffering from too much parallelism -- the work was split across too many concurrent threads within a worker, each one creating its own writers and eating up the worker's memory. High parallelism would also explain the many small files.

There's an `activeIcebergWriters` metric that roughly tracks how many open writers there are. Here's an example of way too many writers for only 1 worker:

<img width="423" alt="image" src="https://github.com/user-attachments/assets/4a464829-f2d5-4547-88df-39427f09ac87">

I tried the following routes and was able to get around it (in order of effectiveness):

- Add `.apply(Redistribute.<Row>arbitrarily().withNumBuckets(<N>))` before the write step, reducing the parallelism to `N` (see the sketch below)
- Use the `--numberOfWorkerHarnessThreads=N` pipeline option, which sets an upper bound on the number of threads per worker
- Use a machine type with more memory (`--workerMachineType=<type>`). The default for Streaming Engine is `n1-standard-2`, so you may go for `n1-standard-4` or `n1-standard-8`. This will be able to handle the larger parallelism, but you may still end up with small files.

P.S. we also have metrics that show the distribution (max, mean, min) of data file byte size (`committedDataFileByteSize`) and record count (`committedDataFileRecordCount`). These are good ways to measure the effectiveness of each option.
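For illustration, here's a minimal sketch of the first option, assuming the input is a `PCollection<Row>` written through the Managed Iceberg sink (the class name, method name, and config map are placeholders -- adapt to however your pipeline writes):

```java
import java.util.Map;
import org.apache.beam.sdk.managed.Managed;
import org.apache.beam.sdk.transforms.Redistribute;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class BoundedIcebergWrite {
  // Redistributes the input into a fixed number of buckets before writing,
  // which caps the number of concurrent Iceberg writers (and their memory use).
  static void write(PCollection<Row> rows, Map<String, Object> icebergConfig) {
    rows
        // At most 4 buckets are processed concurrently, so roughly 4 open
        // writers per destination instead of one per worker thread.
        .apply(Redistribute.<Row>arbitrarily().withNumBuckets(4))
        // Managed Iceberg sink; icebergConfig holds the table/catalog settings.
        .apply(Managed.write(Managed.ICEBERG).withConfig(icebergConfig));
  }
}
```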

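The other two options can also be set programmatically rather than on the command line. A sketch, assuming the Dataflow runner (the values here are illustrative, not recommendations):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WorkerTuning {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    // Upper bound on concurrent work-item threads per worker
    // (equivalent to --numberOfWorkerHarnessThreads=4).
    options.as(DataflowPipelineDebugOptions.class).setNumberOfWorkerHarnessThreads(4);
    // Larger machine for more memory headroom
    // (equivalent to --workerMachineType=n1-standard-4).
    options.setWorkerMachineType("n1-standard-4");
    // ... build and run the pipeline with these options
  }
}
```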