ahmedabu98 commented on issue #32746: URL: https://github.com/apache/beam/issues/32746#issuecomment-2407576605
I believe my pipeline was suffering from too much parallelism -- the work was split across too many concurrent threads within a worker, each one creating its own writers and eating up the worker's memory. High parallelism would also explain the many small files.

There's an `activeIcebergWriters` metric that roughly tracks how many open writers there are. Here's an example of way too many writers for only 1 worker:

<img width="423" alt="image" src="https://github.com/user-attachments/assets/4a464829-f2d5-4547-88df-39427f09ac87">

I tried the following routes and was able to get around it (in order of effectiveness):

- Add `.apply(Redistribute.<Row>arbitrarily().withNumBuckets(<N>))` before the write step, reducing the parallelism to `N` (see the sketch below)
- Use the `--numberOfWorkerHarnessThreads=N` pipeline option, which sets an upper bound on the number of threads per worker
- Use a machine type with more memory (`--workerMachineType=<type>`). The default for Streaming Engine is `n1-standard-2`, so you may go for `n1-standard-4` or `n1-standard-8`. This will be able to handle the larger parallelism, but you may still end up with small files.

P.S. we also have metrics that show the distribution (max, mean, min) of data file byte size (`committedDataFileByteSize`) and record count (`committedDataFileRecordCount`). These are good ways to measure the effectiveness of each option.
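For illustration, here's a minimal sketch of the first option, assuming the input is a `PCollection<Row>` written through the Managed Iceberg sink (the class name, method name, and config map are placeholders -- adapt to however your pipeline writes):

```java
import java.util.Map;
import org.apache.beam.sdk.managed.Managed;
import org.apache.beam.sdk.transforms.Redistribute;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class BoundedIcebergWrite {
  // Redistributes the input into a fixed number of buckets before writing,
  // which caps the number of concurrent Iceberg writers (and their memory use).
  static void write(PCollection<Row> rows, Map<String, Object> icebergConfig) {
    rows
        // At most 4 buckets are processed concurrently, so roughly 4 open
        // writers per destination instead of one per worker thread.
        .apply(Redistribute.<Row>arbitrarily().withNumBuckets(4))
        // Managed Iceberg sink; icebergConfig holds the table/catalog settings.
        .apply(Managed.write(Managed.ICEBERG).withConfig(icebergConfig));
  }
}
```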

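The other two options can also be set programmatically rather than on the command line. A sketch, assuming the Dataflow runner (the values here are illustrative, not recommendations):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WorkerTuning {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    // Upper bound on concurrent work-item threads per worker
    // (equivalent to --numberOfWorkerHarnessThreads=4).
    options.as(DataflowPipelineDebugOptions.class).setNumberOfWorkerHarnessThreads(4);
    // Larger machine for more memory headroom
    // (equivalent to --workerMachineType=n1-standard-4).
    options.setWorkerMachineType("n1-standard-4");
    // ... build and run the pipeline with these options
  }
}
```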