Ironlink commented on issue #22160:
URL: https://github.com/apache/beam/issues/22160#issuecomment-1182979277

   @egalpin I currently have:
   * 6.5 TB of source documents
   * 8.5 TB of primary shard data (measured as index disk usage when the index has no replicas)
   * 6 data nodes with 29 GB of machine memory each (3 GB lost to overhead)
   * 2 coordinator nodes with 13 GB of machine memory each (3 GB lost to overhead)
   
   In my pipeline, I have a maximum bulk load size of 45 MB and a maximum of 20 workers, which results in 40 threads on Dataflow. 40 threads times 45 MB gives 1,800 MB of concurrent indexing data. Bulk requests are load balanced between the two coordinator nodes, which split and route the requests to the respective data nodes based on document routing.
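   For reference, the concurrency math above can be written out as a quick sanity check (this is just arithmetic, not pipeline code; the assumption of 2 harness threads per Dataflow worker comes from the numbers quoted above):

   ```python
   # Peak in-flight bulk data produced by the pipeline, assuming
   # 2 threads per Dataflow worker as observed above.
   MAX_WORKERS = 20        # maximum number of Dataflow workers
   THREADS_PER_WORKER = 2  # threads observed per worker
   MAX_BULK_SIZE_MB = 45   # maximum bulk request size configured in the pipeline

   concurrent_threads = MAX_WORKERS * THREADS_PER_WORKER
   peak_indexing_mb = concurrent_threads * MAX_BULK_SIZE_MB

   print(concurrent_threads)  # 40
   print(peak_indexing_mb)    # 1800
   ```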
   
   By default, the ES coordinator nodes had allocated 50% of memory to the application heap and the other 50% to OS file system caching. As noted in the linked blog entry, the default limit for concurrent indexing data is 10% of the application heap. In absolute numbers, this means the default limit for concurrent indexing data was about 600 MB per coordinator node.
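   Spelling out that default limit (assuming the 6 GB coordinator heap mentioned below and the 10% default of `indexing_pressure.memory.limit`):

   ```python
   # Default indexing-pressure budget on one coordinator node.
   heap_gb = 6            # default coordinator heap (~50% of machine memory)
   default_limit_pct = 10  # ES default indexing_pressure.memory.limit, in percent

   limit_mb = heap_gb * 1024 * default_limit_pct // 100
   print(limit_mb)  # 614, i.e. roughly the 600 MB quoted above
   ```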
   
   I came across the linked blog entry as part of my troubleshooting and adjusted my coordinator nodes as a result. Specifically, I raised the heap size from 6 GB to 10 GB, and raised `indexing_pressure.memory.limit` from 10% to 20%, since the coordinators do nothing but split and forward requests. With this, the limit per coordinator becomes 2 GB. I have not yet had time to run my pipeline with this configuration, but it seems likely that it will work without any rejected requests, since each coordinator on its own can fit the full 1,800 MB.
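   The same arithmetic for the tuned configuration, checking the new per-coordinator budget against the pipeline's peak in-flight data (again just a sanity check on the numbers above, not a guarantee that requests won't be rejected):

   ```python
   # Tuned indexing-pressure budget on one coordinator node.
   new_heap_gb = 10    # heap raised from 6 GB to 10 GB
   new_limit_pct = 20  # indexing_pressure.memory.limit raised from 10% to 20%

   per_coordinator_limit_mb = new_heap_gb * 1024 * new_limit_pct // 100
   peak_pipeline_mb = 20 * 2 * 45  # 20 workers x 2 threads x 45 MB bulks

   print(per_coordinator_limit_mb)                      # 2048
   print(per_coordinator_limit_mb >= peak_pipeline_mb)  # True
   ```

   So even if one coordinator drops out of the load balancer, the remaining one should have headroom for the whole pipeline's concurrent bulk data.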


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
