nehsyc commented on a change in pull request #14723:
URL: https://github.com/apache/beam/pull/14723#discussion_r641089715



##########
File path: sdks/python/apache_beam/io/gcp/datastore/v1new/datastoreio.py
##########
@@ -276,15 +277,33 @@ class _Mutate(PTransform):
   Only idempotent Datastore mutation operations (upsert and delete) are
   supported, as the commits are retried when failures occur.
   """
-  def __init__(self, mutate_fn):
+
+  # Default hint for the expected number of workers in the ramp-up throttling
+  # step for write or delete operations.
+  _DEFAULT_HINT_NUM_WORKERS = 500

Review comment:
       Thanks for bumping this! I indeed missed the edits above.
   
    It also appears weird to me that Dataflow didn't upscale for two hours
despite the plateaued throttling time. My best guesses are that the workers
didn't report much progress, causing Dataflow to stick with the initial
scaling of the job, or that the low CPU utilization prevented Dataflow from
upscaling.
   
    My previous concern about reporting throttling signals was mainly the
slow ramp-up of the number of workers, especially when we start with a very
small budget for each worker. If that is not as concerning as I imagined,
then I vote for reporting `throttling-msecs`.
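    
    For reference, here is a minimal sketch of how a `DoFn` could report
throttled time through the Beam metrics API. Whether the Python runner
harness actually recognizes a counter named `throttling-msecs` as the
throttling signal is exactly what I am unsure about, and the `DoFn` and the
fixed delay below are hypothetical:
    
    ```python
    import time
    
    import apache_beam as beam
    from apache_beam.metrics.metric import Metrics
    
    
    class ThrottledWriteFn(beam.DoFn):
      """Hypothetical DoFn that sleeps to enforce a rate and reports it."""
      def __init__(self, delay_secs=0.05):
        self._delay_secs = delay_secs  # placeholder throttle delay
        self._throttled_msecs = Metrics.counter(
            self.__class__, 'throttling-msecs')
    
      def process(self, element):
        if self._delay_secs > 0:
          time.sleep(self._delay_secs)
          # Record the time spent throttled so the runner can distinguish
          # throttling from a genuine lack of progress when autoscaling.
          self._throttled_msecs.inc(int(self._delay_secs * 1000))
        yield element
    ```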
   
    Most Dataflow pipelines start with a small number of workers (<10), so
with the default hint set to 500 we almost always over-constrain the write
rate by a lot at the beginning. Maybe consider lowering it.
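    
    To make that concrete, here is a rough sketch of the arithmetic (my own
simplification, not the actual throttler code), assuming the ramp-up
follows Datastore's "500/50/5" guideline of a 500 ops/s base budget that
grows by 50% every 5 minutes, split evenly across the hinted number of
workers:
    
    ```python
    def per_worker_budget_ops_per_sec(
        minutes_elapsed, hint_num_workers, base_budget=500):
      """Approximate per-worker write budget under the 500/50/5 rule."""
      growth = 1.5**(minutes_elapsed // 5)
      return base_budget * growth / hint_num_workers
    
    # With the default hint of 500, each worker starts with ~1 op/s, so a
    # pipeline that actually runs 8 workers is throttled to ~8 ops/s in
    # aggregate, far below the 500 ops/s the rule permits:
    print(per_worker_budget_ops_per_sec(0, hint_num_workers=500))      # 1.0
    print(8 * per_worker_budget_ops_per_sec(0, hint_num_workers=500))  # 8.0
    # A hint close to the real worker count restores the full budget:
    print(8 * per_worker_budget_ops_per_sec(0, hint_num_workers=8))    # 500.0
    ```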
   
   I am not familiar with how counters work in Python. Maybe @chamikaramj can 
comment more on that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

