[ 
https://issues.apache.org/jira/browse/BEAM-12272?focusedWorklogId=601854&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-601854
 ]

ASF GitHub Bot logged work on BEAM-12272:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 25/May/21 17:00
            Start Date: 25/May/21 17:00
    Worklog Time Spent: 10m 
      Work Description: danthev commented on a change in pull request #14723:
URL: https://github.com/apache/beam/pull/14723#discussion_r638246638



##########
File path: sdks/python/apache_beam/io/gcp/datastore/v1new/datastoreio.py
##########
@@ -276,15 +277,33 @@ class _Mutate(PTransform):
   Only idempotent Datastore mutation operations (upsert and delete) are
   supported, as the commits are retried when failures occur.
   """
-  def __init__(self, mutate_fn):
+
+  # Default hint for the expected number of workers in the ramp-up throttling
+  # step for write or delete operations.
+  _DEFAULT_HINT_NUM_WORKERS = 500

Review comment:
   That's a good point: simply keeping or dropping `throttling-msecs` might under- or overshoot the target.
   
   The hint is already configurable, and with the warning messages and with ramp-up running as a separate step, it should be easy for users to recognize the throttling and adjust the value. Even an egregious misconfiguration becomes irrelevant after roughly an hour, since the ramp-up is exponential, so the expected impact is low.  
   It's also possible to turn ramp-up off entirely, though that is obviously not recommended. 
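   For reference, here's a rough sketch of the budget math as I think of it; the 500/50/5 numbers come straight from the Datastore best practices, though the exact shape in the PR may differ slightly:
   
```python
import math

def max_ops_budget(seconds_since_start: float, hint_num_workers: int) -> float:
  """Per-worker ops/sec budget: 500 ops/sec cluster-wide, split across the
  estimated workers, growing by 50% every 5 minutes (the "500/50/5" rule)."""
  base_budget = 500.0 / hint_num_workers
  growth = max(0.0, (seconds_since_start - 300.0) / 300.0)
  return base_budget * math.pow(1.5, growth)
```
   With the default hint of 500 workers, each worker starts at 1 op/sec; after ~60 minutes the growth factor is already 1.5^11 ≈ 86, i.e. tens of thousands of ops/sec cluster-wide, which is why a bad hint washes out within about an hour.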
   
   However, the dynamic worker count in case of autoscaling certainly 
complicates this. The crux of the issue is that we can't get an up-to-date 
worker count from the runner or Beam in general, so we're going by a 
(configurable) rough estimate. I've looked into auto-filling the value from 
`maxNumWorkers` if the runner is Dataflow, but that introduces a circular 
dependency at least in Java.  
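   For illustration, this is roughly how I'd expect users to pass the estimate (parameter names are the ones proposed in this PR, so treat them as tentative):
   
```python
import apache_beam as beam
from apache_beam.io.gcp.datastore.v1new.datastoreio import WriteToDatastore

def write_entities(entities, project):
  # `entities` are v1new Entity objects; the hint should roughly match
  # maxNumWorkers when autoscaling is capped.
  with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(entities)
        | WriteToDatastore(
            project=project,
            throttle_rampup=True,  # the default; False disables ramp-up
            hint_num_workers=50))
```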
   I think all of this should be fine if autoscaling reaches the maximum in a reasonable amount of time, as it did in my experiment, but I hadn't considered the possibility of overscaling when no limit is set. My other experiment, the one that reported `throttling-msecs`, had `maxNumWorkers` as well as the hint set to 50, so the budget wasn't quite as large, but the job spent roughly 2.5 hours on just one worker. Oddly, as I just found out, `throttling-msecs` did start to plateau (as expected), yet Dataflow still didn't scale up until two hours later. Does autoscaling have a long memory on that metric, and could that be fixed by reducing the reporting frequency?  
   Here is what that looked like in the "report throttling-msecs" experiment (note the timestamps): 
   
![image](https://user-images.githubusercontent.com/76013657/119402371-b3ef0d00-bc91-11eb-8ef9-ab7766814131.png)
   I'd be happy to run more experiments or jump on GVC to explain my 
observations if you'd like.
   
   Edit:
   After another day of testing various settings, I have to slightly revise my findings. It seems reporting `throttling-msecs` doesn't actually affect scaling that much; rather, autoscaling with Datastore sometimes simply doesn't get enough signal, with or without ramp-up, likely because CPU usage stayed relatively low. I now have experiments with no ramp-up that stay at one worker, and experiments that report `throttling-msecs` and scale up easily. It appears to be more of a general autoscaling issue, so I would keep `throttling-msecs` in this PR.  
   I also haven't observed any issues when starting with two or more workers; apparently that gives the autoscaler enough signal to behave normally.
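   (For context on what "reporting" means here: it's just a Beam metrics counter whose name Dataflow recognizes as a throttling signal. A minimal sketch, with the namespace and wiring simplified:)
   
```python
import time

from apache_beam.metrics.metric import Metrics

class ThrottleReporter:
  """Illustrative only: surfacing throttled time to Dataflow autoscaling."""
  def __init__(self):
    # Dataflow treats a counter named 'throttling-msecs' as a signal that
    # workers are throttled rather than CPU-bound.
    self._throttling_msecs = Metrics.counter(self.__class__, 'throttling-msecs')

  def throttle(self, delay_secs: float):
    # Sleep off the over-budget time and report it in milliseconds.
    time.sleep(delay_secs)
    self._throttling_msecs.inc(int(delay_secs * 1000))
```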
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 601854)
    Time Spent: 3h  (was: 2h 50m)

> Python - Backport FirestoreIO connector's ramp-up to DatastoreIO connector
> --------------------------------------------------------------------------
>
>                 Key: BEAM-12272
>                 URL: https://issues.apache.org/jira/browse/BEAM-12272
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-py-gcp
>            Reporter: Daniel Thevessen
>            Assignee: Daniel Thevessen
>            Priority: P2
>          Time Spent: 3h
>  Remaining Estimate: 0h
>
> The FirestoreIO connector (BEAM-8376) for the Java SDK is currently out for 
> review.  
> Some of the Firestore connector's features are new relative to the Datastore 
> connector, so the Firestore team would like to port them to the Datastore 
> connector as well. Most notably, the Firestore connector has a gradual ramp-up 
> feature that slowly increases throughput in line with Datastore/Firestore best practices.  
> Some minor adjustments to constants for batch sizing and backoff may also be 
> worth looking at.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
