[jira] [Work logged] (GOBBLIN-1783) Initialize scheduler with batch gets instead of individual get per flow

ASF GitHub Bot (Jira) Mon, 13 Feb 2023 11:30:59 -0800


     [ 
https://issues.apache.org/jira/browse/GOBBLIN-1783?focusedWorklogId=845212&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-845212
 ]


ASF GitHub Bot logged work on GOBBLIN-1783:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 13/Feb/23 19:29
            Start Date: 13/Feb/23 19:29
    Worklog Time Spent: 10m 
      Work Description: umustafi commented on PR #3640:
URL: https://github.com/apache/gobblin/pull/3640#issuecomment-1428533867

   1. The current implementation adds the scheduler, then the specConsumer, to 
the list of services. I considered switching the order, but the scheduler needs 
to be initialized before we consume specs and try to add them to the scheduler. 
I need to confirm whether services are initialized in that order or 
concurrently. The specConsumer starts consuming from the latestOffset, so this 
should not miss any specs. The offset won't advance unless the service is up 
and able to accept requests and our consumer is processing. 
   
   2. The problem can come up if we are loading flowSpecA with an old value 
and, while that batch is being processed, an API request updates the flow: the 
consumer calls onAdd with the newer value first, then the scheduler calls it 
with the old value. This is very rare, but we may want to add a modified 
timestamp to avoid it. Technically this _could_ have happened in the previous 
implementation too, although far less often with individual gets, if the 
consumer processed a newer spec version between the get and the add of a spec. 
If we want to use modification time, we need a bigger change to store the 
modified time with the spec, perhaps in `DagManager` or the `Scheduler` itself.
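
To illustrate the modified-timestamp idea from point 2, here is a minimal sketch, not Gobblin code. `VersionedSpec`, `GuardedSpecRegistry`, and the `modifiedTimeMillis` field are hypothetical names, and the sketch assumes every spec carries a modification time that both the consumer path and the scheduler's batch-load path can read.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a flow spec; assumes a modification time is stored with the spec.
final class VersionedSpec {
  final String uri;
  final long modifiedTimeMillis;
  final String payload;

  VersionedSpec(String uri, long modifiedTimeMillis, String payload) {
    this.uri = uri;
    this.modifiedTimeMillis = modifiedTimeMillis;
    this.payload = payload;
  }
}

// Hypothetical registry: keeps the most recently modified spec per URI, regardless of
// whether the consumer or the batch loader calls onAddSpec first.
final class GuardedSpecRegistry {
  private final Map<String, VersionedSpec> specs = new ConcurrentHashMap<>();

  /** Returns true if the incoming spec was accepted, false if it was older than the stored one. */
  boolean onAddSpec(VersionedSpec incoming) {
    boolean[] accepted = {false};
    // compute() runs atomically per key, so two concurrent adds cannot interleave.
    specs.compute(incoming.uri, (uri, current) -> {
      if (current == null || incoming.modifiedTimeMillis >= current.modifiedTimeMillis) {
        accepted[0] = true;
        return incoming;
      }
      return current; // stale value from the batch load: keep the newer spec
    });
    return accepted[0];
  }

  VersionedSpec get(String uri) {
    return specs.get(uri);
  }
}
```

With a guard like this, the ordering described above (consumer adds the newer flowSpecA, then the scheduler adds the older one from its batch) would leave the newer value in place.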




Issue Time Tracking
-------------------

    Worklog Id:     (was: 845212)
    Time Spent: 50m  (was: 40m)

> Initialize scheduler with batch gets instead of individual get per flow
> -----------------------------------------------------------------------
>
>                 Key: GOBBLIN-1783
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1783
>             Project: Apache Gobblin
>          Issue Type: Bug
>          Components: gobblin-service
>            Reporter: Urmi Mustafi
>            Assignee: Abhishek Tiwari
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> We seek to improve initialization time of the JobScheduler upon restart or 
> a leadership change by batching the MySQL queries to get flow specs. 
> Instead of making 1 MySQL get call for each flow execution id, which scales 
> extremely poorly with the number of flows, we should group them to reduce the 
> number of calls and the downtime.
> This implementation adds two new functions to the SpecStore interface, 
> getSortedSpecURIs and getBatchedSpecs, that we use to achieve the batching. 
> Because these two functionalities are generic enough to be used in derived 
> classes of the SpecStore we add them to the base class. Although this 
> requires any child classes to implement these functions, it allows any 
> consumer of the parent class SpecStore to use this functionality without 
> caring about the specific implementation of the SpecStore used (as 
> JobScheduler does). Additionally, getBatchedSpecs requires an offset or 
> starting point from which to obtain each batch, so the consumer has to do some 
> bookkeeping of where it is in the paginated gets, but this again separates the 
> functionality from the use case of the consumer. The entirety of the flow 
> catalog is too large to load into memory for the Scheduler, so we use this 
> batch functionality. 
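
The paginated loading described in the issue can be sketched roughly as follows. This is an assumption-laden illustration, not the code from the PR: `PagedSpecStore`, `InMemoryPagedSpecStore`, and `BatchedSchedulerLoader` are hypothetical, specs are reduced to plain strings, and the `getBatchedSpecs(startOffset, batchSize)` signature is assumed, since the description names the method but not its parameters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, simplified store; the method name echoes the description, the signature is assumed.
interface PagedSpecStore {
  List<String> getBatchedSpecs(int startOffset, int batchSize);
}

// In-memory stand-in for a MySQL-backed store; keeps specs in sorted order so offsets are stable.
final class InMemoryPagedSpecStore implements PagedSpecStore {
  private final List<String> sortedSpecs;

  InMemoryPagedSpecStore(List<String> specs) {
    this.sortedSpecs = new ArrayList<>(specs);
    Collections.sort(this.sortedSpecs);
  }

  @Override
  public List<String> getBatchedSpecs(int startOffset, int batchSize) {
    if (startOffset >= sortedSpecs.size()) {
      return new ArrayList<>();
    }
    int end = Math.min(startOffset + batchSize, sortedSpecs.size());
    return new ArrayList<>(sortedSpecs.subList(startOffset, end));
  }
}

final class BatchedSchedulerLoader {
  /** Loads every spec in pages of batchSize and returns the number of store calls made. */
  static int loadAll(PagedSpecStore store, int batchSize, Consumer<String> scheduleSpec) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    int offset = 0; // the consumer-side bookkeeping of where we are in the paginated gets
    int calls = 0;
    while (true) {
      List<String> batch = store.getBatchedSpecs(offset, batchSize);
      calls++;
      batch.forEach(scheduleSpec);
      offset += batch.size();
      if (batch.size() < batchSize) {
        break; // a short (or empty) page means the catalog is exhausted
      }
    }
    return calls;
  }
}
```

Only one page of specs is held in memory at a time, and the number of store calls grows with the number of pages rather than the number of flows.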



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
