[ 
https://issues.apache.org/jira/browse/GOBBLIN-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Urmi Mustafi updated GOBBLIN-1783:
----------------------------------
    Description: 
We seek to improve the initialization time of the JobScheduler after a restart 
or leadership change by batching the MySQL queries that fetch flow specs. 
Instead of making one MySQL get call for each flow execution id, which scales 
extremely poorly with the number of flows, we should group the calls to reduce 
both their number and the downtime.

This implementation adds two new functions to the SpecStore interface, 
getSortedSpecURIs and getBatchedSpecs, that we use to achieve the batching. 
Because both are generic enough to be useful in any class derived from 
SpecStore, we add them to the base class. Although this requires every child 
class to implement them, it allows any consumer of the parent SpecStore to use 
the functionality without caring about the specific SpecStore implementation 
in use (as JobScheduler does). 
Additionally, getBatchedSpecs requires an offset, or starting point, from 
which to obtain each batch, so the consumer has to do some bookkeeping of 
where it is in the paginated gets, but this again separates the functionality 
from the consumer's use case. The entirety of the flow catalog is too large to 
load into memory for the Scheduler, so we use this batch functionality. 

  was:
We seek to improve the initialization time of the JobScheduler after a restart 
or leadership change by batching the MySQL queries that fetch flow specs. 
Instead of making one MySQL get call for each flow execution id, which scales 
extremely poorly with the number of flows, we should group the calls to reduce 
both their number and the downtime.

This implementation adds two new functions to the SpecStore interface, 
getSortedSpecs and getBatchedSpecs, that we use to achieve the batching. 
Because both are generic enough to be useful in any class derived from 
SpecStore, we add them to the base class. Although this requires every child 
class to implement them, it allows any consumer of the parent SpecStore to use 
the functionality without caring about the specific SpecStore implementation 
in use (as JobScheduler does). 
Additionally, getBatchedSpecs requires an offset, or starting point, from 
which to obtain each batch, so the consumer has to do some bookkeeping of 
where it is in the paginated gets, but this again separates the functionality 
from the consumer's use case. The entirety of the flow catalog is too large to 
load into memory for the Scheduler, so we use this batch functionality. 


> Initialize scheduler with batch gets instead of individual get per flow
> -----------------------------------------------------------------------
>
>                 Key: GOBBLIN-1783
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1783
>             Project: Apache Gobblin
>          Issue Type: Bug
>          Components: gobblin-service
>            Reporter: Urmi Mustafi
>            Assignee: Abhishek Tiwari
>            Priority: Major
>
> We seek to improve the initialization time of the JobScheduler after a 
> restart or leadership change by batching the MySQL queries that fetch flow 
> specs. Instead of making one MySQL get call for each flow execution id, 
> which scales extremely poorly with the number of flows, we should group the 
> calls to reduce both their number and the downtime.
> This implementation adds two new functions to the SpecStore interface, 
> getSortedSpecURIs and getBatchedSpecs, that we use to achieve the batching. 
> Because both are generic enough to be useful in any class derived from 
> SpecStore, we add them to the base class. Although this requires every child 
> class to implement them, it allows any consumer of the parent SpecStore to 
> use the functionality without caring about the specific SpecStore 
> implementation in use (as JobScheduler does). Additionally, getBatchedSpecs 
> requires an offset, or starting point, from which to obtain each batch, so 
> the consumer has to do some bookkeeping of where it is in the paginated 
> gets, but this again separates the functionality from the consumer's use 
> case. The entirety of the flow catalog is too large to load into memory for 
> the Scheduler, so we use this batch functionality. 
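
For illustration only, the offset bookkeeping described above could look 
roughly like the sketch below. Only the names getSortedSpecURIs and 
getBatchedSpecs come from the issue; their parameters and return types, the 
BatchedSpecStore interface, the InMemorySpecStore class, and the scheduleAll 
helper are all assumptions made for this example and are not the actual 
Gobblin API. Using the URI list to bound the loop is likewise just one 
possible reading of how the two functions fit together.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical, simplified stand-in for the SpecStore interface.
// The real Gobblin signatures may differ.
interface BatchedSpecStore<S> {
    // All spec URIs in a stable, sorted order.
    List<URI> getSortedSpecURIs();

    // Up to batchSize specs, starting at startOffset in the sorted order.
    List<S> getBatchedSpecs(int startOffset, int batchSize);
}

// Hypothetical in-memory store; a MySQL-backed store would instead issue
// one paginated query per getBatchedSpecs call.
final class InMemorySpecStore implements BatchedSpecStore<String> {
    private final TreeMap<URI, String> specs = new TreeMap<>();

    void put(URI uri, String spec) {
        specs.put(uri, spec);
    }

    @Override
    public List<URI> getSortedSpecURIs() {
        return new ArrayList<>(specs.keySet());
    }

    @Override
    public List<String> getBatchedSpecs(int startOffset, int batchSize) {
        if (startOffset < 0 || batchSize <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and batch size > 0");
        }
        List<String> all = new ArrayList<>(specs.values()); // ordered by URI
        int from = Math.min(startOffset, all.size());
        int count = Math.min(batchSize, all.size() - from);
        return new ArrayList<>(all.subList(from, from + count));
    }
}

final class BatchedSchedulerInit {
    // The consumer owns the offset bookkeeping. Returns the number of
    // batched get calls made, to contrast with one call per flow.
    static int scheduleAll(BatchedSpecStore<String> store, int batchSize,
                           List<String> scheduled) {
        int total = store.getSortedSpecURIs().size();
        int offset = 0;
        int calls = 0;
        while (offset < total) {
            List<String> batch = store.getBatchedSpecs(offset, batchSize);
            calls++;
            if (batch.isEmpty()) {
                break; // store shrank since the URIs were listed
            }
            scheduled.addAll(batch); // stand-in for scheduling each flow
            offset += batch.size();
        }
        return calls;
    }
}
```

In this sketch, the number of store calls grows with the number of flows 
divided by the batch size rather than with the number of flows itself, and 
only one batch is held in memory at a time.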



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
