[
https://issues.apache.org/jira/browse/GOBBLIN-1783?focusedWorklogId=845231&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-845231
]
ASF GitHub Bot logged work on GOBBLIN-1783:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 13/Feb/23 23:20
Start Date: 13/Feb/23 23:20
Worklog Time Spent: 10m
Work Description: AndyJiang99 commented on code in PR #3640:
URL: https://github.com/apache/gobblin/pull/3640#discussion_r1105121542
##########
gobblin-runtime/src/main/java/org/apache/gobblin/runtime/spec_store/MysqlBaseSpecStore.java:
##########
@@ -84,6 +84,7 @@ public class MysqlBaseSpecStore extends InstrumentedSpecStore {
   private static final String GET_ALL_STATEMENT = "SELECT spec_uri, spec FROM %s";
   private static final String GET_ALL_URIS_STATEMENT = "SELECT spec_uri FROM %s";
   private static final String GET_ALL_URIS_WITH_TAG_STATEMENT = "SELECT spec_uri FROM %s WHERE tag = ?";
+  private static final String GET_SPECS_BATCH_STATEMENT = "SELECT spec_uri, spec FROM %s ORDER BY spec_uri ASC LIMIT ? OFFSET ?";
Review Comment:
1. Was there any reason why spec_json was removed from the query string?
2. This query will fail if OFFSET or LIMIT is set to a negative value, because
MySQL rejects negative values for either clause. This [if
block](https://github.com/apache/gobblin/pull/3640/files#diff-900cc8e4e863a8057e0e808230a2f9c3c169048059d62a5d94ed9beb94360c12L283-L293)
handles such cases, so we'll need to add something similar here.
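For illustration, the guard asked for in point 2 could look roughly like the sketch below. `BatchBounds` and its methods are hypothetical names, not part of `MysqlBaseSpecStore`, and this is not the `if` block linked above; it only shows one way to reject negative values before they are bound to `LIMIT ? OFFSET ?`.

```java
/** Hypothetical helper for illustration only; not part of MysqlBaseSpecStore. */
final class BatchBounds {
  private BatchBounds() {}

  /** Returns true only when both values are safe to bind to "LIMIT ? OFFSET ?". */
  static boolean isValid(int limit, int offset) {
    return limit >= 0 && offset >= 0;
  }

  /** Fails fast with a readable message instead of surfacing a SQL error later. */
  static void check(int limit, int offset) {
    if (!isValid(limit, offset)) {
      throw new IllegalArgumentException(
          String.format("LIMIT (%d) and OFFSET (%d) must both be non-negative", limit, offset));
    }
  }
}
```

Whether a negative value should raise an error or fall back to some other query is a design choice the sketch leaves open; it assumes failing fast is acceptable.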
Issue Time Tracking
-------------------
Worklog Id: (was: 845231)
Time Spent: 1h 10m (was: 1h)
> Initialize scheduler with batch gets instead of individual get per flow
> -----------------------------------------------------------------------
>
> Key: GOBBLIN-1783
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1783
> Project: Apache Gobblin
> Issue Type: Bug
> Components: gobblin-service
> Reporter: Urmi Mustafi
> Assignee: Abhishek Tiwari
> Priority: Major
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> We seek to improve the initialization time of the JobScheduler upon restart or
> leadership change by batching the MySQL queries that get flow specs.
> Instead of making one MySQL get call per flow execution id, which scales
> extremely poorly with the number of flows, we should group the calls to reduce
> both the number of calls and the downtime.
> This implementation adds two new functions to the SpecStore interface,
> getSortedSpecURIs and getBatchedSpecs, that we use to achieve the batching.
> Because these two functions are generic enough to be used in derived
> classes of SpecStore, we add them to the base class. Although this
> requires every child class to implement them, it allows any
> consumer of the parent class SpecStore to use this functionality without
> caring about the specific SpecStore implementation in use (as
> JobScheduler does). Additionally, getBatchedSpecs requires an offset, or
> starting point, from which to fetch each batch, so the consumer has to do some
> bookkeeping of its position in the paginated gets, but this again separates the
> functionality from the consumer's use case. The entirety of the flow
> catalog is too large to load into memory for the Scheduler, so we use this
> batch functionality.
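
The consumer-side bookkeeping described in the issue can be sketched as follows. `BatchSource` and `forEachBatch` are hypothetical stand-ins, not the actual `getBatchedSpecs` signature, which this message does not show. The sketch assumes the store returns at most `batchSize` items per call, in a stable order, starting at the given offset.

```java
import java.util.List;
import java.util.function.Consumer;

final class BatchedLoad {
  private BatchedLoad() {}

  /** Hypothetical stand-in for a store's batched read; not the SpecStore API. */
  interface BatchSource<T> {
    List<T> getBatch(int offset, int batchSize);
  }

  /**
   * Reads the source one batch at a time and hands each batch to the handler,
   * so the full catalog is never held in memory at once.
   * Returns the total number of items processed.
   */
  static <T> int forEachBatch(BatchSource<T> source, int batchSize, Consumer<List<T>> handler) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive: " + batchSize);
    }
    int offset = 0; // the consumer's only bookkeeping: position in the paginated gets
    while (true) {
      List<T> batch = source.getBatch(offset, batchSize);
      if (batch.isEmpty()) {
        break; // nothing left to read
      }
      handler.accept(batch);
      offset += batch.size();
      if (batch.size() < batchSize) {
        break; // a short batch means this was the last page
      }
    }
    return offset;
  }
}
```

Because the loop depends only on the small interface, it works the same way for any store implementation, which is the separation the description argues for.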
--
This message was sent by Atlassian Jira
(v8.20.10#820010)