[GitHub] [druid] maytasm opened a new pull request #9965: API to verify a datasource has the latest ingested data

GitBox Mon, 01 Jun 2020 21:14:08 -0700


maytasm opened a new pull request #9965:
URL: https://github.com/apache/druid/pull/9965



   API to verify a datasource has the latest ingested data
   
   ### Description
   
   This PR addresses https://github.com/apache/druid/issues/5721
   
   The existing loadstatus API reads segments from the Coordinator's 
SqlSegmentsMetadataManager, which caches segments in memory and refreshes 
them periodically. This creates a race condition, because the API compares 
segment metadata from that cache against the segments published by the 
historicals. In particular, when a new ingestion happens after the initial load 
of the datasource, the cache still contains only the metadata of the old 
segments. The API then compares the list of old segments with what the 
historicals have published and reports that everything is available, even 
though the new segments are not actually available yet. 
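   
   The race described above can be illustrated with a toy model. This is only a 
sketch: the function name and the segment ids are made up for illustration and 
are not Druid classes or real segment identifiers.

```python
# Toy model of the stale-cache race. All names here are hypothetical.

def all_known_segments_loaded(known_segments, loaded_segments):
    """Return True when every segment in known_segments is also loaded."""
    return set(known_segments) <= set(loaded_segments)


# Coordinator's in-memory cache, not yet refreshed after the new ingestion.
cached = {"seg_old"}
# Segments actually published in the metadata store after the new ingestion.
published = {"seg_old", "seg_new"}
# Segments currently served by historicals.
loaded = {"seg_old"}

# Comparing against the stale cache claims the datasource is fully loaded.
stale_answer = all_known_segments_loaded(cached, loaded)
# Comparing against freshly polled metadata reveals the missing segment.
fresh_answer = all_known_segments_loaded(published, loaded)
```

   In this model the stale comparison succeeds only because the cache does not 
yet know about the new segment, which is the situation a forced metadata poll is 
meant to avoid.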
   
   The new API fixes this problem. It supports the following:
   - The new API takes a datasource. It returns false if any used segment 
(from the past 2 weeks) of the given datasource is not available to be queried 
(i.e. not yet loaded onto a historical), and true otherwise. The 2-week 
interval is not finalized; we can decide later what a good default 
is.
   
   - The same new API takes a datasource and a time interval (start + end). It 
returns false if any used segment (between the given start and end times) of 
the given datasource is not available to be queried (i.e. not yet loaded onto 
a historical), and true otherwise.
   
   Note that both of the above are the same API; the time interval is an 
optional parameter. The time interval refers to the timestamps of the data in 
the segment (it has nothing to do with when the segment was ingested). It can 
be the same interval the user wants to query. For example, if the user wants 
to query from x to y, they can call the new API with the datasource and the 
interval x to y. A true result then means that all segments of the datasource 
with timestamps from x to y are ready to be queried (loaded onto 
historicals).
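   
   The interval semantics above can be sketched as follows. This is an 
illustration under assumptions, not the PR's implementation: the `Segment` 
type, the half-open overlap rule, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a used segment and its data interval."""
    segment_id: str
    start: datetime
    end: datetime


def overlaps(segment, start, end):
    """True when the segment's data interval intersects [start, end)."""
    return segment.start < end and start < segment.end


def is_interval_available(used_segments, loaded_ids, start, end):
    """True when every used segment overlapping [start, end) is loaded."""
    return all(
        s.segment_id in loaded_ids
        for s in used_segments
        if overlaps(s, start, end)
    )


used = [
    Segment("may", datetime(2020, 5, 1), datetime(2020, 5, 2)),
    Segment("june", datetime(2020, 6, 1), datetime(2020, 6, 2)),
]
loaded = {"may"}  # the "june" segment is not on a historical yet

may_only = is_interval_available(
    used, loaded, datetime(2020, 5, 1), datetime(2020, 5, 2))
may_to_june = is_interval_available(
    used, loaded, datetime(2020, 5, 1), datetime(2020, 6, 2))
```

   Only segments whose data interval intersects the requested range take part 
in the check, so a narrower interval can report true while a wider one still 
reports false.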
   
   Important differences between this API and the existing Coordinator 
loadstatus API:
   - Takes a datasource (required), so the check is faster (it iterates over 
fewer segments).
   - Takes an interval (optional), so the check is faster (it iterates over 
fewer segments).
   - **IMPORTANT**: Takes a boolean firstCheck. If true, the API forces a 
poll of the metadata source to get the latest published segment information.
   
   The workflow will be:
   
   1) Submit the ingestion task.
   
   2) Poll the task API until the task succeeds.
   
   3) Poll the new API once with datasource, interval, and firstCheck=true. If 
it returns false, go to step 4; otherwise the data is available and the user 
can query.
   
   4) Poll the new API with datasource, interval, and firstCheck=false until 
it returns true. After that, the data is available and the user can query.
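   
   Steps 3 and 4 of the workflow can be sketched as a small client-side loop. 
The endpoint path is not given in this description, so the sketch takes a 
caller-supplied `check_loaded` callable (hypothetical) that wraps the HTTP call 
and returns the API's boolean; the polling interval and the cap on polls are 
also assumptions.

```python
import time


def wait_until_queryable(check_loaded, poll_interval_s=5.0, max_polls=60,
                         sleep=time.sleep):
    """Poll a load-check callable until the data is reported available.

    check_loaded(first_check=...) is a hypothetical wrapper around the new
    API for one datasource and interval. Returns True once it reports true,
    or False if max_polls follow-up polls all report false.
    """
    # Step 3: a single call with firstCheck=true forces a metadata poll.
    if check_loaded(first_check=True):
        return True
    # Step 4: keep polling with firstCheck=false until it returns true.
    for _ in range(max_polls):
        sleep(poll_interval_s)
        if check_loaded(first_check=False):
            return True
    return False
```

   Passing `sleep` as a parameter keeps the loop testable without real delays. 
The cap on polls is a choice made for this sketch; the workflow above does not 
specify a timeout.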
   
   This PR has:
   - [x] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


