[GitHub] [incubator-devlake] klesh opened a new issue, #3822: [Feature][Infra] ApiCollectors to resume collected data from previous failed collection

GitBox Wed, 30 Nov 2022 02:24:11 -0800


klesh opened a new issue, #3822:
URL: https://github.com/apache/incubator-devlake/issues/3822


   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and 
found no similar feature requirement.
   
   
   ### Use case
   
   1. Shorten overall collection time after a failed collection
   2. Improve the user experience by making collection more robust
   
   ### Description
   
   Many things can go wrong during data collection: network problems, server
   crashes, and so on. When that happens, the data that was already collected
   is collected all over again, which is wasteful and bad for the user
   experience.
   
   Some may point out that we already have the `diff-sync` mechanism for most
   of our important plugins. We do, but it relies on Extracted Data, so it does
   not help when the Collection itself fails.
   
   To be more specific, take collecting 100 pages of Jira issues as an example.
   If the server goes down on the 50th page, users may wait for the server to
   come back and then start another pipeline to collect the data. They would
   certainly want Apache DevLake to pick up from page 51.
   
   
   That all sounds good and smooth, but there are some catches we need to take
   care of:
   1. We collect data in parallel, so failing on the 50th page doesn't mean we
   can pick up from there.
   2. `diff-sync` should be replaced if we opt for the `updated_time`-based
   strategy.
   3. The order of records in the API response must be consistent, which
   depends on the data source's API specification.
   4. How do we know it is legitimate to reuse previously collected data? Based
   on what?
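
One possible way to address catches 1 and 4 is sketched below. This is an illustration only, not DevLake code: the `Checkpoint` type, the `fingerprint` and `pages_to_collect` helpers, and the parameter names are all hypothetical. The sketch records which pages finished instead of a single "last page" number, and it reuses old data only when a hash of the request parameters still matches.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(params: dict) -> str:
    """Stable hash of the request parameters that define one collection."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class Checkpoint:
    """Hypothetical per-collector state persisted alongside the raw data."""
    params_hash: str
    completed_pages: set = field(default_factory=set)


def pages_to_collect(checkpoint, params, total_pages):
    """Return the missing pages, or every page if the checkpoint is unusable."""
    all_pages = range(1, total_pages + 1)
    if checkpoint is None or checkpoint.params_hash != fingerprint(params):
        # Different query, page size, or ordering: old pages cannot be trusted.
        return list(all_pages)
    return [p for p in all_pages if p not in checkpoint.completed_pages]


# Assumed example parameters; the keys are made up for illustration.
params = {"board_id": 8, "page_size": 100, "order_by": "created ASC"}

# Parallel workers finished pages 1-47, 49 and 50 before the crash; 48 failed.
cp = Checkpoint(fingerprint(params), set(range(1, 48)) | {49, 50})
remaining = pages_to_collect(cp, params, total_pages=100)
print(remaining[:3])  # [48, 51, 52]
```

Tracking a set of completed pages means a gap left by a failed worker (page 48 here) is collected again even though later pages succeeded. The sketch still assumes the API returns records in a stable order (catch 3); if it does not, page numbers from the earlier run may no longer map to the same records.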
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


