klesh commented on issue #4247:
URL: 
https://github.com/apache/incubator-devlake/issues/4247#issuecomment-1455819161

   However, depending only on 
`_devlake_collector_latest_state.latest_success_start` is not reliable.
   
   ## The following factors must be considered:
   
   1. The `extractors` and `converters` work on the `delete and 
insert` principle, without any knowledge of their preceding `subtasks`.
   2. With the current design, users might collect data multiple times without 
running `extraction` or `conversion`.
   3. The relationship between `collectors`, `extractors`, and `converters` is 
**NOT** 1:1:1. Remember that some `subtasks` produce multiple kinds of 
records: for example, the `jira issue extractor` produces `issues`, 
`changelog`, and others. Vice versa, a set of records of a specific scope might 
come from multiple upstream `subtasks`: for example, `changelog` could come 
from the `issue collector` or the `changelog collector`. The dependencies could 
get quite messy if we depend only on `_devlake_collector_latest_state`.
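
To make point 3 concrete, here is a rough Go sketch of the many-to-many 
relationship. The map layout and the `upstreamOf` helper are hypothetical and 
only list the Jira relationships named above. It is a simplification, not how 
DevLake stores this.

```go
package main

import (
	"fmt"
	"sort"
)

// produces lists, per subtask, the record kinds it can yield.
// Hypothetical and simplified: only the relationships named in this comment.
var produces = map[string][]string{
	"jira issue extractor": {"issues", "changelog"},
	"issue collector":      {"changelog"},
	"changelog collector":  {"changelog"},
}

// upstreamOf returns, in sorted order, every subtask that can yield the given record kind.
func upstreamOf(kind string) []string {
	var subtasks []string
	for subtask, kinds := range produces {
		for _, k := range kinds {
			if k == kind {
				subtasks = append(subtasks, subtask)
				break
			}
		}
	}
	sort.Strings(subtasks)
	return subtasks
}

func main() {
	// One record kind, several upstream subtasks: no single collector state covers it.
	fmt.Println(upstreamOf("changelog"))
	fmt.Println(upstreamOf("issues"))
}
```

Because `changelog` has more than one upstream subtask, no single collector's 
state can say whether the existing `changelog` records are still valid.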
   
   In summary, it is hard and unreliable to distinguish `Incremental` and 
`FullRefresh` by examining the state of the preceding `collector`.
   
   ## Proposal
   
   I think it would be easier for us to track the state of the `subtasks` 
(`extractors` and `converters`) without introducing dependencies.
   
   1. Each `extractor` or `converter` should have its own state
   2. The `ExtractorHelper` and `ConverterHelper` should support the 
`IsIncremental` option and determine whether to delete the existing records or 
not, just like the `CollectorHelper`. However, the `IsIncremental` conditions 
are different.
   3. The `IsIncremental` for `ExtractorHelper` and `ConverterHelper` can be 
determined by simply comparing the subtask's own `latest_success_start` with 
`max(created_at)` (where `created_at` is the time the record was created in the 
database).
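
A minimal Go sketch of item 3, under stated assumptions. The proposal does not 
spell out the direction of the comparison, so this assumes that `created_at` is 
taken over the subtask's input records, that a subtask with no previous 
successful run does a full refresh, and that input created after 
`latest_success_start` is new work. `SubtaskState`, `Plan`, `planRun`, and `at` 
are hypothetical names, not existing DevLake APIs.

```go
package main

import (
	"fmt"
	"time"
)

// SubtaskState is a hypothetical per-subtask state row (item 1).
type SubtaskState struct {
	LatestSuccessStart *time.Time // nil until the subtask has succeeded once
}

// Plan is the hypothetical outcome of the comparison.
type Plan struct {
	Incremental bool // keep existing records instead of delete-and-insert
	HasNewInput bool // some input record is newer than the last successful start
}

// planRun compares the subtask's own latest_success_start with max(created_at).
// A zero maxCreatedAt means there are no input records at all.
// ASSUMPTION: the comparison direction is this sketch's guess, not stated in the proposal.
func planRun(state SubtaskState, maxCreatedAt time.Time) Plan {
	if state.LatestSuccessStart == nil {
		// Never succeeded before: fall back to a full refresh.
		return Plan{Incremental: false, HasNewInput: !maxCreatedAt.IsZero()}
	}
	return Plan{
		Incremental: true,
		HasNewInput: maxCreatedAt.After(*state.LatestSuccessStart),
	}
}

// at builds a fixed UTC timestamp in March 2023 for the demo.
func at(day int) time.Time {
	return time.Date(2023, time.March, day, 0, 0, 0, 0, time.UTC)
}

func main() {
	last := at(1)
	fmt.Printf("%+v\n", planRun(SubtaskState{}, at(2)))                          // first run
	fmt.Printf("%+v\n", planRun(SubtaskState{LatestSuccessStart: &last}, at(2))) // new input
	fmt.Printf("%+v\n", planRun(SubtaskState{LatestSuccessStart: &last}, at(1))) // nothing new
}
```

The point of the sketch is that the decision uses only the subtask's own state 
and its records' timestamps, so no dependency on a preceding `collector` is 
needed.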
   
   


