klesh commented on issue #4247: URL: https://github.com/apache/incubator-devlake/issues/4247#issuecomment-1455819161
However, simply depending on `_devlake_collector_latest_state.latest_success_start` is not reliable.

## The following factors must be considered:

1. The `extractors` and `converters` work under the `delete and insert` principle, without any knowledge of their preceding `subtasks`.
2. With the current design, users might collect data multiple times without running `extraction` or `conversion`.
3. The relationship between `collectors`, `extractors`, and `converters` is **NOT** 1:1:1. Remember that some `subtasks` might produce multiple kinds of records; for example, the `jira issue extractor` produces `issues`, `changelog`, and others. Vice versa, a set of records of a specific scope might come from multiple upstream `subtasks`; for example, `changelog` could come from the `issue collector` or the `changelog collector`. The dependency could be quite messy if we depend only on `_devlake_collector_latest_state`.

In summary, it is hard and unreliable to distinguish `Incremental` from `FullRefresh` by examining the state of the preceding `collector`.

## Proposal

I think it would be easier for us to track the state of `subtasks` (`extractors` and `converters`) without introducing dependencies.

1. Each `extractor` or `converter` should have its own state.
2. The `ExtractorHelper` and `ConverterHelper` should support the `IsIncremental` option and determine whether to delete the existing records or not, just like the `CollectorHelper`. However, the `IsIncremental` conditions are different.
3. The `IsIncremental` for `ExtractorHelper` and `ConverterHelper` can be determined by simply comparing its `latest_success_start` with the `max(created_at)` (`created_at` is the time the record was created in the database).

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
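
To make item 3 of the proposal concrete, here is a minimal Go sketch of one possible reading of the comparison. It is not DevLake code: `SubtaskState`, `Decision`, `decide`, and `mustParse` are hypothetical names, and the rule itself is an assumption (no previous success means a full refresh; otherwise the subtask runs incrementally and only has work to do when the newest input record was created after its `latest_success_start`). The comment above does not specify the direction of the comparison or which table `max(created_at)` is taken from, so treat this as an illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// SubtaskState is a hypothetical per-subtask state record (proposal item 1).
// LatestSuccessStart is nil when the extractor/converter has never succeeded.
type SubtaskState struct {
	LatestSuccessStart *time.Time
}

// Decision is a hypothetical result type describing what a helper should do.
type Decision struct {
	IsIncremental bool      // false: delete existing records, then insert
	HasNewInput   bool      // false: no input record is newer than the last success
	Since         time.Time // incremental mode: only input created after this instant
}

// decide compares the subtask's own latest_success_start with max(created_at)
// of its input records. The rule is an assumed interpretation, not a spec.
func decide(state SubtaskState, maxCreatedAt time.Time) Decision {
	if state.LatestSuccessStart == nil {
		// Never succeeded before: nothing to build on, so do a full refresh.
		return Decision{IsIncremental: false, HasNewInput: !maxCreatedAt.IsZero()}
	}
	since := *state.LatestSuccessStart
	return Decision{
		IsIncremental: true,
		HasNewInput:   maxCreatedAt.After(since),
		Since:         since,
	}
}

// mustParse is a small helper for the example; it panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	last := mustParse("2023-03-06T09:00:00Z")
	newest := mustParse("2023-03-06T10:00:00Z")

	first := decide(SubtaskState{}, newest)
	again := decide(SubtaskState{LatestSuccessStart: &last}, newest)

	fmt.Printf("first run: incremental=%v\n", first.IsIncremental)
	fmt.Printf("later run: incremental=%v newInput=%v\n", again.IsIncremental, again.HasNewInput)
}
```

Because each subtask only looks at its own state and at the timestamps of the records it reads, this kind of check does not need to know which upstream `collector` produced those records, which is the point of the proposal.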
