klesh commented on issue #5493: URL: https://github.com/apache/incubator-devlake/issues/5493#issuecomment-1617206578
> In fact, there is another phenomenon related to the data in the "_raw_gitlab_api_job" table: when the pipeline is in a non-completed state, devlake will repeatedly collect all jobs under the pipeline, causing the jobs data corresponding to the pipeline in the "_raw_gitlab_api_job" table to double each time it is collected, leading to the increasing size of the table. This problem becomes more prominent when there are a large number of blocked pipelines or when pipelines have long lifecycles (of course, these issues should be improved in CI practices).
>
> Of course, the raw table is meant to record the originally collected data, but the data volume of the devlake table I deployed quickly reached millions. This is related to the fact that our project generates a lot of invalid pipelines (which we are also working on resolving). If devlake could recognize that the pipeline and its corresponding job have not changed and not collect them again, it would be perfect.

@kyleliu1008 Yes, I understand your point perfectly. However, our hands are tied: the GitLab API doesn't support filtering records by the `updated` field, so the idea that "devlake could recognize that the pipeline and its corresponding job have not changed and not collect them again" is logically impossible. We simply cannot tell whether a `pipeline/job` has been updated without fetching it from the API.

In case you wonder whether we could compare the fetched record against the database before writing to the `_raw_table`, the answer is NO, since that would create a cyclic dependency between subtasks, which would cause problems for #5411.

Anyway, it won't be a problem if we collect only FINISHED records, which the GitLab API supports through the `scope` parameter. Do you think that is OK for your case?
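To make the `scope` idea concrete, here is a minimal sketch of a request URL that asks GitLab only for jobs in a terminal state. It is not DevLake's actual collector code. The helper name `finished_jobs_url`, the host `gitlab.example.com`, and the choice of which job statuses count as "finished" are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumption: these job statuses are treated as "finished" (terminal) here.
FINISHED_JOB_SCOPES = ("success", "failed", "canceled", "skipped")


def finished_jobs_url(base_url, project_id, pipeline_id, page=1, per_page=100):
    """Build a URL for GitLab's "list pipeline jobs" endpoint, restricted to
    finished jobs via the repeated `scope[]` query parameter.

    `finished_jobs_url` is a hypothetical helper, not part of DevLake.
    """
    query = [("scope[]", scope) for scope in FINISHED_JOB_SCOPES]
    query += [("page", page), ("per_page", per_page)]
    return (
        f"{base_url}/api/v4/projects/{project_id}"
        f"/pipelines/{pipeline_id}/jobs?{urlencode(query)}"
    )


if __name__ == "__main__":
    # Hypothetical host, project ID, and pipeline ID.
    print(finished_jobs_url("https://gitlab.example.com", 42, 1001))
```

With a filter like this, jobs of a pipeline that is still running or blocked would not be fetched on every run, so they would not be written to the raw table again until they finish.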
