kyleliu1008 commented on issue #5493: URL: https://github.com/apache/incubator-devlake/issues/5493#issuecomment-1614032698
In fact, there is another issue related to the data in the "_raw_gitlab_api_job" table: while a pipeline is in a non-completed state, devlake re-collects all jobs under that pipeline on every collection run, so the pipeline's job records in "_raw_gitlab_api_job" are duplicated each time and the table keeps growing. The problem is more pronounced when there are many blocked pipelines or when pipelines have long lifecycles (of course, these issues should be addressed through better CI practices).

I understand that the raw table is meant to record the data as originally collected, but the table in my devlake deployment quickly reached millions of records. This is partly because our project generates a lot of invalid pipelines (which we are also working on resolving). It would be ideal if devlake could recognize that a pipeline and its corresponding jobs have not changed and skip collecting them again.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
