kyleliu1008 commented on issue #5493:
URL: 
https://github.com/apache/incubator-devlake/issues/5493#issuecomment-1614032698

   There is another issue related to the data in the 
"_raw_gitlab_api_job" table: while a pipeline is in a non-completed state, 
DevLake re-collects all jobs under that pipeline on every collection run, so 
the job rows for that pipeline in "_raw_gitlab_api_job" are duplicated each 
time and the table keeps growing. The problem is most noticeable when there 
are many blocked pipelines or when pipelines have long lifecycles (of course, 
both are things that better CI practices should reduce).
   
   The raw table is, of course, meant to record the data exactly as it was 
collected, but in my DevLake deployment it quickly reached millions of rows. 
This is related to the large number of invalid pipelines our project 
generates (which we are also working on resolving). It would be ideal if 
DevLake could recognize that a pipeline and its jobs have not changed and 
skip collecting them again.

