narrowizard commented on issue #8523:
URL: https://github.com/apache/incubator-devlake/issues/8523#issuecomment-3162644089

   After further testing, we have confirmed that GitLab Server shows the same 
behavior, which is consistent with the current design. In a GitLab Server 
instance with over 3000 users, integrating more than 2000 repositories results 
in each user record being stored more than 2000 times in the 
`_raw_gitlab_api_users` table. That adds up to more than 6 million raw rows 
for user data alone.
   
   This confirms a problem in the current design: the raw layer does not 
deduplicate records, so the same user is stored once per repository and the 
data volume balloons. This not only puts pressure on storage but also 
complicates subsequent data processing tasks.
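   
   To make the scale concrete, here is a minimal sketch of the effect. It is 
not DevLake code: the function names are hypothetical, the user and repository 
counts are the lower bounds quoted above, and it assumes every repository's 
collection stores every user it sees.

```python
def raw_rows(user_ids, repo_ids):
    """Yield one (repo_id, user_id) row per repository and user.

    Hypothetical stand-in for a raw layer that stores whatever each
    repository's collection returns, without deduplication.
    """
    for repo_id in repo_ids:
        for user_id in user_ids:
            yield (repo_id, user_id)


def count_rows_and_unique_users(rows):
    """Return (total raw rows, number of distinct user ids)."""
    total = 0
    seen_users = set()
    for _repo_id, user_id in rows:
        total += 1
        seen_users.add(user_id)
    return total, len(seen_users)


# Lower bounds taken from the figures above (assumed, not measured here).
USER_COUNT = 3000
REPO_COUNT = 2000

total, unique = count_rows_and_unique_users(
    raw_rows(range(USER_COUNT), range(REPO_COUNT))
)
print(f"raw rows: {total}, distinct users: {unique}, ratio: {total // unique}x")
```

   Under these assumptions the raw row count is the product of users and 
repositories, while the number of distinct users stays fixed, which is why 
deduplicating on a user key would shrink the table by roughly the number of 
repositories.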
   
   We need to continue exploring more effective solutions to address this issue 
and improve our data collection and processing efficiency.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.