narrowizard commented on issue #8523: URL: https://github.com/apache/incubator-devlake/issues/8523#issuecomment-3162644089
After further testing, we have confirmed that GitLab Server shows the same behavior, which is consistent with the current design. On a GitLab Server instance with over 3000 users, integrating more than 2000 repositories results in each user record being stored more than 2000 times in the `_raw_gitlab_api_users` table. That is more than 6 million raw rows for a few thousand distinct users.

This confirms the problem with the current design: the raw layer performs no deduplication, so the data volume balloons with duplicate records. That puts pressure on storage and also complicates subsequent data processing. We need to keep exploring more effective solutions to this issue to improve data collection and processing efficiency.
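To make the scale in the comment above concrete, here is a minimal Python sketch of the arithmetic. It is not DevLake code. It assumes, as the comment describes, that collecting each repository writes the full instance-wide user list into the raw table again. The function names and the `(repo_id, user_id)` row shape are hypothetical, and the figures 3000 and 2000 are the lower bounds quoted in the comment.

```python
from collections import Counter


def raw_user_rows(num_users: int, num_repos: int) -> int:
    """Rows written if every repository's collection stores every user again."""
    return num_users * num_repos


def simulate(num_users: int, num_repos: int) -> list[tuple[int, int]]:
    """Small in-memory stand-in for the raw table: one (repo_id, user_id) row
    per user per collected repository (hypothetical row shape)."""
    return [
        (repo_id, user_id)
        for repo_id in range(num_repos)
        for user_id in range(num_users)
    ]


def dedupe_by_user(rows: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Keep only the first row seen for each user_id."""
    first_seen: dict[int, tuple[int, int]] = {}
    for repo_id, user_id in rows:
        first_seen.setdefault(user_id, (repo_id, user_id))
    return list(first_seen.values())


if __name__ == "__main__":
    # Lower-bound figures from the comment: 3000 users, 2000 repositories.
    print(raw_user_rows(3000, 2000))  # 6000000

    # Scaled-down simulation so the rows can actually be materialized.
    rows = simulate(num_users=30, num_repos=20)
    copies = Counter(user_id for _, user_id in rows)
    print(len(rows), max(copies.values()), len(dedupe_by_user(rows)))  # 600 20 30
```

Deduplicating by user ID in this sketch shrinks the row count by a factor equal to the number of repositories, which is the saving a raw-layer deduplication step could aim for under the stated assumption.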