narrowizard opened a new issue, #8523: URL: https://github.com/apache/incubator-devlake/issues/8523
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and found no similar issues.

### What happened

We integrated a GitLab data source with over 2000 repositories into Apache DevLake. After running the data collection pipeline for a period, we observed that the `_raw_gitlab_api_users` table has grown to an unexpectedly large size, reaching approximately 21GB. Upon querying the table, we found a significant number of duplicate records for user data. This suggests that the raw layer data cleaning process for GitLab API users might not be functioning effectively.

### What do you expect to happen

We expect the `_raw_gitlab_api_users` table to contain unique user records without excessive duplicates. The table size should be reasonable and reflect the actual number of unique users, not inflated by redundant entries. The raw layer data cleaning mechanism should correctly deduplicate user data ingested from GitLab.

### How to reproduce

1. Integrate a GitLab data source in Apache DevLake.
2. Configure the data source to collect data from a large number of repositories (e.g., 2000+).
3. Run the data collection pipeline for an extended period (e.g., several days or weeks).
4. Monitor the size of the `_raw_gitlab_api_users` table in your database.
5. Query the `_raw_gitlab_api_users` table to identify and count duplicate user records.

### Anything else

_No response_

### Version

main

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@devlake.apache.org.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
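
For reference, the duplicate count described in step 5 of "How to reproduce" could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the raw table stores each API response in a `data` column next to a `params` column, it uses an in-memory SQLite table with hypothetical sample rows as a stand-in for the real database, and the helper name `count_duplicates` is made up for this example. On MySQL, the identifier quoting would need to be adapted.

```python
import sqlite3


def count_duplicates(conn, table="_raw_gitlab_api_users", column="data"):
    """Return (total_rows, distinct_payloads, redundant_rows) for a raw table.

    Table and column names cannot be bound as SQL parameters, so only
    trusted values should be passed for them.
    """
    total, distinct = conn.execute(
        f'SELECT COUNT(*), COUNT(DISTINCT "{column}") FROM "{table}"'
    ).fetchone()
    return total, distinct, total - distinct


# Stand-in for the real database: an in-memory table with an ASSUMED layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "_raw_gitlab_api_users" '
    "(id INTEGER PRIMARY KEY, params TEXT, data TEXT)"
)

# Hypothetical sample rows: the same user payload stored several times.
rows = [
    ('{"ProjectId":1}', '{"id":10,"username":"alice"}'),
    ('{"ProjectId":2}', '{"id":10,"username":"alice"}'),
    ('{"ProjectId":2}', '{"id":11,"username":"bob"}'),
    ('{"ProjectId":3}', '{"id":10,"username":"alice"}'),
]
conn.executemany(
    'INSERT INTO "_raw_gitlab_api_users" (params, data) VALUES (?, ?)', rows
)

total, distinct, redundant = count_duplicates(conn)
print(f"total={total} distinct={distinct} redundant={redundant}")
```

With the sample rows above, two of the four rows are redundant copies. Comparing the total row count with the distinct payload count on a real deployment gives a rough measure of how much of the table size comes from repeated user records.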