narrowizard opened a new issue, #8523:
URL: https://github.com/apache/incubator-devlake/issues/8523

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   We integrated a GitLab data source with over 2000 repositories into Apache 
DevLake. After the data collection pipeline had run for some time, we observed 
that the `_raw_gitlab_api_users` table had grown unexpectedly large, 
reaching approximately 21 GB. Querying the table showed a large 
number of duplicate user records. This suggests that the raw layer 
data cleaning process for GitLab API users might not be functioning effectively.
   
   ### What do you expect to happen
   
   We expect the `_raw_gitlab_api_users` table to contain unique user records 
without excessive duplicates. The table size should be reasonable and reflect 
the actual number of unique users, not inflated by redundant entries. The raw 
layer data cleaning mechanism should correctly deduplicate user data ingested 
from GitLab.
   
   ### How to reproduce
   
   1.  Integrate a GitLab data source in Apache DevLake.
   2.  Configure the data source to collect data from a large number of 
repositories (e.g., 2000+).
   3.  Run the data collection pipeline for an extended period (e.g., several 
days or weeks).
   4.  Monitor the size of the `_raw_gitlab_api_users` table in your database.
   5.  Query the `_raw_gitlab_api_users` table to identify and count duplicate 
user records.
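
The duplicate check in step 5 can be sketched roughly as follows. This is a minimal illustration, not the project's tooling: it assumes each raw row exposes a `params` value and a `data` column holding the GitLab user payload as JSON text with an `id` field, and the helper name `count_duplicate_users` and the sample rows are hypothetical.

```python
import json
from collections import Counter


def count_duplicate_users(rows):
    """Return {(params, user_id): row_count} for keys that occur more than once.

    `rows` is an iterable of (params, data) pairs, where `data` is the raw
    JSON payload as text. The column names and the `id` field are
    assumptions, not something the issue confirms.
    """
    counts = Counter()
    for params, data in rows:
        user = json.loads(data)
        counts[(params, user.get("id"))] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical rows standing in for the result of
# `SELECT params, data FROM _raw_gitlab_api_users`.
sample_rows = [
    ('{"ConnectionId":1,"ProjectId":10}', '{"id": 7, "username": "alice"}'),
    ('{"ConnectionId":1,"ProjectId":10}', '{"id": 7, "username": "alice"}'),
    ('{"ConnectionId":1,"ProjectId":10}', '{"id": 8, "username": "bob"}'),
    ('{"ConnectionId":1,"ProjectId":11}', '{"id": 7, "username": "alice"}'),
]

duplicates = count_duplicate_users(sample_rows)
print(duplicates)
```

Grouping on `params` together with the user id separates a user who legitimately appears under several projects from the same user being stored repeatedly for one project. Grouping on the user id alone would instead show the total redundancy across the whole table.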
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   main
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@devlake.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
