dosubot[bot] commented on issue #8529: URL: https://github.com/apache/incubator-devlake/issues/8529#issuecomment-3166612215
<!-- Answer --> The _raw_gitlab_api_users table grows large and accumulates duplicates because DevLake collects user data for every repository, especially on GitLab Server. There’s no built-in deduplication, so you’ll need to manage cleanup manually or limit how often users are collected. **To deduplicate or clean up the table:** 1. **Back up your database first.** 2. For a one-time manual cleanup, you can use SQL to remove duplicates. Here’s an example for PostgreSQL (adjust table name if needed): ```sql DELETE FROM _raw_gitlab_api_users a USING _raw_gitlab_api_users b WHERE a.ctid < b.ctid AND a.id = b.id; ``` For MySQL, you might use: ```sql DELETE t1 FROM _raw_gitlab_api_users t1 INNER JOIN _raw_gitlab_api_users t2 WHERE t1.id = t2.id AND t1.primary_key_column > t2.primary_key_column; ``` Replace `primary_key_column` with your actual primary key. 3. To automate, schedule this SQL as a cron job or use your DB’s scheduler. **To limit how often users are collected:** The most effective workaround is to refactor your pipeline so user collection happens only once per pipeline run, not per repository. This involves: - Adding a first stage that collects, extracts, and converts users. - Removing duplicate user collection tasks from later stages. - This approach has been tested and shown to reduce table size and duplication significantly ([details and example](https://github.com/apache/incubator-devlake/issues/8283), [more context](https://github.com/apache/incubator-devlake/issues/8523)). There’s no config or environment variable for this in 1.0.2-beta5, but a proposed `GITLAB_SERVER_COLLECT_ALL_USERS` variable may help in future versions. Always test cleanup on a staging environment first and check for unique constraints before running destructive queries. <!-- Dosu Comment Footer --> *To reply, just mention [@dosu](https://go.dosu.dev/dosubot).* --- <sup>How did I do? [Good](https://app.dosu.dev/response-feedback/94786185-8a80-4c4b-b583-2ab4b66f7fd9?feedback_type=great_response) | [Irrelevant](https://app.dosu.dev/response-feedback/94786185-8a80-4c4b-b583-2ab4b66f7fd9?feedback_type=irrelevant_answer) | [Incorrect](https://app.dosu.dev/response-feedback/94786185-8a80-4c4b-b583-2ab4b66f7fd9?feedback_type=incorrect_sources) | [Verbose](https://app.dosu.dev/response-feedback/94786185-8a80-4c4b-b583-2ab4b66f7fd9?feedback_type=too_verbose) | [Hallucination](https://app.dosu.dev/response-feedback/94786185-8a80-4c4b-b583-2ab4b66f7fd9?feedback_type=hallucination) | [Report 🐛](https://app.dosu.dev/response-feedback/94786185-8a80-4c4b-b583-2ab4b66f7fd9?feedback_type=bug_report) | [Other](https://app.dosu.dev/response-feedback/94786185-8a80-4c4b-b583-2ab4b66f7fd9?feedback_type=other)</sup> [](https://app.dosu.dev/b4e8e847-d479-4541-83a8-d88d83fea5c9/ask?utm_source=githu b) [](https://go.dosu.dev/discord-bot) [](https://twitter.com/intent/tweet?text=%40dosu_ai%20helped%20me%20solve%20this%20issue!&url=https%3A//github.com/apache/incubator-devlake/issues/8529) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@devlake.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org