dosubot[bot] commented on issue #8529:
URL: 
https://github.com/apache/incubator-devlake/issues/8529#issuecomment-3166612215

   <!-- Answer -->
   The _raw_gitlab_api_users table grows large and accumulates duplicates 
because DevLake collects user data for every repository, especially on GitLab 
Server. There’s no built-in deduplication, so you’ll need to manage cleanup 
manually or limit how often users are collected.
   
   **To deduplicate or clean up the table:**
   1. **Back up your database first.**
   2. For a one-time manual cleanup, you can use SQL to remove duplicates. 
Here’s an example for PostgreSQL (adjust table name if needed):
   
   ```sql
   DELETE FROM _raw_gitlab_api_users a
   USING _raw_gitlab_api_users b
   WHERE
     a.ctid < b.ctid
     AND a.id = b.id;
   ```
   For MySQL, you might use:
   ```sql
   DELETE t1 FROM _raw_gitlab_api_users t1
   INNER JOIN _raw_gitlab_api_users t2
   WHERE
     t1.id = t2.id
     AND t1.primary_key_column > t2.primary_key_column;
   ```
   Replace `primary_key_column` with your actual primary key.
   
   3. To automate, schedule this SQL as a cron job or use your DB’s scheduler.
   
   **To limit how often users are collected:**
   The most effective workaround is to refactor your pipeline so user 
collection happens only once per pipeline run, not per repository. This 
involves:
   - Adding a first stage that collects, extracts, and converts users.
   - Removing duplicate user collection tasks from later stages.
   - This approach has been tested and shown to reduce table size and 
duplication significantly ([details and 
example](https://github.com/apache/incubator-devlake/issues/8283), [more 
context](https://github.com/apache/incubator-devlake/issues/8523)).
   
   There’s no config or environment variable for this in 1.0.2-beta5, but a 
proposed `GITLAB_SERVER_COLLECT_ALL_USERS` variable may help in future versions.
   
   Always test cleanup on a staging environment first and check for unique 
constraints before running destructive queries.
   
   
   <!-- Dosu Comment Footer -->
   *To reply, just mention [@dosu](https://go.dosu.dev/dosubot).*
   
   ---
   <sup>How did I do? 
[Good](https://app.dosu.dev/response-feedback/94786185-8a80-4c4b-b583-2ab4b66f7fd9?feedback_type=great_response)
 | 
[Irrelevant](https://app.dosu.dev/response-feedback/94786185-8a80-4c4b-b583-2ab4b66f7fd9?feedback_type=irrelevant_answer)
 | 
[Incorrect](https://app.dosu.dev/response-feedback/94786185-8a80-4c4b-b583-2ab4b66f7fd9?feedback_type=incorrect_sources)
 | 
[Verbose](https://app.dosu.dev/response-feedback/94786185-8a80-4c4b-b583-2ab4b66f7fd9?feedback_type=too_verbose)
 | 
[Hallucination](https://app.dosu.dev/response-feedback/94786185-8a80-4c4b-b583-2ab4b66f7fd9?feedback_type=hallucination)
 | [Report 
🐛](https://app.dosu.dev/response-feedback/94786185-8a80-4c4b-b583-2ab4b66f7fd9?feedback_type=bug_report)
 | 
[Other](https://app.dosu.dev/response-feedback/94786185-8a80-4c4b-b583-2ab4b66f7fd9?feedback_type=other)</sup>&nbsp;&nbsp;[![Chat
 with 
Dosu](https://dosu.dev/dosu-chat-badge.svg)](https://app.dosu.dev/b4e8e847-d479-4541-83a8-d88d83fea5c9/ask?utm_source=githu
 b)&nbsp;[![Join 
Discord](https://img.shields.io/badge/join-5865F2?logo=discord&logoColor=white&label=)](https://go.dosu.dev/discord-bot)&nbsp;[![Share
 on 
X](https://img.shields.io/badge/X-share-black)](https://twitter.com/intent/tweet?text=%40dosu_ai%20helped%20me%20solve%20this%20issue!&url=https%3A//github.com/apache/incubator-devlake/issues/8529)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@devlake.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to