dosubot[bot] commented on issue #8523:
URL: https://github.com/apache/incubator-devlake/issues/8523#issuecomment-3162307562

   Your suspicion is correct: the plugin uses different logic for GitLab Server 
and Cloud. For self-hosted GitLab Server, it collects all users globally via 
the `/users` endpoint once per repository (data scope), so the same user 
records are ingested repeatedly, which causes heavy duplication in the 
`_raw_gitlab_api_users` table. For GitLab Cloud (gitlab.com, jihulab.com), 
it collects users per project, so duplication is less severe but still occurs 
when users belong to multiple projects. The choice between the two paths is 
made by a simple string match on the API endpoint, which is brittle: results 
can become inconsistent if more SaaS domains are introduced or endpoint URLs 
change 
[source](https://github.com/apache/incubator-devlake/blob/2c15ad880dce830c6512fa58039527cbf4745b8b/backend/plugins/gitlab/tasks/account_collector.go).
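
To illustrate why a string match on the endpoint is fragile, here is a minimal Python sketch. It is not the plugin's Go code; the function names and the `SAAS_HOSTS` set are hypothetical, and only the two host names come from the description above. It contrasts a substring check with an exact comparison on the parsed hostname.

```python
from urllib.parse import urlparse

# Hypothetical allow-list; the two host names are the ones mentioned above.
SAAS_HOSTS = {"gitlab.com", "jihulab.com"}


def is_saas_by_substring(endpoint: str) -> bool:
    """Naive check: true whenever a SaaS name appears anywhere in the URL."""
    return any(host in endpoint for host in SAAS_HOSTS)


def is_saas_by_hostname(endpoint: str) -> bool:
    """Stricter check: compare the parsed hostname exactly."""
    host = (urlparse(endpoint).hostname or "").lower()
    return host in SAAS_HOSTS


if __name__ == "__main__":
    # A self-hosted server whose name merely starts with "gitlab.com".
    self_hosted = "https://gitlab.company.internal/api/v4"
    print(is_saas_by_substring(self_hosted))  # misclassified as SaaS
    print(is_saas_by_hostname(self_hosted))   # correctly treated as self-hosted
```

An exact hostname comparison still needs its list maintained when new SaaS domains appear, so it reduces misclassification rather than removing the underlying fragility.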
   
   For diagnostics, I suggest enabling debug logging during collection and 
capturing the actual API responses and parameters for both environments. 
Compare the number and content of user records collected from Server vs. Cloud 
to confirm the duplication pattern. Also, check if the same user IDs appear 
multiple times for different repositories in the raw table.
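
The last check can be sketched in Python as follows. This assumes each raw row has been exported as a `(params, data)` pair, where `params` identifies the data scope and `data` is the JSON user record with an `id` field; those column names and the sample rows are assumptions for illustration, so adjust them to your actual schema.

```python
import json
from collections import defaultdict


def scopes_per_user(raw_rows):
    """Map each user ID to the set of scopes (params values) that contain it."""
    seen = defaultdict(set)
    for params, data in raw_rows:
        user = json.loads(data)
        seen[user["id"]].add(params)
    return seen


def duplicated_users(raw_rows):
    """Return only the users that were collected under more than one scope."""
    return {
        user_id: sorted(scopes)
        for user_id, scopes in scopes_per_user(raw_rows).items()
        if len(scopes) > 1
    }


if __name__ == "__main__":
    # Hypothetical sample rows standing in for an export of the raw table.
    rows = [
        ('{"ProjectId": 1}', '{"id": 10, "username": "alice"}'),
        ('{"ProjectId": 2}', '{"id": 10, "username": "alice"}'),
        ('{"ProjectId": 1}', '{"id": 11, "username": "bob"}'),
    ]
    for user_id, scopes in duplicated_users(rows).items():
        print(user_id, "appears in", len(scopes), "scopes")
```

Running the same report against a Server export and a Cloud export should make the difference in duplication visible.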
   
   As an interim mitigation, you can periodically deduplicate or clean up the 
`_raw_gitlab_api_users` table with a scheduled script, or limit the frequency 
of full user data collection. For large-scale integrations, consider adapting 
the workaround proposed in [issue 
#8283](https://github.com/apache/incubator-devlake/issues/8283) to collect 
users only once per pipeline for GitLab Server, which has been shown to 
significantly reduce table size and duplication.
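
As a rough sketch of the cleanup idea, the Python function below picks which raw rows a scheduled script could remove, keeping only the newest row per user. It assumes rows are available as `(row_id, data)` pairs with an increasing numeric `row_id` and a JSON `data` payload containing `id`; these names are hypothetical, and whether deleting raw rows is safe for your pipeline should be verified before running anything like this against a real database.

```python
import json


def redundant_row_ids(raw_rows):
    """Return the IDs of rows superseded by a newer row for the same user."""
    newest = {}      # user ID -> row ID of the newest row seen so far
    redundant = []   # row IDs that a cleanup job could delete
    for row_id, data in sorted(raw_rows):
        user_id = json.loads(data)["id"]
        if user_id in newest:
            # The previously kept row is now superseded by this one.
            redundant.append(newest[user_id])
        newest[user_id] = row_id
    return redundant


if __name__ == "__main__":
    # Hypothetical sample rows standing in for the raw table contents.
    rows = [
        (1, '{"id": 10}'),
        (2, '{"id": 11}'),
        (3, '{"id": 10}'),
        (4, '{"id": 10}'),
    ]
    print(redundant_row_ids(rows))
```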
   
   If you test with different GitLab Server versions or configurations, sharing 
your findings will help refine a long-term solution for the community. No 
merged PR addresses this yet, so feedback and testing on proposed workarounds 
are valuable.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@devlake.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
