mfrancisc opened a new issue, #8947: URL: https://github.com/apache/devlake/issues/8947
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and found no similar issues.

### What happened

When the same physical repository is added to DevLake under more than one connection - which the UI currently allows without any warning - every entity collected from that repository (pull requests, issues, board associations, repo-commit links) is stored as a separate record for each connection. For example, when a repository is connected via four connections, every pull request appears four times in the pull_requests table with four different primary keys.

Any metric computed over these tables - PR count, cycle time, throughput, DORA lead time - is inflated by the number of connections pointing at the same repo.

**The UI gives no indication that this configuration will produce duplicate data**. Users following the multi-connection workaround suggested in [issue #7684](https://github.com/apache/devlake/issues/7684#issuecomment-2198868619) are silently creating corrupted metrics.

**Verification**: All duplicate records share the same url field (e.g., https://github.com/owner/repo/pull/123). Running the following query confirms the problem:

```
SELECT url, COUNT(*) as copies
FROM pull_requests
GROUP BY url
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```

### What do you expect to happen

When a user adds a repository scope that is already registered under a different connection (detected by matching _html_url_ / _clone_url_ across connections), the UI should display a clear warning before the user saves, for example:

"_This repository is already connected via Connection 'GitHub Production'. Collecting it here will create duplicate pull requests and issue records, which will inflate all metrics for this repository._"

The warning should not block the action - there are legitimate reasons to have the same repository under multiple connections (different scope configs, different team tokens). But the user should be able to make an informed choice.

Additionally, a backend diagnostics endpoint would help existing installations detect the problem:

```
GET /api/scope-duplicates
```

It returns a list of repository URLs that appear under more than one connection, along with the affected connection IDs, so administrators can audit and clean up existing configurations.

### How to reproduce

1. Add the same GitHub repository to DevLake under two different connections.
2. Run blueprints for both connections.
3. Query pull_requests grouped by url - every PR will appear twice.
4. Note that at no point during configuration does the UI warn about this.

### Anything else

### Proposed implementation

**Backend** - one new API handler that queries _tool_github_repos (and equivalent tables for other plugins) grouped by html_url, returning repos that appear under more than one connection: `GET /api/plugins/github/scope-duplicates`

**Config-UI** - when a user selects a repository scope in the blueprint or connection wizard, call the endpoint and render a dismissible warning banner if the selected repo URL is already registered elsewhere.

**Additional context**

This issue affects all data-source plugins that support multiple connections to the same platform instance (GitHub, GitLab, Bitbucket, etc.).

A related workaround exists: deduplicating views over the domain tables using url as a natural key. We are willing to contribute that as a stopgap alongside the UI fix if it would be useful to the project (see https://github.com/konflux-ci/devlake/pull/106).

### Version

main

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
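The lookup behind the proposed scope-duplicates endpoint could be sketched roughly as follows. This is only an illustration in Python against an in-memory SQLite database, not DevLake's actual backend code: the function names `find_duplicate_scopes` and `demo` are hypothetical, and the column set assumed for `_tool_github_repos` (`connection_id`, `github_id`, `html_url`) is a guess based on the description above.

```python
import sqlite3


def find_duplicate_scopes(conn):
    """Map each html_url registered under 2+ connections to its connection IDs.

    Hypothetical helper: table and column names are assumptions taken from
    the issue text, not verified against the real DevLake schema.
    """
    rows = conn.execute(
        """
        SELECT html_url, connection_id
        FROM _tool_github_repos
        WHERE html_url IN (
            SELECT html_url
            FROM _tool_github_repos
            GROUP BY html_url
            HAVING COUNT(DISTINCT connection_id) > 1
        )
        ORDER BY html_url, connection_id
        """
    ).fetchall()

    duplicates = {}
    for html_url, connection_id in rows:
        duplicates.setdefault(html_url, []).append(connection_id)
    return duplicates


def demo():
    """Build a tiny example table and run the duplicate lookup on it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE _tool_github_repos (
            connection_id INTEGER,
            github_id     INTEGER,
            html_url      TEXT,
            PRIMARY KEY (connection_id, github_id)
        )
        """
    )
    conn.executemany(
        "INSERT INTO _tool_github_repos VALUES (?, ?, ?)",
        [
            (1, 100, "https://github.com/owner/repo"),   # same repo ...
            (2, 100, "https://github.com/owner/repo"),   # ... second connection
            (1, 200, "https://github.com/owner/other"),  # registered only once
        ],
    )
    return find_duplicate_scopes(conn)
```

Calling `demo()` returns only the repository that is registered under both connections, together with the affected connection IDs, which is the shape of response the issue asks the endpoint to provide. Counting distinct `connection_id` values rather than rows avoids flagging a repo that merely appears twice within a single connection.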
