mfrancisc opened a new issue, #8947:
URL: https://github.com/apache/devlake/issues/8947

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   When the same physical repository is added to DevLake under more than one 
connection - which the UI currently allows without any warning - every entity 
collected from that repository (pull requests, issues, board associations, 
repo-commit links) is stored as a separate record for each connection.
   For example, when a repository is connected via four connections, every pull 
request appears four times in the pull_requests table with four different primary keys. 
Any metric computed over these tables - PR count, cycle time, throughput, DORA 
lead time - is inflated by the number of connections pointing at the same repo.
   **The UI gives no indication that this configuration will produce duplicate 
data**. Users following the multi-connection workaround suggested in 
[issue#7684](https://github.com/apache/devlake/issues/7684#issuecomment-2198868619)
 are unknowingly producing corrupted metrics.
   
   **Verification**: All duplicate records share the same url field (e.g., 
https://github.com/owner/repo/pull/123). Running the following query confirms 
the problem:
   
   
   ```
   SELECT url, COUNT(*) as copies
   FROM pull_requests
   GROUP BY url
   HAVING COUNT(*) > 1
   ORDER BY copies DESC;
   ```
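
To make the inflation concrete, here is a small self-contained sketch that runs the query above against an in-memory SQLite database. The two-column table and the `conn1:pr0`-style ids are simplified stand-ins invented for illustration, not the real DevLake schema or id format.

```python
import sqlite3

# Toy stand-in for the domain-layer pull_requests table (assumption: the real
# table has many more columns; only id and url matter for this demonstration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pull_requests (id TEXT PRIMARY KEY, url TEXT)")

urls = [f"https://github.com/owner/repo/pull/{n}" for n in (121, 122, 123)]
connection_ids = (1, 2)  # the same repository registered under two connections

# One row per (connection, PR): identical url, different primary key.
rows = [(f"conn{c}:pr{i}", url) for c in connection_ids for i, url in enumerate(urls)]
conn.executemany("INSERT INTO pull_requests (id, url) VALUES (?, ?)", rows)

# A naive PR count is multiplied by the number of connections.
naive_count = conn.execute("SELECT COUNT(*) FROM pull_requests").fetchone()[0]
true_count = conn.execute("SELECT COUNT(DISTINCT url) FROM pull_requests").fetchone()[0]

# The verification query from the report.
duplicates = conn.execute(
    """
    SELECT url, COUNT(*) AS copies
    FROM pull_requests
    GROUP BY url
    HAVING COUNT(*) > 1
    ORDER BY copies DESC
    """
).fetchall()

print(naive_count, true_count)  # 6 3
```

With two connections, three real pull requests are counted as six, and every url is reported with `copies = 2`.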
   
   
   ### What do you expect to happen
   
   When a user adds a repository scope that is already registered under a 
different connection (detected by matching _html_url_ / _clone_url_ across 
connections), the UI should display a clear warning before the user saves, for 
example:
   "_This repository is already connected via Connection 'GitHub Production'. 
Collecting it here will create duplicate pull requests and issue records, which 
will inflate all metrics for this repository._"
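
The matching step could look roughly like the sketch below. It is only an illustration: `normalize_repo_url` and `duplicate_warning` are hypothetical helper names, the normalization rules (dropping a trailing `.git`, ignoring case) are assumptions that fit GitHub-style HTTPS URLs, and SSH-style clone URLs are not handled.

```python
from typing import Dict, List, Optional
from urllib.parse import urlsplit


def normalize_repo_url(url: str) -> str:
    """Reduce an html_url or HTTPS clone_url to a comparable 'host/path' key.

    Assumption: owner/repo paths are case-insensitive and a trailing '.git'
    or '/' carries no meaning.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return f"{(parts.hostname or '').lower()}{path.lower()}"


def duplicate_warning(new_url: str, existing: Dict[str, List[str]]) -> Optional[str]:
    """Return a warning if new_url is already registered under another connection.

    `existing` maps a connection name to the repo URLs already scoped under it
    (a hypothetical in-memory shape, not a DevLake API).
    """
    key = normalize_repo_url(new_url)
    for connection_name, urls in existing.items():
        if any(normalize_repo_url(u) == key for u in urls):
            return (
                f"This repository is already connected via Connection '{connection_name}'. "
                "Collecting it here will create duplicate pull requests and issue records, "
                "which will inflate all metrics for this repository."
            )
    return None


existing = {"GitHub Production": ["https://github.com/owner/repo"]}
print(duplicate_warning("https://github.com/owner/repo.git", existing))
```

Returning `None` when there is no match keeps the check advisory: the caller decides whether to show a banner and never blocks the save.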
   
   
   The warning should not block the action - there are legitimate reasons to 
have the same repository under multiple connections (different scope configs, 
different team tokens). But the user should be able to make an informed choice.
   
   Additionally, a backend diagnostics endpoint would help existing 
installations detect the problem:
   ```
   GET /api/scope-duplicates
   ```
   
   Returns a list of repository URLs that appear under more than one 
connection, along with the affected connection IDs, so administrators can audit 
and clean up existing configurations.
   
   ### How to reproduce
   
   1. Add the same GitHub repository to DevLake under two different connections.
   2. Run blueprints for both connections.
   3. Query pull_requests grouped by url - every PR will appear twice.
   4. Note that at no point during configuration does the UI warn about this.
   
   
   ### Anything else
   
   ### Proposed implementation
   
   **Backend** - one new API handler that queries _tool_github_repos (and 
equivalent tables for other plugins) grouped by html_url, returning repos that 
appear under more than one connection:
   
   `GET /api/plugins/github/scope-duplicates`
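
A rough sketch of the query such a handler might run is shown below, using an in-memory SQLite table in place of the real database. The three-column table shape, the `connection_id` column name, the `scope_duplicates` function, and the `url` / `connectionIds` response keys are all assumptions made for illustration; an actual handler would live in the Go backend.

```python
import sqlite3

# Assumed minimal shape of _tool_github_repos: only the columns this query needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE _tool_github_repos (connection_id INTEGER, github_id INTEGER, html_url TEXT)"
)
conn.executemany(
    "INSERT INTO _tool_github_repos VALUES (?, ?, ?)",
    [
        (1, 100, "https://github.com/owner/repo"),   # registered under connection 1
        (2, 100, "https://github.com/owner/repo"),   # ...and again under connection 2
        (2, 200, "https://github.com/owner/other"),  # registered once only
    ],
)


def scope_duplicates(db: sqlite3.Connection) -> list:
    """List repo URLs that appear under more than one connection."""
    rows = db.execute(
        """
        SELECT html_url, connection_id
        FROM _tool_github_repos
        WHERE html_url IN (
            SELECT html_url
            FROM _tool_github_repos
            GROUP BY html_url
            HAVING COUNT(DISTINCT connection_id) > 1
        )
        ORDER BY html_url, connection_id
        """
    ).fetchall()
    grouped = {}
    for html_url, connection_id in rows:
        grouped.setdefault(html_url, []).append(connection_id)
    # Hypothetical response shape: one entry per duplicated URL.
    return [{"url": url, "connectionIds": ids} for url, ids in grouped.items()]


result = scope_duplicates(conn)
print(result)
```

Counting distinct `connection_id` values, rather than rows, means a repository is reported only when it really spans several connections.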
   
   
   **Config-UI** - when a user selects a repository scope in the blueprint or 
connection wizard, call the endpoint and render a dismissible warning banner if 
the selected repo URL is already registered elsewhere.
   
   **Additional context**
   This issue affects all data-source plugins that support multiple connections 
to the same platform instance (GitHub, GitLab, Bitbucket, etc.).
   A related workaround exists: deduplicating views over the domain tables 
using url as a natural key. We are willing to contribute that as a stopgap 
alongside the UI fix if it would be useful to the project (see 
https://github.com/konflux-ci/devlake/pull/106).
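
As a rough idea of what a deduplicating view keyed on url could look like, the sketch below keeps one row per url in an in-memory SQLite table. It is not taken from the linked PR: the view name `pull_requests_dedup`, the simplified columns, and the "keep the smallest id" rule are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the domain table; ids are invented for illustration.
conn.execute("CREATE TABLE pull_requests (id TEXT PRIMARY KEY, url TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO pull_requests VALUES (?, ?, ?)",
    [
        ("conn1:pr123", "https://github.com/owner/repo/pull/123", "Fix bug"),
        ("conn2:pr123", "https://github.com/owner/repo/pull/123", "Fix bug"),
        ("conn1:pr124", "https://github.com/owner/repo/pull/124", "Add feature"),
    ],
)

# Keep exactly one row per url: the one with the smallest id (arbitrary but stable).
conn.execute(
    """
    CREATE VIEW pull_requests_dedup AS
    SELECT *
    FROM pull_requests
    WHERE id IN (SELECT MIN(id) FROM pull_requests GROUP BY url)
    """
)

deduped = conn.execute(
    "SELECT id, url FROM pull_requests_dedup ORDER BY url"
).fetchall()
print(len(deduped))  # 2
```

A view like this only hides the extra rows. Tables that reference the discarded ids would need the same treatment, and rows with a NULL url would collapse into a single group, so they would have to be handled separately.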
   
   
   ### Version
   
   main
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

