JAORMX opened a new issue, #8886:
URL: https://github.com/apache/incubator-devlake/issues/8886

   The GitHub plugin writes `issues.creator_id` and `pull_requests.author_id` 
for every author it sees, but only creates the corresponding `accounts` row 
when that user has also committed code to the tracked repo. Anyone who only 
filed an issue, or who opened a PR without commits to the repo, ends up with a 
foreign key that points to nothing. A `LEFT JOIN accounts ON issues.creator_id 
= accounts.id` returns NULL even though the login is sitting one column over in 
`issues.creator_name`.
   
   ### Reproduction
   
   1. Configure the GitHub plugin against a public repo with external issue 
traffic. `stacklok/toolhive` is a convenient one.
   2. Pick an issue filed by someone with no commits in that repo. In 
`stacklok/toolhive`, `#4297` (filed by `milichev`), `#4301`, `#4359`, and 
`#4413` all qualify.
   3. After a full sync, query the domain layer:
   
   ```sql
   SELECT i.url, i.creator_id, i.creator_name, a.user_name
     FROM issues i
     LEFT JOIN accounts a ON a.id = i.creator_id
    WHERE i.url = 'https://github.com/stacklok/toolhive/issues/4297';
   ```
   
   `creator_id` is populated, `creator_name` is `milichev`, and `a.user_name` 
is NULL. Expected: `a.user_name` is `milichev`. The same shape affects 
`pull_requests.author_id` for PRs from non-committer authors.
   
   ### Root cause
   
   `backend/plugins/github/tasks/issue_extractor.go` emits a 
`GithubRepoAccount` for `body.User` via `convertAccount(...)` but does not emit 
a `GithubAccount` row. `_tool_github_accounts` is populated separately by 
`account_collector.go`, which only iterates 
`/repos/{owner}/{repo}/contributors`. Users who never committed never land in 
`_tool_github_accounts`.
   
   `account_convertor.go::ConvertAccounts` converts only what is in 
`_tool_github_accounts`:
   
   ```go
   dal.Select("_tool_github_accounts.*"),
   dal.From(&models.GithubAccount{}),
   dal.Where("repo_github_id = ? and _tool_github_accounts.connection_id=?", 
...),
   dal.Join(`left join _tool_github_repo_accounts gra on (...)`),
   ```
   
   The `LEFT JOIN _tool_github_repo_accounts` with a `WHERE` on the right-table 
column behaves like an inner join, so users present only in 
`_tool_github_repo_accounts` (the rows `issue_extractor` inserts via 
`convertAccount`) get filtered out.
   
   `issue_convertor.go::ConvertIssues` writes `domainIssue.CreatorId = 
accountIdGen.Generate(connId, issue.AuthorId)` unconditionally for any non-zero 
AuthorId, producing the orphan FK. `pr_extractor.go` has the same pattern, with 
the same outcome for PR authors and merged-by users.
   
   The denormalized columns `issues.creator_name` and 
`pull_requests.author_name` are populated correctly from the API response, so 
the login is recoverable. Any consumer that joins to `accounts` for author 
attribution will silently under-attribute external contributors. Bot filters 
that key on `accounts.user_name LIKE '%[bot]'` will miss bot-filed issues for 
the same reason.
   
   Reproduced on `v1.0.3-beta10`. The same code paths exist on `main`; the 
recent refactor of `ConvertAccounts` from `NewDataConverter` to 
`NewStatefulDataConverter` did not change the filter. Likely present in every 
prior release.
   
   ### Suggested fix
   
   Have the issue and PR extractors emit a `GithubAccount` alongside the 
`GithubRepoAccount`. In `issue_extractor.go`, where `body.User` is processed:
   
   ```go
   if body.User != nil {
       githubIssue.AuthorId = body.User.Id
       githubIssue.AuthorName = body.User.Login
       repoUser, err := convertAccount(body.User, data.Options.GithubId, 
data.Options.ConnectionId)
       if err != nil { return nil, err }
       results = append(results, repoUser)
   
       results = append(results, &models.GithubAccount{
           ConnectionId: data.Options.ConnectionId,
           Id:           body.User.Id,
           Login:        body.User.Login,
       })
   }
   ```
   
   Same change in `pr_extractor.go` for `body.User` and `body.MergedBy`. The 
upsert into `_tool_github_accounts` is idempotent: if the user later shows up 
as a contributor, `account_extractor` overwrites with the richer record. The 
alternative is to rewrite the `ConvertAccounts` query to UNION in author IDs 
from `_tool_github_issues` and `_tool_github_pull_requests`, which is larger 
and changes the contract of that pipeline.
   
   ### Workaround for consumers
   
   `COALESCE(NULLIF(a.user_name, ''), i.creator_name)` for issues, or 
`pr.author_name` for PRs. The denormalized columns are populated before the 
orphan-FK chain runs, so the login is always there.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to