JAORMX opened a new issue, #8886:
URL: https://github.com/apache/incubator-devlake/issues/8886
The GitHub plugin writes `issues.creator_id` and `pull_requests.author_id`
for every author it sees, but only creates the corresponding `accounts` row
when that user has also committed code to the tracked repo. Anyone who only
filed an issue, or who opened a PR without commits to the repo, ends up with a
foreign key that points to nothing. A `LEFT JOIN accounts ON issues.creator_id
= accounts.id` returns NULL even though the login is sitting one column over in
`issues.creator_name`.
### Reproduction
1. Configure the GitHub plugin against a public repo with external issue
traffic. `stacklok/toolhive` is a convenient one.
2. Pick an issue filed by someone with no commits in that repo. In
`stacklok/toolhive`, `#4297` (filed by `milichev`), `#4301`, `#4359`, and
`#4413` all qualify.
3. After a full sync, query the domain layer:
```sql
SELECT i.url, i.creator_id, i.creator_name, a.user_name
FROM issues i
LEFT JOIN accounts a ON a.id = i.creator_id
WHERE i.url = 'https://github.com/stacklok/toolhive/issues/4297';
```
`creator_id` is populated, `creator_name` is `milichev`, and `a.user_name`
is NULL. Expected: `a.user_name` is `milichev`. The same shape affects
`pull_requests.author_id` for PRs from non-committer authors.
### Root cause
`backend/plugins/github/tasks/issue_extractor.go` emits a
`GithubRepoAccount` for `body.User` via `convertAccount(...)` but does not emit
a `GithubAccount` row. `_tool_github_accounts` is populated separately by
`account_collector.go`, which only iterates
`/repos/{owner}/{repo}/contributors`. Users who never committed never land in
`_tool_github_accounts`.
`account_convertor.go::ConvertAccounts` converts only what is in
`_tool_github_accounts`:
```go
dal.Select("_tool_github_accounts.*"),
dal.From(&models.GithubAccount{}),
dal.Where("repo_github_id = ? and _tool_github_accounts.connection_id=?",
...),
dal.Join(`left join _tool_github_repo_accounts gra on (...)`),
```
The `LEFT JOIN _tool_github_repo_accounts` with a `WHERE` on the right-table
column behaves like an inner join, so users present only in
`_tool_github_repo_accounts` (the rows `issue_extractor` inserts via
`convertAccount`) get filtered out.
`issue_convertor.go::ConvertIssues` writes `domainIssue.CreatorId =
accountIdGen.Generate(connId, issue.AuthorId)` unconditionally for any non-zero
AuthorId, producing the orphan FK. `pr_extractor.go` has the same pattern, with
the same outcome for PR authors and merged-by users.
The denormalized columns `issues.creator_name` and
`pull_requests.author_name` are populated correctly from the API response, so
the login is recoverable. Any consumer that joins to `accounts` for author
attribution will silently under-attribute external contributors. Bot filters
that key on `accounts.user_name LIKE '%[bot]'` will miss bot-filed issues for
the same reason.
Reproduced on `v1.0.3-beta10`. The same code paths exist on `main`; the
recent refactor of `ConvertAccounts` from `NewDataConverter` to
`NewStatefulDataConverter` did not change the filter. Likely present in every
prior release.
### Suggested fix
Have the issue and PR extractors emit a `GithubAccount` alongside the
`GithubRepoAccount`. In `issue_extractor.go`, where `body.User` is processed:
```go
if body.User != nil {
githubIssue.AuthorId = body.User.Id
githubIssue.AuthorName = body.User.Login
repoUser, err := convertAccount(body.User, data.Options.GithubId,
data.Options.ConnectionId)
if err != nil { return nil, err }
results = append(results, repoUser)
results = append(results, &models.GithubAccount{
ConnectionId: data.Options.ConnectionId,
Id: body.User.Id,
Login: body.User.Login,
})
}
```
Same change in `pr_extractor.go` for `body.User` and `body.MergedBy`. The
upsert into `_tool_github_accounts` is idempotent: if the user later shows up
as a contributor, `account_extractor` overwrites with the richer record. The
alternative is to rewrite the `ConvertAccounts` query to UNION in author IDs
from `_tool_github_issues` and `_tool_github_pull_requests`, which is larger
and changes the contract of that pipeline.
### Workaround for consumers
`COALESCE(NULLIF(a.user_name, ''), i.creator_name)` for issues, or
`pr.author_name` for PRs. The denormalized columns are populated before the
orphan-FK chain runs, so the login is always there.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]