lukasgomez opened a new issue, #2121: URL: https://github.com/apache/incubator-devlake/issues/2121
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and found no similar issues.

### What happened

While exploring the structure of the database and the content of the tables after scanning a few repositories, we noticed that the column `author_id` in the table `commits` contains the same value as the column `author_email`. We compared this with another table: `pull_requests` also has a column called `author_id`, but its content is the unique GitHub id of the user who created the pull request.

### What you expected to happen

We expected `author_id` in `commits` to have the same value as `author_id` in `pull_requests` (e.g. `github:GithubUser:11111111`). As designed, the `commits` table contains two columns with the same values, so the data is duplicated within the table, which is not an ideal design.

Having this unique identifier in the `commits` table would help to create Grafana dashboards with more accurate metrics. Without a unique `author_id` in the `commits` table, it is not possible to obtain all the commits of an individual, because each combination of email + display name currently counts as a different committer.

Actual behavior:

| author_name | author_email | author_id |
| ------------- | ------------- | ------------- |
| Jon Doe | [email protected] | [email protected] |
| Jon Doe | [email protected] | [email protected] |

(Both emails belong to the same person)

Expected behavior:

| author_name | author_email | author_id |
| ------------- | ------------- | ------------- |
| Jon Doe | [email protected] | github:GithubUser:11111111 |
| Jon Doe | [email protected] | github:GithubUser:11111111 |

### How to reproduce

1. Have a GitHub connection configured with a token.
2. Go to **Pipelines > Create Pipeline Run**.
3. Click on **Create Pipeline Run**.
4. Scroll down until '**Github**' is shown in the Data Providers list.
5. **Toggle on** the GitHub data provider.
6. Enter the repository owner and name for a repository that contains a few commits and pull requests created by different users.
7. Click on '**Run Pipeline**'.
8. Once the registration of the repository has finished, go to **Pipelines > Create Pipeline Run**.
9. Click on **Create Pipeline Run**.
10. Scroll down until the '**Advanced Mode**' option appears at the bottom.
11. Click on '**Advanced Mode**'.
12. Create a task in the task editor to launch a GitHub Extractor Task.
13. Use the following JSON: `[ [ { "Plugin": "gitextractor", "Options": { "url": "URL of the repository registered in the previous step, ending with .git", "repoId": "GitHub repository id. It looks like -> github:GithubRepo:384111310", "user": "Name of the user who owns the GitHub token", "password": "GitHub token" } } ] ]`
14. Click on '**Run Pipeline**'.
15. Once the scan has finished, connect to the database.
16. Use the table `commits`.
17. Execute the query: `SELECT author_email, author_id FROM lake.commits;`

### Anything else

We are not sure about the exact version of lake we are using, because we are working with the fork of [MericoDev](https://hub.docker.com/layers/lake/mericodev/lake/20220523/images/sha256-b2210658d04[%E2%80%A6]ea0ae55f865c8ad520b628e9a94a76d492045097524bc?context=explore).

### Version

0.10.0

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
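
As an illustration of the grouping problem reported in this issue, here is a minimal sketch using Python's built-in `sqlite3` module. It is not DevLake code: the in-memory table only mimics the three `commits` columns discussed above, and the two email addresses are hypothetical placeholders. It shows that counting distinct `author_id` values yields two "authors" when the id mirrors the email, and one when the id is the account identifier.

```python
import sqlite3


def count_authors(rows):
    """Count distinct author_id values in a throwaway commits table."""
    conn = sqlite3.connect(":memory:")
    # Only the three columns discussed in the issue; the real table has more.
    conn.execute(
        "CREATE TABLE commits (author_name TEXT, author_email TEXT, author_id TEXT)"
    )
    conn.executemany("INSERT INTO commits VALUES (?, ?, ?)", rows)
    (n,) = conn.execute("SELECT COUNT(DISTINCT author_id) FROM commits").fetchone()
    conn.close()
    return n


# Hypothetical emails for one person who commits from two addresses.
actual = [
    ("Jon Doe", "jon@example.com", "jon@example.com"),
    ("Jon Doe", "jon.doe@example.org", "jon.doe@example.org"),
]
expected = [
    ("Jon Doe", "jon@example.com", "github:GithubUser:11111111"),
    ("Jon Doe", "jon.doe@example.org", "github:GithubUser:11111111"),
]

print(count_authors(actual))    # author_id mirrors the email
print(count_authors(expected))  # author_id is the account id
```

With email-valued ids, any per-author dashboard metric would be split across the two addresses; with the account id, both commits would be attributed to the same person.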
