lukasgomez opened a new issue, #2121:
URL: https://github.com/apache/incubator-devlake/issues/2121

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/incubator-devlake/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   While exploring the database after scanning a few repositories, we noticed 
that the column `author_id` in the `commits` table holds the same value as the 
column `author_email`. By comparison, the `pull_requests` table also has a 
column called `author_id`, but there it holds the unique GitHub id of the user 
who created the pull request. 
   
   
   
   ### What you expected to happen
   
   We expected `author_id` in `commits` to hold the same value as `author_id` 
in `pull_requests` (e.g. github:GithubUser:11111111). As designed, the 
`commits` table has two columns with identical values, which is redundant. 
Having this unique identifier in `commits` would help us build Grafana 
dashboards with more accurate metrics.
   
   Without a unique `author_id` in the `commits` table, it is not possible to 
retrieve all the commits of one individual, because each combination of email 
and display name currently counts as a different committer. 
   
   Actual behavior:
   | author_name | author_email               | author_id                     |
   | ------------- | ------------- | ------------- |
   | Jon Doe         | [email protected]    | [email protected]    |
   | Jon Doe         | [email protected] | [email protected] |
   
   (Both emails belong to the same person)
   
   Expected behavior:
   | author_name | author_email               | author_id |
   | ------------- | ------------- | ------------- |
   | Jon Doe         | [email protected]    | github:GithubUser:11111111 |
   | Jon Doe         | [email protected] | github:GithubUser:11111111 |
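
To illustrate the difference between the two tables above, here is a minimal sketch of how the grouping key changes the committer count. The sample emails, commit hashes and the `email_to_account` lookup are hypothetical placeholders, not data from our database and not part of DevLake:

```python
from collections import defaultdict

# Hypothetical sample rows: two commits by the same person, made with
# two different email addresses (placeholder values).
commits = [
    {"sha": "a1", "author_name": "Jon Doe", "author_email": "jon@work.example"},
    {"sha": "b2", "author_name": "Jon Doe", "author_email": "jon@home.example"},
]

# Hypothetical lookup from commit email to the account id format that
# pull_requests.author_id uses. How DevLake could build this is not shown.
email_to_account = {
    "jon@work.example": "github:GithubUser:11111111",
    "jon@home.example": "github:GithubUser:11111111",
}


def group_by_author(rows, key_fn):
    """Group commit hashes by whatever key_fn returns for each row."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row["sha"])
    return dict(groups)


# Actual behavior: author_id equals author_email, so one person is split.
by_email = group_by_author(commits, lambda r: r["author_email"])

# Expected behavior: author_id is the account id, so the commits merge.
by_account = group_by_author(
    commits, lambda r: email_to_account[r["author_email"]]
)

print(len(by_email), "committers when keyed by email")
print(len(by_account), "committer when keyed by account id")
```

Keyed by email, the same person shows up as two committers; keyed by the account id, both commits fall under a single author, which is what a per-person Grafana panel would need.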
   
   ### How to reproduce
   
   1. Have a GitHub connection configured with a token
   2. Go to **Pipelines > Create Pipeline Run**.
   3. Click on **Create Pipeline Run**.
   4. Scroll down until '**Github**' is shown in the Data Providers list.
   5. **Toggle on** the GitHub data provider.
   6. Enter the owner and name of a repository that contains a few commits 
and pull requests created by different users.
   7. Click on '**Run Pipeline**'
   8. Once the repository has been registered, go to **Pipelines > 
Create Pipeline Run**.
   9. Click on **Create Pipeline Run**
   10. Scroll down until the '**Advanced Mode**' option appears at the bottom.
   11. Click on '**Advanced Mode**'.
   12. Create a task in the task editor to launch a GitHub Extractor Task.
   13. Use the following JSON: `[
     [
       {
         "Plugin": "gitextractor",
         "Options": {
           "url": "URL of the repository registered in the previous step, ending with .git",
           "repoId": "GitHub repository id, e.g. github:GithubRepo:384111310",
           "user": "Name of the user who owns the GitHub token",
           "password": "GitHub token"
         }
       }
     ]
   ]`
   14. Click on '**Run Pipeline**'
   15. Once the scan has finished, connect to the database.
   16. Use the table `commits`
   17. Execute the query: `SELECT author_email, author_id FROM lake.commits;`
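
The check in the last step can also be expressed as a count. The following is a minimal sketch that uses an in-memory SQLite database as a stand-in for the real `lake` database; the reduced column set and the sample rows are assumptions made only so the example runs on its own:

```python
import sqlite3

# In-memory stand-in for the real database (assumption: only the columns
# discussed in this issue are modeled).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE commits ("
    "sha TEXT PRIMARY KEY, author_name TEXT, author_email TEXT, author_id TEXT)"
)

# Placeholder rows that mimic the behavior reported in this issue:
# author_id carries the same value as author_email.
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?, ?)",
    [
        ("a1", "Jon Doe", "jon@work.example", "jon@work.example"),
        ("b2", "Jon Doe", "jon@home.example", "jon@home.example"),
    ],
)

# Count all rows and the rows where author_id just repeats author_email.
# The comparison evaluates to 1 or 0, so SUM counts the matching rows.
total, same = conn.execute(
    "SELECT COUNT(*), SUM(author_id = author_email) FROM commits"
).fetchone()

print(f"{same} of {total} commits have author_id equal to author_email")
conn.close()
```

Against the real database, the `SELECT` statement alone (with `lake.commits` as the table) should be enough; if every row matches, `author_id` carries no information beyond `author_email`.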
   
    
   
   
   
   
   ### Anything else
   
   We are not sure which exact version of lake we are running, because we 
are working with the fork of 
[MericoDev](https://hub.docker.com/layers/lake/mericodev/lake/20220523/images/sha256-b2210658d04[%E2%80%A6]ea0ae55f865c8ad520b628e9a94a76d492045097524bc?context=explore).
 
    
   
   ### Version
   
   0.10.0
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

