Re: [I] [Bug][GitHub Data source] pull_requests table does not populate all the repos added to the project [incubator-devlake]

via GitHub Fri, 24 May 2024 00:46:42 -0700


dosubot[bot] commented on issue #7508:
URL: https://github.com/apache/incubator-devlake/issues/7508#issuecomment-2128826930


   The `pull_requests` table may be missing some of the project's repositories 
after a successful pipeline run because each `CollectPrs` call collects pull 
requests for a single repository, identified by the `ConnectionId` and `Name` 
fields of the `GithubApiParams` struct. If these fields are set incorrectly, or 
if a newly added repository is not part of the scope they define, that 
repository's pull requests are never collected and never reach the table.
   
   Additionally, `CollectPrs` collects pull requests through a GraphQL query 
that orders results by `CREATED_AT` and uses cursor pagination. This approach 
can miss pull requests that are updated while collection is in progress. The 
missed pull requests are picked up by the next run, but the table can be 
incomplete until then.
   
   Here is the relevant code snippet from the `CollectPrs` function:
   
   ```go
   func CollectPrs(taskCtx plugin.SubTaskContext) errors.Error {
       data := taskCtx.GetData().(*tasks.GithubTaskData)
       var err errors.Error
       apiCollector, err := api.NewStatefulApiCollector(api.RawDataSubTaskArgs{
           Ctx: taskCtx,
           Params: tasks.GithubApiParams{
               ConnectionId: data.Options.ConnectionId,
               Name:         data.Options.Name,
           },
           Table: RAW_PRS_TABLE,
       })
       if err != nil {
           return err
       }
   
       err = apiCollector.InitGraphQLCollector(api.GraphqlCollectorArgs{
           GraphqlClient: data.GraphqlClient,
           PageSize:      10,
           BuildQuery: func(reqData *api.GraphqlRequestData) (interface{}, map[string]interface{}, error) {
               query := &GraphqlQueryPrWrapper{}
               if reqData == nil {
                   return query, map[string]interface{}{}, nil
               }
               ownerName := strings.Split(data.Options.Name, "/")
               variables := map[string]interface{}{
                   "pageSize":   graphql.Int(reqData.Pager.Size),
                   "skipCursor": (*graphql.String)(reqData.Pager.SkipCursor),
                   "owner":      graphql.String(ownerName[0]),
                   "name":       graphql.String(ownerName[1]),
               }
               return query, variables, nil
           },
           GetPageInfo: func(iQuery interface{}, args *api.GraphqlCollectorArgs) (*api.GraphqlQueryPageInfo, error) {
               query := iQuery.(*GraphqlQueryPrWrapper)
               return query.Repository.PullRequests.PageInfo, nil
           },
           ResponseParser: func(iQuery interface{}, variables map[string]interface{}) ([]interface{}, error) {
               query := iQuery.(*GraphqlQueryPrWrapper)
               prs := query.Repository.PullRequests.Prs
               for _, rawL := range prs {
                   if apiCollector.GetSince() != nil && !apiCollector.GetSince().Before(rawL.CreatedAt) {
                       return nil, api.ErrFinishCollect
                   }
               }
               return nil, nil
           },
       })
       if err != nil {
           return err
       }
   
       return apiCollector.Execute()
   }
   ```
   
   Make sure `ConnectionId` and `Name` are set correctly for every repository 
you want to track. Also keep the limitations of the current pagination and 
ordering strategy in mind: if complete data is needed right after a run, the 
collection logic may need to change so that all relevant pull requests are 
captured 
<sup>[[1]](https://github.com/apache/incubator-devlake/blob/main/backend/plugins/github/tasks/pr_extractor.go)</sup><sup>[[2]](https://github.com/apache/incubator-devlake/blob/main/backend/python/plugins/azuredevops/azuredevops/streams/pull_requests.py)</sup><sup>[[3]](https://github.com/apache/incubator-devlake/blob/main/backend/plugins/github_graphql/tasks/pr_collector.go)</sup>.
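
One way to check a `Name` value before a run is to validate its shape. `CollectPrs` splits `data.Options.Name` on `/` and indexes both parts without a length check, so the sketch below assumes the expected form is `owner/name`. The `splitRepoName` helper is hypothetical and not part of DevLake. It returns an error for malformed input instead of panicking on an out-of-range index.

```go
package main

import (
	"fmt"
	"strings"
)

// splitRepoName splits an "owner/name" string into its two parts.
// It is a hypothetical helper: it rejects input that does not have exactly
// two non-empty segments rather than indexing the slice blindly.
func splitRepoName(fullName string) (owner, name string, err error) {
	parts := strings.Split(fullName, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("expected \"owner/name\", got %q", fullName)
	}
	return parts[0], parts[1], nil
}

func main() {
	for _, n := range []string{"apache/incubator-devlake", "incubator-devlake"} {
		owner, name, err := splitRepoName(n)
		if err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("owner:", owner, "name:", name)
	}
}
```

Running a check like this over every repository name in the project is a cheap way to rule out malformed names before looking at the collector itself.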
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@devlake.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
