[PR] Fix Pagination Bug in Github GraphQL Job Collector and add configurable page and batch size [incubator-devlake]

via GitHub Fri, 17 Oct 2025 12:12:46 -0700


FlomoN opened a new pull request, #8616:
URL: https://github.com/apache/incubator-devlake/pull/8616


   <!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
   -->
   ### ⚠️ Pre Checklist
   
   > Please complete _ALL_ items in this checklist, and remove before submitting
   
   - [x ] I have read through the [Contributing 
Documentation](https://devlake.apache.org/community/).
   - [ ] I have added relevant tests. -> Will do
   - [ ] I have added relevant documentation. -> Will do. I guess on the 
website? Could you point me to the right file?
   - [x] I will add labels to the PR, such as `pr-type/bug-fix`, 
`pr-type/feature-development`, etc.
   
   <!--
   Thanks for submitting a pull request!
   
   We appreciate you spending the time to work on these changes.
   Please fill out as many sections below as possible.
   -->
   
   ### Summary
   
   This PR fixes the Pagination bug from #8615 and adds the ability to adjust 
the collection paramaters of the Github GraphQL Job collector through 
environment variables. 
   
   The bug was caused by a pagination implementation for workflow jobs in 
`job_collector.go` that didn't play well together with the simultaneous 
batching of multiple workflow runs. In `getPageInfo` the first workflow run 
that had `HasNextPage` true returned it's `EndCursor` which was then used for 
pagination for all Workflow runs in the batch, but only worked for the one it 
came from, therefore missing all the pages of other workflow runs in the same 
batch.
   
   ```go
   func getPageInfo(query interface{}, args *helper.GraphqlCollectorArgs) 
(*helper.GraphqlQueryPageInfo, error) {
        queryWrapper := query.(*GraphqlQueryCheckRunWrapper)
        hasNextPage := false
        endCursor := ""
        for _, node := range queryWrapper.Node {
                if node.CheckSuite.CheckRuns.PageInfo.HasNextPage {
                        hasNextPage = true
                        endCursor = 
node.CheckSuite.CheckRuns.PageInfo.EndCursor // <- This cursor will be used for 
all workflow runs, since only one skipCursor variable exists.
                        break // <- Then stops after the first was found
                }
        }
        return &helper.GraphqlQueryPageInfo{
                EndCursor:   endCursor,
                HasNextPage: hasNextPage,
        }, nil
   }
   ```
   
   Since some people reported having timeout issues with this plugin 
(@ClaudioMascaro and @robaca) for large repositories with large workflows I 
didn't want to take away either option. But combining pagination and batching 
at the same time didn't seem feasible without adding a lot of hard to maintain 
complexity to it (like keeping track of all produced EndCursors and submitting 
them to following collection, but only for those workflow runs that had more 
pages...).
   
   I ended up implementing a configurable mode switch:
   
   - Either you choose **BATCHING** mode, if you want a little bit better 
performance while collecting to reduce graphQL API calls, when your Workflow 
Runs dont have more than 20 Jobs (this number is also adjustable however).
   - OR you choose **PAGINATING** mode, if your workflow runs typically have 
more than 20 Jobs and would potentially cause timeouts of the graphql API when 
trying to fetch all jobs at once.
   
   I think Rate Limit wise it doesnt make a difference (If I understood that 
point system correctly), since both ways would in total consume about the same 
amount of points.
   
   Since it was suggested in #8469 to have the batch size and page size 
configurable via environment variable, I also added this to give more 
flexibility to the users in finding the sweet spot for their setup.
   
   ### Does this close any open issues?
   Closes #8615
   
   ### Screenshots
   Include any relevant screenshots here.
   
   ### Other Information
   
   Maybe it would also make sense to add these configuration options to Scope 
Config and thus make them configurable on a per scope config basis, but this 
would also need migration scripts and so on. So maybe an improvement for the 
future?
   
   I will add tests aswell in the following days, if this approach seems 
appropriate to you?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] Fix Pagination Bug in Github GraphQL Job Collector and add configurable page and batch size [incubator-devlake]

Reply via email to