zaiddialpad commented on PR #8791:
URL: 
https://github.com/apache/incubator-devlake/pull/8791#issuecomment-4200909836

   Hi — just wanted to share that we're actively hitting both issues this PR 
addresses in production, and this fix would be hugely impactful for us.
   
   ## Our setup
   
   - **DevLake v1.0.3-beta9** on GKE (Helm chart)
   - **GitHub App auth** across 4 connections (408 repos total)
   - Largest repo: ~20K PRs, ~1,200 currently OPEN
   
   ## What we're experiencing
   
   ### 1. The `updateRateRemaining` panic (the primary fix in this PR)
   
   Our lake pod is at **42 restarts** — almost entirely from this panic. The 
sequence is:
   
   1. The `github_graphql` "Collect Pull Requests" subtask enters Phase 2 
(refetching all OPEN PRs)
   2. Phase 2 sends heavy compound GraphQL queries (100 PRs per batch) that 
trigger **502 Bad Gateway** from GitHub
   3. The retry loop burns 10+ retries at 120s each per failing batch; across 
Phase 2's many batches, total runtime easily exceeds 1 hour
   4. Our GitHub App token expires (~1hr lifetime)
   5. The background `updateRateRemaining` goroutine calls `/rate_limit` with 
the expired token → **401 Unauthorized**
   6. `panic()` at `graphql_async_client.go:129` → pod crash
   
   ```
   panic: non-200 OK status code: 401 Unauthorized body: "Bad credentials"
   
   goroutine 3660 [running]:
   
github.com/apache/incubator-devlake/helpers/pluginhelper/api.(*GraphqlAsyncClient).updateRateRemaining.func1()
       /app/helpers/pluginhelper/api/graphql_async_client.go:129 +0x166
   ```
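   For anyone skimming the fix: the failure mode above boils down to a background poller treating any non-200 from `/rate_limit` as fatal. Here's a minimal sketch of the graceful-handling behavior this PR moves to — `fetchRateLimit` and the surrounding names are illustrative stubs, not DevLake's actual API:

   ```go
   package main

   import (
   	"errors"
   	"fmt"
   )

   // fetchRateLimit stands in for the GitHub /rate_limit call. This stub
   // fails the way an expired GitHub App token does: a 401 response.
   func fetchRateLimit() (int, error) {
   	return 0, errors.New("non-200 OK status code: 401 Unauthorized")
   }

   // updateRateRemaining sketches the fixed behavior: report the error and
   // keep the last known value, so one failed poll cannot crash the pod.
   func updateRateRemaining() (remaining int, ok bool) {
   	r, err := fetchRateLimit()
   	if err != nil {
   		// Before the fix this path was a panic(); now the background
   		// goroutine just logs and retries on its next tick.
   		fmt.Println("rate limit update skipped:", err)
   		return 0, false
   	}
   	return r, true
   }

   func main() {
   	_, ok := updateRateRemaining()
   	fmt.Println("survived failed update:", !ok)
   }
   ```

   With handling like this, a 401 from an expired token just means the rate-limit estimate goes stale until the next poll, instead of taking down every running task.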
   
   ### 2. GraphQL client token refresh (the second fix in this PR)
   
   Since the GraphQL client uses a `StaticTokenSource` that freezes the token 
at task start, any task running longer than ~1 hour with GitHub App auth is 
guaranteed to eventually fail. For us, the `github_graphql` task on our largest 
repo takes 2-3 hours even when it succeeds, so we hit this on every single 
pipeline run.
   
   The REST client was already fixed for this via #8746, but without this PR 
the GraphQL client remains vulnerable.
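   To illustrate why the frozen token is fatal for long tasks: a static snapshot taken at task start never observes the App token rotating, while re-reading a shared store per request does. A self-contained sketch of the difference — the `tokenStore` type and all names here are hypothetical stand-ins, not DevLake's actual `RefreshRoundTripper` implementation:

   ```go
   package main

   import (
   	"fmt"
   	"sync"
   )

   // tokenStore stands in for a cache of the GitHub App installation token
   // that a background refresher keeps up to date.
   type tokenStore struct {
   	mu    sync.Mutex
   	token string
   }

   func (s *tokenStore) Get() string {
   	s.mu.Lock()
   	defer s.mu.Unlock()
   	return s.token
   }

   func (s *tokenStore) Set(t string) {
   	s.mu.Lock()
   	defer s.mu.Unlock()
   	s.token = t
   }

   // authHeader reads the store on every call, so it picks up renewals;
   // a value captured once at task start never does.
   func authHeader(s *tokenStore) string {
   	return "Bearer " + s.Get()
   }

   func main() {
   	store := &tokenStore{token: "ghs_initial"}
   	static := "Bearer " + store.Get() // frozen at task start

   	store.Set("ghs_renewed") // App token rotated after ~1h

   	fmt.Println("static: ", static)           // still the expired token
   	fmt.Println("dynamic:", authHeader(store)) // sees the renewal
   }
   ```

   Wiring the GraphQL client to a per-request token source (as this PR does via the round tripper) is the "dynamic" column here; `StaticTokenSource` is the "static" one.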
   
   ## Impact on us
   
   - **Our largest repo has failed on 4 of the last 6 daily pipeline runs** due 
to this combination
   - Each failure wastes 15+ hours of pipeline time (retries + pod restarts + 
re-queuing)
   - The pod crashes affect **all** running tasks, not just the one that 
triggered the panic
   
   
   ## Summary
   
   This PR addresses exactly the two bugs compounding against us:
   1. The panic → replaced with graceful error handling ✅
   2. The static token → wired up to `RefreshRoundTripper` for automatic 
renewal ✅
   
   Would love to see this merged. Happy to help test if that would be useful — 
we have a production environment that reliably reproduces both issues on every 
nightly sync.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
