GitHub user dosubot[bot] added a comment to the discussion: Transient 502s & 
Incremental Cursor Gaps: Scaling github_graphql for 300+ Repos

<!-- Greeting -->
Hi @zaiddialpad! I'm [Dosu](https://go.dosu.dev/dosubot) and I'm helping the 
incubator-devlake team.

<!-- Answer -->
Great questions — I dug into the codebase to give you concrete answers on all 
three.

**1. Retry Logic: Yes, it exists, but with caveats.**

The `GraphqlAsyncClient` *does* retry HTTP-level errors including 502 and 503 
[[1]](https://github.com/apache/incubator-devlake/blob/2c15ad880dce830c6512fa58039527cbf4745b8b/backend/helpers/pluginhelper/api/graphql_async_client.go#L158-L189).
 It uses a **fixed backoff** (not exponential) controlled by two env vars:

| Variable | Default | What it does |
|---|---|---|
| `API_RETRY` | `3` | Max retry attempts per GraphQL request |
| `API_TIMEOUT` | `30s` | Wait duration *between* retries (for the GraphQL 
client specifically) |

So if you're seeing 502s kill tasks after retries, you may want to bump 
`API_RETRY` higher (e.g., `5` or `7`) given your scale. Note that GraphQL-level 
`DataError` responses (schema/validation errors) are *not* retried — only 
transport-level failures are.

A closely related issue was reported in 
[#8614](https://github.com/apache/incubator-devlake/issues/8614) 
[[2]](https://github.com/apache/incubator-devlake/issues/8614) where 502s and 
stream cancellations were caused by GitHub's server-side response size limits 
on large repos. The fix (merged in v1.0.3-beta8) introduced 
`GITHUB_JOB_GRAPHQL_PAGE_SIZE` to reduce page sizes. If you haven't tuned this, 
it may help reduce the frequency of 502s.

**⚠️ Heads-up on two active bugs that compound this:**

- The GraphQL client's background rate-limit goroutine calls `panic(err)` on 
*any* error (including a transient 502 during rate-limit polling), which 
**crashes the entire DevLake process** 
[[3]](https://github.com/apache/incubator-devlake/blob/2c15ad880dce830c6512fa58039527cbf4745b8b/backend/helpers/pluginhelper/api/graphql_async_client.go#L127-L131).
 This is tracked in 
[#8788](https://github.com/apache/incubator-devlake/issues/8788) 
[[4]](https://github.com/apache/incubator-devlake/issues/8788).
- If you're authenticating via a GitHub App, the GraphQL client uses a 
`StaticTokenSource` that **freezes the token at task start**. After ~1 hour 
when it expires, all GraphQL requests fail with 401 and trigger the panic 
above. The REST client was already fixed for this, but the GraphQL client was 
not [[4]](https://github.com/apache/incubator-devlake/issues/8788).

Both are addressed in the open [PR 
#8791](https://github.com/apache/incubator-devlake/pull/8791) 
[[5]](https://github.com/apache/incubator-devlake/pull/8791) — not yet merged. 
Given your GitHub App auth + 346 repos (likely >1 hour syncs), this is almost 
certainly hitting you.

**2. Cursor Commitment: Strict all-or-nothing — your concern about gaps is 
partially valid.**

The `CollectorStateManager.Close()` that persists `LatestSuccessStart` is 
**only called when all nested collectors succeed** 
[[6]](https://github.com/apache/incubator-devlake/blob/2c15ad880dce830c6512fa58039527cbf4745b8b/backend/helpers/pluginhelper/api/api_collector_stateful.go#L94-L104)
 
[[7]](https://github.com/apache/incubator-devlake/blob/2c15ad880dce830c6512fa58039527cbf4745b8b/backend/helpers/pluginhelper/api/collector_state_manager.go#L108-L119).
 If a subtask fails at any point, `Close()` is never reached, so the cursor 
stays at its previous position. On the next run, collection restarts from the 
same `since` timestamp.

Here's the nuance on data gaps:

- **Raw data** collected before the failure **persists** in the 
`_raw_github_graphql_*` tables (no rollback) 
[[8]](https://github.com/apache/incubator-devlake/blob/2c15ad880dce830c6512fa58039527cbf4745b8b/backend/helpers/pluginhelper/api/graphql_collector.go#L296-L308).
- On the next incremental run, the raw table is **not** cleared, so the re-run 
will produce **duplicate raw rows** for data already collected in the failed 
run.
- The GraphQL pagination cursor (`EndCursor`) is **entirely in-memory** — if a 
subtask fails on page 47 of 100, the next run restarts from page 1.
- **No data should be silently skipped**, because the `since` timestamp doesn't 
advance on failure. The risk is duplicates, not gaps. The extractor/converter 
pipeline should handle deduplication at the domain layer via primary keys, 
though it's worth verifying this for your specific data models.
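
The failure semantics above can be sketched as a small simulation (`collectorState`, `record`, and friends are illustrative stand-ins, not DevLake's actual structures):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// collectorState mimics CollectorStateManager: the since cursor only
// advances when an entire run succeeds.
type collectorState struct{ since time.Time }

type record struct{ ID string }

// runCollection appends every fetched record to the raw table (no rollback)
// and commits the cursor only if no page failed, mirroring the
// all-or-nothing Close() behavior.
func runCollection(state *collectorState, raw *[]record, pages [][]record, failAt int) error {
	start := time.Now()
	for i, page := range pages {
		if i == failAt {
			return errors.New("502 on page") // cursor NOT advanced
		}
		*raw = append(*raw, page...) // raw rows persist even if a later page fails
	}
	state.since = start // "Close()": commit cursor only on full success
	return nil
}

// toDomain deduplicates raw rows by primary key, as the extractor/converter
// layer is expected to do.
func toDomain(raw []record) map[string]record {
	domain := map[string]record{}
	for _, r := range raw {
		domain[r.ID] = r
	}
	return domain
}

func main() {
	state := &collectorState{}
	var raw []record
	pages := [][]record{{{ID: "a"}}, {{ID: "b"}}, {{ID: "c"}}}

	// First run fails on page 2: rows a and b persist, cursor stays put.
	_ = runCollection(state, &raw, pages, 2)
	// Second run restarts from the same since and re-collects everything.
	_ = runCollection(state, &raw, pages, -1)

	fmt.Println(len(raw), len(toDomain(raw))) // prints "5 3": duplicate raw rows, unique domain rows
}
```

This is the "duplicates, not gaps" property in miniature: the raw table grows on every failed-then-retried run, but the domain layer converges as long as primary keys are stable.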

**3. Recommended Workarounds at Scale**

- **Increase retries**: Set `API_RETRY=5` or higher and consider increasing 
`API_TIMEOUT` (the inter-retry wait) to give GitHub more breathing room.
- **Reduce page sizes**: Set `GITHUB_JOB_GRAPHQL_PAGE_SIZE` to a smaller value 
(e.g., 20 or 30) to reduce the chance of GitHub returning 502 on large 
responses.
- **Watch for PR #8791**: This is critical for your setup — it fixes the panic 
crash and the GitHub App token expiry issue for the GraphQL client. You may 
want to comment on it to signal urgency.
- **Consider splitting blueprints**: Breaking your 346 repos into multiple 
blueprints (e.g., groups of 50-80) can reduce the blast radius of a single 
failure and keep individual sync times under the 1-hour token expiry window.
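
Taken together, the tuning knobs above would look roughly like this in DevLake's `.env` (the values are illustrative starting points for your scale, not tested recommendations):

```shell
# Retry transient transport errors (502/503) more aggressively
API_RETRY=5
# Fixed wait between GraphQL retries
API_TIMEOUT=60s
# Smaller GraphQL pages to stay under GitHub's response-size limits
GITHUB_JOB_GRAPHQL_PAGE_SIZE=30
```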

Hope this helps make your pipelines more resilient. The retry mechanism exists, 
but the defaults are conservative for a 346-repo deployment, and the two bugs 
tracked in #8788 are likely compounding your failures significantly.



GitHub link: 
https://github.com/apache/incubator-devlake/discussions/8821#discussioncomment-16398435
