GitHub user dosubot[bot] added a comment to the discussion: Transient 502s & Incremental Cursor Gaps: Scaling github_graphql for 300+ Repos
Hi @zaiddialpad! I'm [Dosu](https://go.dosu.dev/dosubot) and I'm helping the incubator-devlake team.

Great questions. I dug into the codebase to give you concrete answers on all three.

**1. Retry Logic: Yes, it exists, but with caveats.**

The `GraphqlAsyncClient` *does* retry HTTP-level errors, including 502 and 503 [[1]](https://github.com/apache/incubator-devlake/blob/2c15ad880dce830c6512fa58039527cbf4745b8b/backend/helpers/pluginhelper/api/graphql_async_client.go#L158-L189). It uses a **fixed backoff** (not exponential) controlled by two env vars:

| Variable | Default | What it does |
|---|---|---|
| `API_RETRY` | `3` | Max retry attempts per GraphQL request |
| `API_TIMEOUT` | `30s` | Wait duration *between* retries (for the GraphQL client specifically) |

So if you're seeing 502s kill tasks after retries, you may want to bump `API_RETRY` higher (e.g., `5` or `7`) given your scale. Note that GraphQL-level `DataError` responses (schema/validation errors) are *not* retried; only transport-level failures are. (A minimal sketch of this retry shape appears at the end of this section.)

A closely related issue was reported in [#8614](https://github.com/apache/incubator-devlake/issues/8614) [[2]](https://github.com/apache/incubator-devlake/issues/8614), where 502s and stream cancellations were caused by GitHub's server-side response size limits on large repos. The fix (merged in v1.0.3-beta8) introduced `GITHUB_JOB_GRAPHQL_PAGE_SIZE` to reduce page sizes. If you haven't tuned this, it may help reduce the frequency of 502s.

**⚠️ Heads-up on two active bugs that compound this:**

- The GraphQL client's background rate-limit goroutine calls `panic(err)` on *any* error (including a transient 502 during rate-limit polling), which **crashes the entire DevLake process** [[3]](https://github.com/apache/incubator-devlake/blob/2c15ad880dce830c6512fa58039527cbf4745b8b/backend/helpers/pluginhelper/api/graphql_async_client.go#L127-L131). This is tracked in [#8788](https://github.com/apache/incubator-devlake/issues/8788) [[4]](https://github.com/apache/incubator-devlake/issues/8788). (See the goroutine sketch below.)
- If you're authenticating via a GitHub App, the GraphQL client uses a `StaticTokenSource` that **freezes the token at task start**. After ~1 hour, when the token expires, all GraphQL requests fail with 401 and trigger the panic above. The REST client was already fixed for this, but the GraphQL client was not [[4]](https://github.com/apache/incubator-devlake/issues/8788). (See the token-source sketch below.)

Both are addressed in [PR #8791](https://github.com/apache/incubator-devlake/pull/8791) [[5]](https://github.com/apache/incubator-devlake/pull/8791), which is still open and not yet merged. Given your GitHub App auth plus 346 repos (likely >1-hour syncs), this is almost certainly hitting you.
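To make the retry behavior concrete, here is a minimal, hedged sketch of the fixed-backoff shape described above. `doWithRetry` is a hypothetical helper, not DevLake's actual implementation; it only illustrates bounded attempts (cf. `API_RETRY`) with a constant wait between them (cf. `API_TIMEOUT`):

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// doWithRetry is a hypothetical helper sketching fixed-backoff retries:
// at most maxRetry extra attempts, with a constant wait between them.
// Not DevLake's actual code.
func doWithRetry(client *http.Client, url string, maxRetry int, wait time.Duration) (*http.Response, error) {
	var lastErr error
	for attempt := 0; attempt <= maxRetry; attempt++ {
		if attempt > 0 {
			time.Sleep(wait) // fixed backoff: the same wait every time, never exponential
		}
		resp, err := client.Get(url)
		if err != nil {
			lastErr = err // transport-level failure: retry
			continue
		}
		if resp.StatusCode >= 500 { // e.g. a 502/503 from GitHub: retry
			resp.Body.Close()
			lastErr = fmt.Errorf("server error: %s", resp.Status)
			continue
		}
		return resp, nil // success (or a 4xx the caller must handle)
	}
	return nil, errors.Join(errors.New("retries exhausted"), lastErr)
}

func main() {
	resp, err := doWithRetry(http.DefaultClient, "https://api.github.com", 3, 30*time.Second)
	if err != nil {
		fmt.Println("gave up:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

The design point to notice is the constant `time.Sleep(wait)`: because the delay never grows, raising `API_RETRY` (and, if needed, `API_TIMEOUT`) is the main lever you have against bursts of 502s.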
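On the panic bug: in Go, a panic in any goroutine that is not recovered *inside that same goroutine* terminates the whole process, which is why a single failed rate-limit poll can take DevLake down. A simplified illustration of the failure mode and the safer report-over-a-channel shape (`pollRateLimit` is a hypothetical stand-in, not DevLake code):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pollRateLimit is a hypothetical stand-in for the rate-limit polling call.
func pollRateLimit() error {
	return errors.New("502 Bad Gateway") // pretend GitHub hiccupped
}

func main() {
	errs := make(chan error, 1)

	// Problematic shape: calling panic(err) inside this goroutine would
	// crash the entire process, because recover() only works within the
	// goroutine that panicked. Safer shape: hand the error back to the
	// owner and let it decide whether to retry or fail the task.
	go func() {
		for {
			if err := pollRateLimit(); err != nil {
				errs <- err // report instead of panic(err)
				return
			}
			time.Sleep(time.Second)
		}
	}()

	fmt.Println("poller reported:", <-errs)
}
```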
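And on the token-expiry bug, here is the difference between a frozen and a refreshing token source, sketched with `golang.org/x/oauth2`. `appTokenSource` and `mintInstallationToken` are hypothetical stand-ins for the GitHub App installation-token exchange; this shows the general shape of a fix, not the contents of PR #8791:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/oauth2"
)

// appTokenSource is a hypothetical TokenSource that mints a fresh GitHub App
// installation token on demand. mintInstallationToken stands in for the
// JWT/installation-token exchange; neither is a real DevLake or go-github API.
type appTokenSource struct{}

func (s appTokenSource) Token() (*oauth2.Token, error) {
	tok, expiry, err := mintInstallationToken()
	if err != nil {
		return nil, err
	}
	return &oauth2.Token{AccessToken: tok, Expiry: expiry}, nil
}

func mintInstallationToken() (string, time.Time, error) {
	// ...call GitHub's installation access token endpoint here...
	return "ghs_example", time.Now().Add(55 * time.Minute), nil
}

func main() {
	// Frozen: the token captured at task start is returned forever, so once
	// it expires (~1 hour for GitHub Apps) every request starts failing 401.
	frozen := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: "ghs_example"})

	// Refreshing: ReuseTokenSource caches the current token while it is
	// valid and asks appTokenSource for a new one once Expiry has passed.
	refreshing := oauth2.ReuseTokenSource(nil, appTokenSource{})

	t1, _ := frozen.Token()
	t2, _ := refreshing.Token()
	fmt.Println(t1.AccessToken, t2.Expiry)
}
```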
**2. Cursor Commitment: Strict all-or-nothing. Your concern about gaps is partially valid.**

The `CollectorStateManager.Close()` that persists `LatestSuccessStart` is **only called when all nested collectors succeed** [[6]](https://github.com/apache/incubator-devlake/blob/2c15ad880dce830c6512fa58039527cbf4745b8b/backend/helpers/pluginhelper/api/api_collector_stateful.go#L94-L104) [[7]](https://github.com/apache/incubator-devlake/blob/2c15ad880dce830c6512fa58039527cbf4745b8b/backend/helpers/pluginhelper/api/collector_state_manager.go#L108-L119). If a subtask fails at any point, `Close()` is never reached, so the cursor stays at its previous position. On the next run, collection restarts from the same `since` timestamp.

Here's the nuance on data gaps:

- **Raw data** collected before the failure **persists** in the `_raw_github_graphql_*` tables (no rollback) [[8]](https://github.com/apache/incubator-devlake/blob/2c15ad880dce830c6512fa58039527cbf4745b8b/backend/helpers/pluginhelper/api/graphql_collector.go#L296-L308).
- On the next incremental run, the raw table is **not** cleared, so the re-run will produce **duplicate raw rows** for data already collected in the failed run.
- The GraphQL pagination cursor (`EndCursor`) is **entirely in-memory**; if a subtask fails on page 47 of 100, the next run restarts from page 1.
- **No data should be silently skipped**, because the `since` timestamp doesn't advance on failure. The risk is duplicates, not gaps. The extractor/converter pipeline should handle deduplication at the domain layer via primary keys (see the upsert sketch at the end of this comment), though it's worth verifying this for your specific data models.

**3. Recommended Workarounds at Scale**

- **Increase retries**: Set `API_RETRY=5` or higher, and consider increasing `API_TIMEOUT` (the inter-retry wait) to give GitHub more breathing room.
- **Reduce page sizes**: Set `GITHUB_JOB_GRAPHQL_PAGE_SIZE` to a smaller value (e.g., 20 or 30) to reduce the chance of GitHub returning 502 on large responses. (An illustrative `.env` excerpt follows below.)
- **Watch for PR #8791**: This is critical for your setup; it fixes the panic crash and the GitHub App token-expiry issue for the GraphQL client. You may want to comment on it to signal urgency.
- **Consider splitting blueprints**: Breaking your 346 repos into multiple blueprints (e.g., groups of 50-80) can reduce the blast radius of a single failure and keep individual sync times under the 1-hour token-expiry window.

Hope this helps make the pipelines more resilient. The retry mechanism is there, but the defaults are conservative for a 346-repo deployment, and the two bugs in #8788 are likely compounding your failures significantly.
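For reference, here is what the settings above might look like in a DevLake `.env`. The variable names all come from the discussion above; the values are illustrative starting points, not official recommendations:

```env
# Illustrative starting values; tune for your deployment.
API_RETRY=5                      # more attempts before a collector gives up
API_TIMEOUT=30s                  # fixed wait between retries
GITHUB_JOB_GRAPHQL_PAGE_SIZE=30  # smaller pages, fewer oversized responses that 502
```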
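Finally, to make the "duplicates, not gaps" point concrete, here is a hedged sketch of the primary-key upsert pattern that domain-layer converters typically rely on. `Issue` is a hypothetical model rather than one of DevLake's actual domain models, and this is the general GORM pattern, not DevLake's code:

```go
package main

import (
	"fmt"

	"gorm.io/driver/sqlite"
	"gorm.io/gorm"
	"gorm.io/gorm/clause"
)

// Issue is a hypothetical model. The point is only that a stable primary
// key turns re-collection into an upsert instead of a duplicate insert.
type Issue struct {
	ID    string `gorm:"primaryKey"` // e.g. "github:GithubIssue:1:12345"
	Title string
	State string
}

func main() {
	db, err := gorm.Open(sqlite.Open(":memory:"), &gorm.Config{})
	if err != nil {
		panic(err)
	}
	if err := db.AutoMigrate(&Issue{}); err != nil {
		panic(err)
	}

	row := Issue{ID: "github:GithubIssue:1:12345", Title: "Transient 502s", State: "open"}

	// The first run inserts; a re-run after a failed window hits the same
	// primary key and updates in place, so duplicate rows in the raw tables
	// never become duplicate rows in the domain tables.
	db.Clauses(clause.OnConflict{UpdateAll: true}).Create(&row)
	row.State = "closed"
	db.Clauses(clause.OnConflict{UpdateAll: true}).Create(&row)

	var count int64
	db.Model(&Issue{}).Count(&count)
	fmt.Println("rows:", count) // prints 1, not 2
}
```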
GitHub link: https://github.com/apache/incubator-devlake/discussions/8821#discussioncomment-16398435
