GitHub user zaiddialpad created a discussion: Transient 502s & Incremental Cursor Gaps: Scaling github_graphql for 300+ Repos

GitHub link: https://github.com/apache/incubator-devlake/discussions/8821
### Environment

- DevLake Version: v1.0.3-beta9
- Plugin: `github_graphql`
- Source: GitHub Cloud (authenticated via a GitHub App)
- Deployment Scale: 340+ repositories running on a daily sync blueprint

### The Context

First off, a huge thanks to the contributors for DevLake; it's been a game-changer for our internal metrics. However, as we've scaled our deployment to 346+ repositories, we've run into two interconnected issues that are making our production dashboards shaky. Our daily pipelines currently see 1-2 task failures on every run, primarily due to how transient errors are handled during the collection phase.

### Issue 1: GitHub GraphQL 502 Bad Gateway (Transient)

We are seeing intermittent `graphql query got error` failures when GitHub returns a 502. These errors are clearly transient on GitHub's end, but the `github_graphql` plugin appears to treat them as fatal. In a deployment this large, it is statistically inevitable that at least one repo will hit a 502 during a big sync, and because there doesn't seem to be a built-in retry mechanism for these specific HTTP status codes, a single hiccup from GitHub kills the entire task. (A rough sketch of the retry behavior we have in mind is at the end of this post.)

### Issue 2: The "Silent" Incremental Collection Cursor Gap

When a `Collect` subtask fails halfway through (due to the 502 mentioned above), we've observed the following behavior:

- **Partial write:** data collected before the 502 is already committed to the raw tables.
- **Cursor stalls:** the incremental cursor is not advanced, because the subtask failed.
- **The gap:** on the next scheduled run, the incremental cursor picks up from the last successful run.

Our concern is that if the next run skips over the "failed window", or if the logic assumes the data was already handled because it exists in the raw tables, we end up with silent data gaps in our domain tables. For an organization relying on these tables for DORA metrics, missing even a few PRs or issues creates a significant trust issue with the data. (The second sketch at the end of this post illustrates the commit semantics we are asking about.)

### Questions for the Maintainers

- **Retry logic:** does the `github_graphql` plugin currently support (or have plans for) configurable retries on transient HTTP errors such as 502 or 503?
- **Cursor commitment:** is the incremental cursor committed only upon total subtask success? If a subtask is partially successful, is there a risk that the next run skips the "partially collected" data window?
- **Workarounds:** for those running DevLake at this scale (300+ repos), are there recommended configurations to mitigate these transient failures? We are currently using Advanced Mode blueprints to surgically exclude problematic subtasks, but a more automated "resiliency" setting would be ideal.

We'd love to hear whether others are seeing this, or whether there's a specific configuration in v1.0.3 we might be missing to make these pipelines more self-healing.
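To make the retry ask in Issue 1 concrete, here is a minimal sketch of the kind of wrapper we have in mind. This is our own illustration rather than DevLake's actual API: `withRetry`, `httpError`, and `transientStatus` are hypothetical names, and the attempt count and backoff base would presumably become plugin configuration.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// httpError is a stand-in for whatever error type carries the HTTP status.
type httpError struct{ code int }

func (e *httpError) Error() string {
	return fmt.Sprintf("graphql query got error: HTTP %d", e.code)
}

// transientStatus reports whether a status code is worth retrying.
func transientStatus(code int) bool {
	switch code {
	case http.StatusBadGateway, http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		return true
	}
	return false
}

// withRetry runs fn up to maxAttempts times, sleeping with exponential
// backoff plus jitter between attempts. Non-transient errors are returned
// immediately; only 502/503/504-style failures are retried.
func withRetry(maxAttempts int, base time.Duration, fn func() error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		var he *httpError
		if !errors.As(err, &he) || !transientStatus(he.code) {
			return err // permanent failure: fail the task as today
		}
		if attempt < maxAttempts {
			backoff := base<<uint(attempt-1) + time.Duration(rand.Int63n(int64(base)))
			time.Sleep(backoff)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	// Simulate a collector call that always gets a 502 from GitHub.
	err := withRetry(3, time.Second, func() error {
		return &httpError{code: 502}
	})
	fmt.Println(err)
}
```

Even a cap of three attempts with a one-second base would absorb virtually every single-502 hiccup we see today.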

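And here is the commit-on-success invariant we are asking about in the cursor question, again as a simplified sketch with hypothetical helpers (`collectWindow`, `loadCursor`, `saveCursor`), not DevLake's actual internals:

```go
package main

import (
	"fmt"
	"time"
)

// collectWindow pulls records updated in (since, until] and writes them to
// the raw tables as it goes. If it dies midway (e.g. on a 502), part of the
// window is already persisted, but the cursor below must NOT move.
func collectWindow(since, until time.Time) error {
	// ... page through the GraphQL API, committing each page to raw tables ...
	return nil
}

// runIncremental shows the invariant we hope holds: the cursor advances only
// after the whole window succeeds. If the cursor were ever advanced on a
// partial failure, or if a later run treated rows already present in the raw
// tables as "done", the tail of the failed window would be silently skipped.
func runIncremental(loadCursor func() time.Time, saveCursor func(time.Time)) error {
	since := loadCursor()
	until := time.Now()
	if err := collectWindow(since, until); err != nil {
		// Leave the cursor at `since`: the next run re-collects the whole
		// window, and idempotent upserts make the partial write harmless.
		return err
	}
	saveCursor(until) // commit only on full success
	return nil
}

func main() {
	cursor := time.Now().Add(-24 * time.Hour) // last successful sync
	err := runIncremental(
		func() time.Time { return cursor },
		func(t time.Time) { cursor = t },
	)
	fmt.Println(err, cursor)
}
```

If DevLake already guarantees this invariant, and the converter layer re-processes raw rows when the window is re-collected, then the partial write from a failed run is harmless and our only real problem is Issue 1.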