I have a cron job scraping some content from GitHub every night (about 20k small files). It worked well for a year, but recently something changed: after a few minutes GitHub starts returning a lot of 403s, and then after another minute I start getting thousands of these:
  HTTP/2 stream 20135 was not closed cleanly before end of the underlying stream
  HTTP/2 stream 20137 was not closed cleanly before end of the underlying stream
  HTTP/2 stream 20139 was not closed cleanly before end of the underlying stream

So either they introduced a server bug, or GitHub is deliberately blocking what it considers abusive behavior due to high concurrency.

I am using a multi handle with CURLPIPE_MULTIPLEX and otherwise default settings. Am I correct that this means libcurl starts 100 concurrent streams per connection (CURLMOPT_MAX_CONCURRENT_STREAMS) and still opens 6 concurrent connections per host (CURLMOPT_MAX_HOST_CONNECTIONS), i.e. downloads 600 files in parallel? I can imagine that could be considered abusive. Should I set CURLMOPT_MAX_HOST_CONNECTIONS to 1 when using HTTP/2 multiplexing? Or is CURLMOPT_MAX_HOST_CONNECTIONS ignored when multiplexing?

One other thing I noticed is that GitHub does not seem to advertise any MAX_CONCURRENT_STREAMS, or at least I am not seeing one. For example, on httpbin I see this:

  curl -v 'https://httpbin.org/get' --http2
  * Connection state changed (MAX_CONCURRENT_STREAMS == 128)!

However, for GitHub I don't see such a line:

  curl -v 'https://raw.githubusercontent.com/curl/curl/master/README' --http2

So does this mean libcurl will assume 100 streams is OK? Is there a way to debug this and monitor how many active downloads a multi handle is making in total (summed over all connections)? Afaik, the 'running_handles' value from curl_multi_perform() gives me the total number of uncompleted transfers, including those that have not started yet, so it does not tell me how many are actually in progress.
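For reference, here is a minimal sketch of what I am considering to cap the concurrency and count started-but-unfinished transfers myself, assuming libcurl >= 7.67.0 (needed for CURLMOPT_MAX_CONCURRENT_STREAMS). The 'in_progress' counter and the 'xfer' struct are my own bookkeeping, not anything libcurl reports:

  #include <curl/curl.h>
  #include <stdlib.h>

  static int in_progress; /* transfers started but not yet finished */

  struct xfer {
    int started;
  };

  /* the first progress callback means the transfer has actually begun */
  static int xferinfo(void *clientp, curl_off_t dltotal, curl_off_t dlnow,
                      curl_off_t ultotal, curl_off_t ulnow)
  {
    struct xfer *x = clientp;
    if(!x->started) {
      x->started = 1;
      in_progress++;
    }
    (void)dltotal; (void)dlnow; (void)ultotal; (void)ulnow;
    return 0; /* continue the transfer */
  }

  static void setup_multi(CURLM *multi)
  {
    /* enable HTTP/2 multiplexing */
    curl_multi_setopt(multi, CURLMOPT_PIPELINING, CURLPIPE_MULTIPLEX);
    /* never open more than one connection per host... */
    curl_multi_setopt(multi, CURLMOPT_MAX_HOST_CONNECTIONS, 1L);
    /* ...and use far fewer streams than the 100 default */
    curl_multi_setopt(multi, CURLMOPT_MAX_CONCURRENT_STREAMS, 10L);
  }

  static void add_download(CURLM *multi, const char *url)
  {
    CURL *easy = curl_easy_init();
    struct xfer *x = calloc(1, sizeof(*x));
    curl_easy_setopt(easy, CURLOPT_URL, url);
    curl_easy_setopt(easy, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_2TLS);
    /* wait for a multiplexed stream instead of opening a new connection */
    curl_easy_setopt(easy, CURLOPT_PIPEWAIT, 1L);
    curl_easy_setopt(easy, CURLOPT_NOPROGRESS, 0L);
    curl_easy_setopt(easy, CURLOPT_XFERINFOFUNCTION, xferinfo);
    curl_easy_setopt(easy, CURLOPT_XFERINFODATA, x);
    curl_easy_setopt(easy, CURLOPT_PRIVATE, x); /* so we can free it later */
    curl_multi_add_handle(multi, easy);
  }

  /* in the event loop, decrement the counter as transfers complete */
  static void reap(CURLM *multi)
  {
    CURLMsg *msg;
    int queued;
    while((msg = curl_multi_info_read(multi, &queued))) {
      if(msg->msg == CURLMSG_DONE) {
        struct xfer *x;
        curl_easy_getinfo(msg->easy_handle, CURLINFO_PRIVATE, (char **)&x);
        if(x->started)
          in_progress--;
        free(x);
        curl_multi_remove_handle(multi, msg->easy_handle);
        curl_easy_cleanup(msg->easy_handle);
      }
    }
  }

My understanding is that CURLOPT_PIPEWAIT makes libcurl wait to multiplex on an existing connection rather than open a new one, so together with CURLMOPT_MAX_HOST_CONNECTIONS == 1 everything should go over a single connection, and 'in_progress' would then tell me how many transfers are actually running. Does that sound right?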