I realize that we've already begun investigating this here, but I think this would be most appropriate for the App Engine public issue tracker. The issue is becoming increasingly specific, and I suspect it will require some exchange of code or project details to reproduce the behavior you've described. We monitor that issue tracker closely.
When filing a new issue on the tracker, please link back to this thread for context, and post a link to the issue here so that others in the community can see the whole picture.

- Be sure to include the latest logs related to the *502*s. When viewing the logs in Stackdriver Logging, for instance, include *All logs* rather than just *request_log*, as the *nginx.error*, *stderr*, *stdout* and *vm.** logs may reveal clues as to a root cause.
- Mention if you are using any middleware, such as servlet filters, that may receive requests before the actual handler.
- Lastly, include what the CPU and/or memory usage looks like on the instance(s) at the time of the 502s. Screenshots of the *Utilization* and *Memory Usage* graphs from the Developers Console will likely be sufficient.

I look forward to this issue report.

On Wednesday, February 8, 2017 at 1:24:01 PM UTC-5, Vinay Chitlangia wrote:
>
> On Wed, Feb 8, 2017 at 10:29 PM, 'Nicholas (Google Cloud Support)' via
> Google App Engine <[email protected]> wrote:
>
>> Hey Vinay Chitlangia,
>>
>> Thanks for some preliminary troubleshooting and for linking this
>> interesting article. App Engine runs Nginx processes to handle routes to
>> your application's handlers. Handlers serving static assets, for instance,
>> are handled by this Nginx process and the resources are served directly,
>> thus bypassing the application altogether to save precious application
>> resources.
>>
>> The Nginx process will often serve a *502* if the application raises an
>> exception, an internal API call raises an exception, or the request simply
>> takes too long. As such, the status code by itself does not tell us much.
>>
>> Looking at the GAE logs for your application, I found the *502*s you
>> mentioned. One thing I noticed is that they all occur on the */read*
>> endpoint. From the naming, I assume this endpoint is reading some data
>> from BigTable.
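As an aside, the first-line request logging discussed later in this thread can be sketched outside of App Engine entirely. Below is a minimal, hypothetical example using the JDK's built-in `com.sun.net.httpserver` (a stand-in for the actual jetty9-compat servlet stack; the `doBigtableRead` helper is invented for illustration). It shows the pattern: log at the very first line of the handler so you can tell whether a request ever reached the application, and catch-and-log every exception so a backend failure cannot die silently.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.util.logging.Logger;

public class ReadEndpointSketch {
    private static final Logger LOG = Logger.getLogger(ReadEndpointSketch.class.getName());

    static void handleRead(HttpExchange exchange) throws java.io.IOException {
        // Absolute first line: proves the request reached the application.
        LOG.info("/read received: " + exchange.getRequestURI());
        try {
            byte[] body = doBigtableRead();  // hypothetical backend read
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        } catch (Exception e) {
            // Any backend failure is logged with its stack trace, never dropped.
            LOG.severe("/read failed: " + e);
            exchange.sendResponseHeaders(500, -1);  // -1: no response body
        } finally {
            exchange.close();
        }
    }

    // Stand-in for the real BigTable call.
    static byte[] doBigtableRead() {
        return "ok".getBytes();
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/read", ReadEndpointSketch::handleRead);
        server.start();
        // Self-request to demonstrate the handler, then shut down.
        int port = server.getAddress().getPort();
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://127.0.0.1:" + port + "/read").openConnection();
        System.out.println("status=" + conn.getResponseCode());  // prints status=200
        server.stop(0);
    }
}
```

If the entry-point log line never appears for the failed requests (as reported below for the 502s), that is strong evidence the request was rejected upstream of the application.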
>> Investigating further, perhaps you could provide some additional
>> information:
>>
>> - What exactly is happening at the */read* endpoint? A code sample
>> would be ideal if that's not too sensitive.
>
> As you surmised, we are reading some data from BigTable in this endpoint.
>
>> - What kind of error handling exists in said endpoint if the BigTable
>> API returns non-success responses?
>
> The entire endpoint is in a try-catch block catching Exception. In the
> case of failure, the exception stack trace gets written to the logs.
> The first line of the endpoint is a log message signalling receipt of the
> request (added for this debugging, of course!).
> For successful requests, this introductory log message gets written; for
> the 502 ones it never does. For requests that fail because of
> BigTable-related errors, the logs have the stack trace, but not for 502s.
> The failed 502 requests finish in <10 ms.
>
>> - Can you log various steps in the */read* endpoint? This might help
>> identify the progress the request reaches before the *502* is served.
>> It would also help confirm that your application is actually even
>> getting the request, as I can't currently confirm that from the logs.
>
> My best guess is that the request does not make it to the servlet. The
> reason is that of the hundreds of failed 502 logs I have seen, not one
> has the log message, which is the absolute first line in the code of the
> read handler.
>
>> - If said endpoint does in fact read from BigTable, what API and Java
>> library are you using?
>
> We are using the Google-provided bigtable-hbase-1.2 jars, version 0.9.4.
>
>> Regarding the article you linked: while the configuration of an HTTPS
>> load balancer and nginx.conf can be very important, both the
>> load-balancing component and nginx.conf are out of the hands of the
>> developer with App Engine.
>> Your scaling settings, health-check settings and handlers in the
>> app.yaml are the only rules affecting load balancing and nginx over
>> which you have control.
>>
>> On Wednesday, February 8, 2017 at 11:27:43 AM UTC-5, Vinay Chitlangia
>> wrote:
>>>
>>> Might be related:
>>>
>>> https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340#.6k2laoada
>>>
>>> The symptoms mentioned in this blog:
>>> - somewhat moderate request rates
>>> - no logs
>>> match our observations.
>>>
>>> I do not see the "backend_connection_closed_before_data_sent_to_client"
>>> status in the logs.
>>>
>>> The error message for a failed request received by the client is:
>>> 11:12:44.549 com.yawasa.server.storage.RpcStorageService LogError:
>>> <html><head><title>502 Bad Gateway</title></head><body
>>> bgcolor="white"><center><h1>502 Bad
>>> Gateway</h1></center><hr><center>nginx</center></body></html> (
>>> RpcStorageService.java:137
>>> <https://console.cloud.google.com/debug/fromlog?appModule=default&appVersion=1&file=RpcStorageService.java&line=137&logInsertId=589569d9000e7bf6825479e4&logNanos=1486186963359794000&nestedLogIndex=0&project=village-test>
>>> )
>>>
>>> The mention of nginx in the log message appears promising. We are not
>>> using nginx deliberately, so I am assuming this is something happening
>>> under the hood.
>>>
>>> On Tuesday, February 7, 2017 at 11:08:55 AM UTC+5:30, Vinay Chitlangia
>>> wrote:
>>>>
>>>> Hi,
>>>> We are seeing intermittent occurrences of a 502 Bad Gateway error on
>>>> our server. About 0.5% of requests fail with this error.
>>>>
>>>> Our setup is:
>>>> - Flex running jetty9-compat
>>>> - F1 machine
>>>> - 1 server
>>>>
>>>> Our request pattern is bursty, so the server gets ~30 requests in
>>>> parallel. The failures, when they happen, are clustered: over a period
>>>> of 10-ish seconds one would see 3-4 errors.
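For reference, the scaling and health-check settings mentioned above live in app.yaml. A hedged sketch for a Flexible environment setup like the one described in this thread (key names from the 2017-era Flex docs; all values are illustrative, not recommendations):

```yaml
runtime: java
env: flex

# Scaling: keeping several instances warm means bursty traffic
# is less likely to wait on (or overwhelm) a single instance.
automatic_scaling:
  min_num_instances: 2
  max_num_instances: 5

# Health checking: instances that fail these checks are restarted,
# which can surface to clients as 502s from the load balancer.
health_check:
  enable_health_check: True
  check_interval_sec: 5
  timeout_sec: 4
  unhealthy_threshold: 2
  healthy_threshold: 2
```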
>>>> The requests which complete successfully finish in 50-100 ms, so it
>>>> does not appear that the server is under major load and unable to keep
>>>> up. To rule out this possibility, I started the server with 5
>>>> replicas; however, the failure percentage did not change.
>>>>
>>>> From the looks of it, it appears that some throttling or quota issue
>>>> is at play. I tried tweaking the max-concurrent-requests param,
>>>> setting it to 300, but that did not make any difference either.
>>>>
>>>> I do not see new instances being created at the time of failure
>>>> either.
>>>>
>>>> The request log for the failed request:
>>>> 09:57:30.686 POST 502 262 B 4 ms AppEngine-Google; (+
>>>> http://code.google.com/appengine; appid: s~village-test) /read
>>>> 107.178.194.3 - - [07/Feb/2017:09:57:30 +0530] "POST /read HTTP/1.1"
>>>> 502 262 - "AppEngine-Google; (+http://code.google.com/appengine)"
>>>> ms=4 cpu_ms=0 cpm_usd=2.9279999999999998e-8 loading_request=0
>>>> instance=- app_engine_release=1.9.48 trace_id=-
>>>> {
>>>>   protoPayload: {…}
>>>>   insertId: "58994cb30002335cb47fd364"
>>>>   httpRequest: {…}
>>>>   resource: {…}
>>>>   timestamp: "2017-02-07T04:27:30.686052Z"
>>>>   labels: {…}
>>>>   operation: {…}
>>>> }
>>>>
>>>> Looking around at other logs at around the time of failure, I see:
>>>> 09:57:30.000 [error] 32#32: *35107 recv() failed (104: Connection
>>>> reset by peer) while reading response header from upstream, client:
>>>> 169.254.160.2, server: , request: "POST /read HTTP/1.1", upstream:
>>>> "http://172.17.0.4:8080/read", host: "bigtable-dev.appspot.com"
>>>> AFAICT this request never made it to our servlet.
>>>
>> --
>> You received this message because you are subscribed to a topic in the
>> Google Groups "Google App Engine" group.
>> To unsubscribe from this topic, visit
>> https://groups.google.com/d/topic/google-appengine/zHSuoxkmqjw/unsubscribe.
>> To unsubscribe from this group and all its topics, send an email to
>> [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/google-appengine.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/google-appengine/ea48946b-fbd9-47af-a7b4-136493f0d583%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
