Hi George, it seems I was celebrating prematurely :( We're still facing the issue. I'm trying to deploy the same health check change to our production project.
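For context, the paths our checks point at are backed by a trivial always-200 handler. In sketch form it amounts to no more than the following (illustrative Python/WSGI only, not our actual PHP code; the path names are the ones from our config):

```python
# Illustrative sketch only (Python/WSGI) -- our real app is PHP. This is just
# the shape of the handler behind the health-check paths: anything that
# answers 200 quickly on the configured paths.

HEALTH_PATHS = ("/_ah/health", "/liveness_check", "/readiness_check")

def health_check_app(environ, start_response):
    """Return 200 for health-check paths, 404 for anything else."""
    if environ.get("PATH_INFO", "") in HEALTH_PATHS:
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok"]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

So I don't believe the handler logic can be at fault here; as the logs below show, the VM never gets far enough to serve any request at all.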
Please could you take a look at this log output?

I App Engine CreateVersion default:30-1 [email protected] {"@type":"type.googleapis.com/google.cloud.audit.AuditLog","status":{},"authenticationInfo":{"principalEmail":"[email protected]"},"requestMetadata":{"callerIp":"217.45.176.239","callerSuppliedUserAgent":"google-cloud-sdk x_Tw5K8nnjoRAqULM9PFAC2b gcloud/190.0.1 command/gcloud.app.deploy invocation-id/7…
A The instance aef-default-30--1-m6mp with Debian based image was created for version 30-1
A ====== Unexpected error during VM startup ======
A ==== Dump of VM runtime system logs follows ====
A == Output of 'docker ps -a' ==
A CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
A ========= rebooting. ========================
A ------------------------VM shutdown initiated-----------------------
A Current app health: 0
A Beginning service lameduck.
A Triggering app shutdown handlers.
A Sending SIGUSR1 to fluentd to trigger a log flush.
A {"textPayload":"","insertId":"5kdt4ig2zxfdhb","resource":{"type":"gae_app","labels":{"project_id":"onkho-web-app-live","version_id":"30-1","module_id":"default"}},"timestamp":"2018-02-28T22:36:35.183183909Z","labels":{"compute.googleapis.com/resource_type":"instance","compute.googleapis.com/resource…
A -------------------------------------------------------------------
A --------------App was unhealthy, grabbing debug logs---------------
A Container CID file not found.
A -------------------------------------------------------------------
A -------------------------Tail of app logs--------------------------
A App logs not found.
A -------------------------------------------------------------------
A -------------------------Additional debug info--------------------------
A cat: /home/vmagent/onkho-web-app-live_30-1-407983590104907782.env: No such file or directory
A Instance has machine type of custom-4-8192 has 4 vCPU and 8192 Mb memory. App can use Mb memory.
A VM memory consumption:
A total used free shared buffers cached
A Mem: 8005 276 7728 0 14 117
A -/+ buffers/cache: 144 7861
A Swap: 0 0 0
A VM disk usage:
A Filesystem Type 1K-blocks Used Available Use% Mounted on
A rootfs rootfs 10188088 4544604 5102916 48% /
A udev devtmpfs 10240 0 10240 0% /dev
A /dev/sda1 ext4 10188088 4544604 5102916 48% /
A /dev/sda1 ext4 10188088 4544604 5102916 48% /var/lib/docker/aufs
A Processes running on the VM:
A top - 22:36:35 up 0 min, 0 users, load average: 0.48, 0.13, 0.04
A Tasks: 95 total, 1 running, 94 sleeping, 0 stopped, 0 zombie
A %Cpu(s): 2.1 us, 3.1 sy, 0.0 ni, 85.6 id, 9.1 wa, 0.0 hi, 0.0 si, 0.0 st
A KiB Mem: 8197596 total, 287356 used, 7910240 free, 15352 buffers
A KiB Swap: 0 total, 0 used, 0 free, 120448 cached
A {"textPayload":"","insertId":"5kdt4ig2zxfdi3","resource":{"type":"gae_app","labels":{"module_id":"default","project_id":"onkho-web-app-live","version_id":"30-1"}},"timestamp":"2018-02-28T22:36:35.183183937Z","labels":{"compute.googleapis.com/resource_type":"instance","compute.googleapis.com/resource…
A PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
A 1 root 20 0 10664 1644 1508 S 0.0 0.0 0:00.82 init
A 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
A 3 root 20 0 0 0 0 S 0.0 0.0 0:00.04 ksoftirqd/0
A 4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0
A 5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
A 6 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/u8:0
A 7 root 20 0 0 0 0 S 0.0 0.0 0:00.09 rcu_sched
A 8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
A 9 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
A 10 root rt 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
A 11 root rt 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
A 12 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
A 13 root 20 0 0 0 0 S 0.0 0.0 0:00.03 ksoftirqd/1
A 14 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0
A 15 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0H
A 16 root rt 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/2
A 17 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
A 18 root 20 0 0 0 0 S 0.0 0.0 0:00.02 ksoftirqd/2
A 19 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/2:0
A 20 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/2:0H
A 21 root rt 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/3
A 22 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
A 23 root 20 0 0 0 0 S 0.0 0.0 0:00.02 ksoftirqd/3
A 24 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/3:0
A 25 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/3:0H
A 26 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 khelper
A 27 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kdevtmpfs
A 28 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 netns
A 29 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khungtaskd
A 30 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 writeback
A 31 root 25 5 0 0 0 S 0.0 0.0 0:00.00 ksmd
A 32 root 39 19 0 0 0 S 0.0 0.0 0:00.00 khugepaged
A 33 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 crypto
A 34 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kintegrityd
A 35 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 bioset
A 36 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kblockd
A 37 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/1:1
A 38 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kswapd0
A 39 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 vmstat
A 40 root 20 0 0 0 0 S 0.0 0.0 0:00.00 fsnotify_mark
A 46 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kthrotld
A 47 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/0:1
A 48 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 ipv6_addrconf
A 49 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/2:1
A 50 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 deferwq
A 51 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kworker/u8:1
A 632 root 20 0 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_0
A 633 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 scsi_tmf_0
A 634 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/u8:2
A 635 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/3:1
A 645 root 0 -20 0 0 0 S 0.0 0.0 0:00.02 kworker/0:1H
A 673 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/0:2
A 679 root 20 0 0 0 0 S 0.0 0.0 0:00.00 jbd2/sda1-8
A 680 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 ext4-rsv-conver
A 719 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:1H
A 726 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/2:1H
A 839 root 20 0 21928 3068 1956 S 0.0 0.0 0:00.09 udevd
A 1167 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/3:1H
A 1451 root 20 0 21924 2620 1580 S 0.0 0.0 0:00.00 udevd
A 1452 root 20 0 21924 2624 1580 S 0.0 0.0 0:00.00 udevd
A 1484 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/1:2
A 1489 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kpsmoused
A 1490 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/3:2
A 1528 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/2:2
A 2668 root 20 0 9968 3180 888 S 0.0 0.0 0:00.00 dhclient
A 2749 root 20 0 18988 2016 1760 S 0.0 0.0 0:00.00 rpcbind
A 2780 statd 20 0 23360 2496 2032 S 0.0 0.0 0:00.00 rpc.statd
A 2785 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 rpciod
A 2787 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 nfsiod
A 2796 root 20 0 25312 216 4 S 0.0 0.0 0:00.00 rpc.idmapd
A 3086 root 20 0 115m 2668 2304 S 0.0 0.0 0:00.03 rsyslogd
A 3176 root 20 0 4132 1584 1440 S 0.0 0.0 0:00.00 acpid
A 3289 ntp 20 0 34892 4272 3672 S 0.0 0.1 0:00.00 ntpd
A 3360 root 20 0 20424 2152 1912 S 0.0 0.0 0:00.00 cron
A 3528 root 20 0 143m 42m 35m S 0.0 0.5 0:00.59 kubelet
A 3573 root 20 0 501m 32m 20m S 0.0 0.4 0:00.46 docker
A 3636 root 20 0 336m 9180 7600 S 0.0 0.1 0:00.03 docker-containe
A 3722 root 20 0 4112 1224 1124 S 0.0 0.0 0:00.00 startpar
A 3723 root 20 0 49332 9328 2888 S 0.0 0.1 0:00.00 manage_accounts
A 3750 root 20 0 53732 10m 4144 S 0.0 0.1 0:00.00 manage_accounts
A 3969 root 20 0 4196 1488 1328 S 0.0 0.0 0:00.00 rc
A 3978 root 20 0 12788 2576 1932 S 0.0 0.0 0:00.03 startpar
A 3980 root 20 0 4196 1460 1300 S 0.0 0.0 0:00.00 google-accounts
A 3983 root 20 0 4196 1564 1408 S 0.0 0.0 0:00.00 google-startup-
A 4008 root 20 0 4108 1520 1432 S 0.0 0.0 0:00.15 start-stop-daem
A 4011 root 20 0 9252 2356 2192 S 0.0 0.0 0:00.00 run-shutdown-sc
A 4082 root 20 0 9260 2252 2088 S 0.0 0.0 0:00.00 run-scripts
A 4085 root 20 0 9264 2244 2020 S 0.0 0.0 0:00.00 run-scripts
A 4095 root 20 0 9268 2240 2008 S 0.0 0.0 0:00.00 run-scripts
A 4096 root 20 0 4104 476 396 S 0.0 0.0 0:00.00 logger
A 4098 root 20 0 9460 2508 2080 S 0.0 0.0 0:00.00 run-scripts
A 4281 root 20 0 9464 2516 2084 S 0.0 0.0 0:00.00 run-scripts
A 4282 root 20 0 9460 1956 1528 S 0.0 0.0 0:00.00 run-scripts
A 4283 root 20 0 70732 20m 6660 S 0.0 0.3 0:00.12 python
A 4321 root 20 0 21676 2240 1932 R 0.0 0.0 0:00.00 top
A Additional thread and heap info may be available in app.shutdown if you are using a supported client library.
A -------------------------------------------------------------------

If I can't get to the bottom of this then we'll be forced to go back to Compute Engine. Any help would be greatly appreciated.

On Wednesday, February 28, 2018 at 10:05:36 PM UTC, Karl Tinawi wrote:
>
> Hi George,
>
> It's definitely not the code or our app's startup process as far as I'm aware, because this issue occurs before our Docker image is even pulled or any code of ours is executed. The logs should show that clearly.
>
> The change I applied as per your suggestion relating to the health checks did not involve any code change whatsoever, and yet the VM now starts normally. The only difference is that I now see the new health check requests in the logs during the startup process.
>
> I'll continue testing and reply with any additional findings.
>
> I'd be grateful if you could shed some more light on the incident in question, as I still don't have an answer as to what happened.
>
> On Wednesday, February 28, 2018 at 9:40:49 PM UTC, George (Cloud Platform Support) wrote:
>>
>> Hi Karl,
>>
>> If your app starts up without issues, the health checks would not matter. There are issues with your app's startup process that you need to address. For coding-related issues, you are at an advantage posting to Stack Overflow or similar forums, where you can get expert help from active programmers.
>>
>> On 28 February 2018 at 16:23, Karl Tinawi <[email protected]> wrote:
>>
>>> Hi George,
>>>
>>> I was able to do a test deployment tonight by defining the new health checks as you recommended.
>>>
>>> Before I continue: my test for this is based on performing a deployment, as we see the exact same behaviour there, with the VM starting up and crashing, as we did with the incident on Monday.
>>>
>>> The good news is that this has seemingly resolved our deployment issue completely, in that deployments are once again successful in a reasonable amount of time rather than timing out and failing because of the aforementioned crashes. So at this point I'm semi-confident that it has also resolved the issue we experienced on Monday, when the VM was restarted and couldn't start up again. That one is difficult to prove from my side currently.
>>>
>>> I replaced the legacy health_check block in our .yaml file with the following:
>>>
>>> liveness_check:
>>>   path: "/_ah/health"
>>>   initial_delay_sec: 300
>>>   check_interval_sec: 5
>>>   timeout_sec: 5
>>>   failure_threshold: 3
>>>   success_threshold: 1
>>>
>>> readiness_check:
>>>   path: "/login"
>>>   app_start_timeout_sec: 300
>>>   check_interval_sec: 30
>>>   timeout_sec: 5
>>>   failure_threshold: 3
>>>   success_threshold: 1
>>>
>>> The most obvious question I have at this point is: why? Why would this resolve the issue? I can only guess that it is related to the new-style health/liveness checks being enabled by default, while we had neither executed:
>>>
>>> gcloud beta app update --split-health-checks --project [YOUR_PROJECT_ID]
>>>
>>> nor provided the liveness_check/readiness_check blocks in our yaml file. I've only just learnt about these new updated health checks here
>>> <https://cloud.google.com/appengine/docs/flexible/php/configuring-your-app-with-app-yaml#configuring_supervisord_in_the_php_runtime>
>>> as it's not something we keep up to date with once we have a desired configuration, so I am concerned that there was a backwards-compatibility issue here.
>>>
>>> I'm performing a couple more deployments to satisfy myself that this is not a fluke.
>>>
>>> As a side question, I see these entries in our logs now since activating the new health checks:
>>>
>>> <https://lh3.googleusercontent.com/-yUuumRDccTA/WpcdBRBtY7I/AAAAAAAAABI/I_2N-EwYdWIzv6RJKaLcnxRJdyVoQs7IACLcBGAs/s1600/readiness_check.png>
>>>
>>> <https://lh3.googleusercontent.com/-1q52sNXnYFI/WpcdFCILrbI/AAAAAAAAABM/NZrQfaazq0cwAMCYbvujUfE1l9J6iZynwCLcBGAs/s1600/liveness_check.png>
>>>
>>> These don't seem to be obeying the configuration I defined (as per the snippets above), most notably the path and the interval.
>>>
>>> I'd like to learn if I'm doing anything wrong here, or if there is an explanation.
>>>
>>> Many thanks again, and looking forward to hearing from you.
>>>
>>> Karl
>>>
>>> On Wednesday, February 28, 2018 at 8:12:09 PM UTC, Karl Tinawi wrote:
>>>>
>>>> Hi George,
>>>>
>>>> Yes, that's correct: it's happened once outside of deployments.
>>>>
>>>> To answer your questions:
>>>>
>>>> - We require a custom PHP installation in order to make use of modules that are missing from Google's offering. I've not checked the latest list of extensions, but we may be able to move back to using the standard PHP image, so I'll check this for sure.
>>>> - Scaling is another challenge we're looking at, and we're certainly aware that we need to move to automatic scaling for contingency etc.
>>>> - I'll test configuring the readiness check and report back if we notice any difference in behaviour.
>>>>
>>>> Were the logs helpful? I'd be grateful if you could shed some light on the investigation at your end. This is the first time we've noticed an issue like this during the maintenance process, which should be innocuous and invisible to us.
>>>>
>>>> At this point I'm unsure if the issues we face during deployments, which continue to occur daily, are related to the incident that happened with our running app.
>>>> It's worth noting that the behaviour of the VM is identical (in the way of the abrupt restarts as it tries to boot). I may look at trying a test deployment using another image and seeing if that helps.
>>>>
>>>> Many thanks again,
>>>>
>>>> Karl
>>>>
>>>> On Wednesday, February 28, 2018 at 12:52:06 AM UTC, George (Cloud Platform Support) wrote:
>>>>>
>>>>> Hello Karl,
>>>>>
>>>>> You seem to indicate that the outage is a one-time event, and that there has been no other similar occurrence as yet. If this is so, to prevent similar unwanted events in future, you may configure your app for health checks in detail. For reference, "Configuring your App with app.yaml" should prove of great help. In your app.yaml, you can specify either a liveness check (choosing appropriate parameter values):
>>>>>
>>>>> liveness_check:
>>>>>   path: "/liveness_check"
>>>>>   check_interval_sec: 30
>>>>>   timeout_sec: 4
>>>>>   failure_threshold: 2
>>>>>   success_threshold: 2
>>>>>
>>>>> or a readiness check:
>>>>>
>>>>> readiness_check:
>>>>>   path: "/readiness_check"
>>>>>   check_interval_sec: 5
>>>>>   timeout_sec: 4
>>>>>   failure_threshold: 2
>>>>>   success_threshold: 2
>>>>>   app_start_timeout_sec: 300
>>>>>
>>>>> It is worth noting that the usual way of specifying PHP for your app is:
>>>>>
>>>>> runtime: php  # Required. The name of the App Engine language runtime used by this application. To specify PHP, use php.
>>>>> env: flex
>>>>>
>>>>> Your app uses "runtime: custom", by contrast.
>>>>>
>>>>> You may also switch from manual scaling with only one instance to automatic scaling. If this makes a difference in your app's behavior, that information would help us with debugging.
>>>>>
>>> --
>>> You received this message because you are subscribed to a topic in the Google Groups "Google App Engine" group.
>>> To unsubscribe from this topic, visit https://groups.google.com/d/topic/google-appengine/5mKOn6RzRFY/unsubscribe.
>>> To unsubscribe from this group and all its topics, send an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/google-appengine.
>>> To view this discussion on the web visit https://groups.google.com/d/msgid/google-appengine/fe551f25-65ef-4eef-a1fd-7e53e44f0f06%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/google-appengine.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-appengine/2b89c412-4c06-4bcc-a0f8-3441107e7856%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
