After a day, the problem is still happening. If there are any workarounds, I'd love to hear it, because I think this is costing us real money. Might I avoid this issue by using manual scaling, and calling the modules.SetNumInstances() API myself?
I can see the problem very clearly on the chart of CPU usage of these instances over time, as charted by Stackdriver. I'm attaching a graph that shows many instances getting started then killed within a few minutes (see all the instances whose usage never goes over 50%, none of them ever started serving traffic). Unfortunately, the stackdriver chart is completely illegible with horizontal lines if I set the time period to ~1 day, so I can't tell how long it took between the deploy and for this problem to start occurring. If it is useful, I'm happy to try pulling the raw data and looking at it. <https://lh3.googleusercontent.com/-gH8C3K7WM1Y/VxztjaQQkjI/AAAAAAAAADA/F_jr1hDw2cYKYTO79br3B8RuVbxsu3GaQCLcB/s1600/Screen%2BShot%2B2016-04-24%2Bat%2B11.44.08%2BAM.png> I checked the logs, and the symptoms are the same: The instance is started and deleted by the autoscaler very quickly. This does not seem to be how this should be working. For one example instance, the first message in syslog from the kernel is timestamped "Apr 24 15:25:03". I can see that vm_runtime_init is making progress, then the instance is killed, 3 minutes after it starts. (The create API was called at 11:24:37.572 and the delete completed at 11:30:50.494, so it ran for less than 7 minutes, meaning we got charged for 10 minutes of work that was totally useless.) Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: Done start 'build_app'. Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: Done start 'patch_dockerfile'. Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: Done start 'pull_app'. Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: start 'api_verifier'. Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: start 'build_app'. Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: start 'patch_dockerfile'. Apr 24 15:26:09 gae-integration--track--gce-20160422t105255-j9wu kernel: [ 76.160053] docker0: port 2(veth9d7335e) entered forwarding state Apr 24 15:27:56 gae-integration--track--gce-20160422t105255-j9wu kernel: [ 183.088252] docker0: port 2(veth9d7335e) entered disabled state Apr 24 15:27:56 gae-integration--track--gce-20160422t105255-j9wu kernel: [ 183.184346] docker0: port 2(veth9d7335e) entered disabled state Apr 24 15:27:56 gae-integration--track--gce-20160422t105255-j9wu kernel: [ 183.192162] device veth9d7335e left promiscuous mode Apr 24 15:27:56 gae-integration--track--gce-20160422t105255-j9wu kernel: [ 183.197552] docker0: port 2(veth9d7335e) entered disabled state Apr 24 15:28:00 gae-integration--track--gce-20160422t105255-j9wu init: Switching to runlevel: 1 the crash.log does have many messages like the following, so maybe there is a bug in the Appengine API proxy, and the health checking is correctly killing "bad" instances? API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances(). I may have another clue: I found an instance stuck at 0% CPU consumption because it somehow got stuck while attempting to reboot. This means I could capture the entire serial console log. I have the entire file in case it is useful, but I *think* it tried to restart because of these API errors? It also entered single user mode for some reason instead of actually restarting, which seems like a separate bug: ===== Unexpected error during VM startup ===== === Dump of VM runtime system logs follows === WARNING: HTTP 404 error while fetching metadata key gae_redirect_appengine_googleapis_com. Will treat it as an empty string. WARNING: HTTP 404 error while fetching metadata key gae_api_proxy_image_name. Will treat it as an empty string. WARNING: HTTP 404 error while fetching metadata key gae_app_container_memory_mb. Will treat it as an empty string. WARNING: HTTP 404 error while fetching metadata key gae_cloud_sql_instances. Will treat it as an empty string. WARNING: HTTP 404 error while fetching metadata key gae_cloud_sql_proxy_image_name. Will treat it as an empty string. WARNING: HTTP 404 error while fetching metadata key gae_extra_nginx_confs. Will treat it as an empty string. WARNING: HTTP 404 error while fetching metadata key gae_http_loadbalancer_enabled. Will treat it as an empty string. WARNING: HTTP 404 error while fetching metadata key gae_loadbalancer. Will treat it as an empty string. WARNING: HTTP 404 error while fetching metadata key gae_loadbalancer_ip. Will treat it as an empty string. WARNING: HTTP 404 error while fetching metadata key gae_memcache_proxy. Will treat it as an empty string. WARNING: HTTP 404 error while fetching metadata key gae_monitoring_image_name. Will treat it as an empty string. WARNING: HTTP 404 error while fetching metadata key gae_use_api_proxy. Will treat it as an empty string. WARNING: HTTP 404 error while fetching metadata key gae_use_cloud_monitoring. Will treat it as an empty string. vm_runtime_init: Apr 24 09:38:43 Invoking all VM runtime components. /dev/fd/63 vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: start 'allow_ssh'. vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: Done start 'allow_ssh'. vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: start 'unlocker'. vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: Done start 'unlocker'. vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: start 'fluentd_logger'. vm_runtime_init: Apr 24 09:38:44 vm_runtime_init: Done start 'fluentd_logger'. vm_runtime_init: Apr 24 09:38:44 vm_runtime_init: start 'pull_app'. vm_runtime_init: Apr 24 09:38:45 Pulling GAE_FULL_APP_CONTAINER: us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf vm_runtime_init: Apr 24 09:39:19 Done pulling app container vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: Done start 'pull_app'. vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: start 'patch_dockerfile'. vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: 'patch-dockerfile': skipping for empty patch. vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: Done start 'patch_dockerfile'. vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: start 'build_app'. vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: 'build_app': skipping for gcloud based deployment. vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: Done start 'build_app'. vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: start 'api_verifier'. vm_runtime_init: Apr 24 09:39:19 checking the status of api_verifier vm_runtime_init: Apr 24 09:39:19 starting api_verifier No handlers could be found for logger "google.appengine.ext.vmruntime.vmstub" Apr 24 09:38:43 Invoking all VM runtime components. /dev/fd/63 Apr 24 09:38:43 vm_runtime_init: start 'allow_ssh'. Apr 24 09:38:43 vm_runtime_init: Done start 'allow_ssh'. Apr 24 09:38:43 vm_runtime_init: start 'unlocker'. Apr 24 09:38:43 vm_runtime_init: Done start 'unlocker'. Apr 24 09:38:43 vm_runtime_init: start 'fluentd_logger'. 8a7e9bf2f84fdf6a9381d00dc6766e2cf8766dc9c927952223bc3547ddd9913a Apr 24 09:38:44 vm_runtime_init: Done start 'fluentd_logger'. Apr 24 09:38:44 vm_runtime_init: start 'pull_app'. Apr 24 09:38:45 Pulling GAE_FULL_APP_CONTAINER: us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf: Pulling from triggeredmail/appengine/integration-track-gce.20160422t105255 * ... lots of docker checksum messages* Digest: sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf Status: Downloaded newer image for us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf Apr 24 09:39:19 Done pulling app container Apr 24 09:39:19 vm_runtime_init: Done start 'pull_app'. Apr 24 09:39:19 vm_runtime_init: start 'patch_dockerfile'. Apr 24 09:39:19 vm_runtime_init: 'patch-dockerfile': skipping for empty patch. Apr 24 09:39:19 vm_runtime_init: Done start 'patch_dockerfile'. Apr 24 09:39:19 vm_runtime_init: start 'build_app'. Apr 24 09:39:19 vm_runtime_init: 'build_app': skipping for gcloud based deployment. Apr 24 09:39:19 vm_runtime_init: Done start 'build_app'. Apr 24 09:39:19 vm_runtime_init: start 'api_verifier'. Apr 24 09:39:19 checking the status of api_verifier Apr 24 09:39:19 starting api_verifier API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances(). *... about 100 more lines of this* CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES d8962ad1c641 gcr.io/google_appengine/api-verifier "./api_verifier.py" 2 minutes ago Exited (1) 2 seconds ago sad_ptolemy 8a7e9bf2f84f gcr.io/google_appengine/fluentd-logger "/opt/google-fluentd/" 2 minutes ago Up 2 minutes desperate_bell Container: d8962ad1c641 No handlers could be found for logger "google.appengine.ext.vmruntime.vmstub" API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances(). *... about another 100 lines* API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances(). Container: 8a7e9bf2f84f ========= rebooting. ======================== INIT: INIT: Sending processes the TERM signal INIT: Sending processes the KILL signal Apr 24 09:41:27 gae-integration--track--gce-20160422t105255-o4qv init: Switching to runlevel: 1 gcm-StatusUpdate:TIME=1461490887000;STATUS=COMMAND_FAILED;INVOCATION_ID=0 [ [36minfo [39;49m] Using makefile-style concurrent boot in runlevel 1. Stopping supervisor: supervisord. udhcpd: Disabled. Edit /etc/default/udhcpd to enable it. Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv rpcbind: rpcbind terminating on signal. Restart with "rpcbind -w" Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv rpc.statd[1783]: Caught signal 15, un-registering and exiting [....] Stopping rpcbind daemon... [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c. [....] Unmounting iscsi-backed filesystems: Unmounting all devices marked _netdev [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c. Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv google: shutdown script found in metadata. Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv shutdownscript: Running shutdown script /var/run/google.shutdown.script [....] Stopping NFS common utilities: idmapd statd [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c. Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv iscsid: iscsid shutting down. [....] Unmounting iscsi-backed filesystems: Unmounting all devices marked _netdev [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c. [....] Disconnecting iSCSI targets:iscsiadm: No matching sessions found [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c. [....] Stopping iSCSI initiator service: [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c. Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv shutdownscript: Finished running shutdown script /var/run/google.shutdown.script [....] Stopping Docker: docker [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c. [....] Stopping The Kubernetes container manager: kubelet [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c. [....] Stopping enhanced syslogd: rsyslogd [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0c. [ 187.624222] docker0: port 1(vethffde071) entered disabled state [ 187.692308] docker0: port 1(vethffde071) entered disabled state [ 187.699568] device vethffde071 left promiscuous mode [ 187.705014] docker0: port 1(vethffde071) entered disabled state gcm-Heartbeat:1461490903000 Terminating on signal number 15 [....] Asking all remaining processes to terminate...acpid: exiting [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0cdone. [....] All processes ended within 1 seconds.... [?25l [?1c 7 [1G[ [32m ok [39;49m 8 [?25h [?0cdone. [ [36minfo [39;4 INIT INIT: Sending processes the KILL signal sulogin: root account is locked, starting shell root@gae-integration--track--gce-20160422t105255-o4qv:~# -- You received this message because you are subscribed to the Google Groups "Google App Engine" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at https://groups.google.com/group/google-appengine. To view this discussion on the web visit https://groups.google.com/d/msgid/google-appengine/40c898ad-3176-447e-b468-38c0d6d0764f%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
