This is still happening, although we are now re-deploying the module more frequently, which mitigates the problem somewhat. I filed an issue in the issue tracker for this:
https://code.google.com/p/googleappengine/issues/detail?id=12938&thanks=12938&ts=1462203290

I'm hoping that if we write our own autoscaler, we might avoid this bug. Rough sketches of the scaler and of pulling the raw metric data are at the bottom of this message.

On Sunday, April 24, 2016 at 12:44:40 PM UTC-4, Evan Jones wrote:
>
> After a day, the problem is still happening. If there are any workarounds,
> I'd love to hear them, because I think this is costing us real money.
> Might I avoid this issue by using manual scaling and calling the
> modules.SetNumInstances() API myself?
>
> I can see the problem very clearly on the chart of CPU usage of these
> instances over time, as charted by Stackdriver. I'm attaching a graph that
> shows many instances getting started and then killed within a few minutes
> (see all the instances whose usage never goes over 50%; none of them ever
> started serving traffic). Unfortunately, the Stackdriver chart is
> completely illegible (just horizontal lines) if I set the time period to
> ~1 day, so I can't tell how long it took after the deploy for this problem
> to start occurring. If it is useful, I'm happy to try pulling the raw data
> and looking at it.
>
> <https://lh3.googleusercontent.com/-gH8C3K7WM1Y/VxztjaQQkjI/AAAAAAAAADA/F_jr1hDw2cYKYTO79br3B8RuVbxsu3GaQCLcB/s1600/Screen%2BShot%2B2016-04-24%2Bat%2B11.44.08%2BAM.png>
>
> I checked the logs, and the symptoms are the same: the instance is started
> and deleted by the autoscaler very quickly. This does not seem to be how
> this is supposed to work.
>
> For one example instance, the first message in syslog from the kernel is
> timestamped "Apr 24 15:25:03" (syslog is in UTC; the API times below are
> local time, UTC-4). I can see that vm_runtime_init is making progress,
> then the instance is killed 3 minutes after it starts. (The create API was
> called at 11:24:37.572 and the delete completed at 11:30:50.494, so it ran
> for less than 7 minutes, meaning we got charged for 10 minutes of work
> that was totally useless.)
>
> Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: Done start 'build_app'.
> Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: Done start 'patch_dockerfile'.
> Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: Done start 'pull_app'.
> Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: start 'api_verifier'.
> Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: start 'build_app'.
> Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: start 'patch_dockerfile'.
> Apr 24 15:26:09 gae-integration--track--gce-20160422t105255-j9wu kernel: [   76.160053] docker0: port 2(veth9d7335e) entered forwarding state
> Apr 24 15:27:56 gae-integration--track--gce-20160422t105255-j9wu kernel: [  183.088252] docker0: port 2(veth9d7335e) entered disabled state
> Apr 24 15:27:56 gae-integration--track--gce-20160422t105255-j9wu kernel: [  183.184346] docker0: port 2(veth9d7335e) entered disabled state
> Apr 24 15:27:56 gae-integration--track--gce-20160422t105255-j9wu kernel: [  183.192162] device veth9d7335e left promiscuous mode
> Apr 24 15:27:56 gae-integration--track--gce-20160422t105255-j9wu kernel: [  183.197552] docker0: port 2(veth9d7335e) entered disabled state
> Apr 24 15:28:00 gae-integration--track--gce-20160422t105255-j9wu init: Switching to runlevel: 1
>
> The crash.log does have many messages like the following, so maybe there
> is a bug in the App Engine API proxy, and the health checking is correctly
> killing "bad" instances?
>
> API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances().
>
> I may have another clue: I found an instance stuck at 0% CPU consumption
> because it somehow got stuck while attempting to reboot. This means I
> could capture the entire serial console log. I have the entire file in
> case it is useful, but I *think* it tried to restart because of these API
> errors? It also entered single-user mode for some reason instead of
> actually restarting, which seems like a separate bug:
>
> ===== Unexpected error during VM startup =====
> === Dump of VM runtime system logs follows ===
> WARNING: HTTP 404 error while fetching metadata key gae_redirect_appengine_googleapis_com. Will treat it as an empty string.
> WARNING: HTTP 404 error while fetching metadata key gae_api_proxy_image_name. Will treat it as an empty string.
> WARNING: HTTP 404 error while fetching metadata key gae_app_container_memory_mb. Will treat it as an empty string.
> WARNING: HTTP 404 error while fetching metadata key gae_cloud_sql_instances. Will treat it as an empty string.
> WARNING: HTTP 404 error while fetching metadata key gae_cloud_sql_proxy_image_name. Will treat it as an empty string.
> WARNING: HTTP 404 error while fetching metadata key gae_extra_nginx_confs. Will treat it as an empty string.
> WARNING: HTTP 404 error while fetching metadata key gae_http_loadbalancer_enabled. Will treat it as an empty string.
> WARNING: HTTP 404 error while fetching metadata key gae_loadbalancer. Will treat it as an empty string.
> WARNING: HTTP 404 error while fetching metadata key gae_loadbalancer_ip. Will treat it as an empty string.
> WARNING: HTTP 404 error while fetching metadata key gae_memcache_proxy. Will treat it as an empty string.
> WARNING: HTTP 404 error while fetching metadata key gae_monitoring_image_name. Will treat it as an empty string.
> WARNING: HTTP 404 error while fetching metadata key gae_use_api_proxy. Will treat it as an empty string.
> WARNING: HTTP 404 error while fetching metadata key gae_use_cloud_monitoring. Will treat it as an empty string.
> vm_runtime_init: Apr 24 09:38:43 Invoking all VM runtime components. /dev/fd/63
> vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: start 'allow_ssh'.
> vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: Done start 'allow_ssh'.
> vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: start 'unlocker'.
> vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: Done start 'unlocker'.
> vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: start 'fluentd_logger'.
> vm_runtime_init: Apr 24 09:38:44 vm_runtime_init: Done start 'fluentd_logger'.
> vm_runtime_init: Apr 24 09:38:44 vm_runtime_init: start 'pull_app'.
> vm_runtime_init: Apr 24 09:38:45 Pulling GAE_FULL_APP_CONTAINER: us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf
> vm_runtime_init: Apr 24 09:39:19 Done pulling app container
> vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: Done start 'pull_app'.
> vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: start 'patch_dockerfile'.
> vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: 'patch-dockerfile': skipping for empty patch.
> vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: Done start 'patch_dockerfile'.
> vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: start 'build_app'.
> vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: 'build_app': skipping for gcloud based deployment.
> vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: Done start 'build_app'.
> vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: start 'api_verifier'.
> vm_runtime_init: Apr 24 09:39:19 checking the status of api_verifier
> vm_runtime_init: Apr 24 09:39:19 starting api_verifier
> No handlers could be found for logger "google.appengine.ext.vmruntime.vmstub"
> Apr 24 09:38:43 Invoking all VM runtime components. /dev/fd/63
> Apr 24 09:38:43 vm_runtime_init: start 'allow_ssh'.
> Apr 24 09:38:43 vm_runtime_init: Done start 'allow_ssh'.
> Apr 24 09:38:43 vm_runtime_init: start 'unlocker'.
> Apr 24 09:38:43 vm_runtime_init: Done start 'unlocker'.
> Apr 24 09:38:43 vm_runtime_init: start 'fluentd_logger'.
> 8a7e9bf2f84fdf6a9381d00dc6766e2cf8766dc9c927952223bc3547ddd9913a
> Apr 24 09:38:44 vm_runtime_init: Done start 'fluentd_logger'.
> Apr 24 09:38:44 vm_runtime_init: start 'pull_app'.
> Apr 24 09:38:45 Pulling GAE_FULL_APP_CONTAINER: us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf
> sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf: Pulling from triggeredmail/appengine/integration-track-gce.20160422t105255
>
> *... lots of docker checksum messages*
>
> Digest: sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf
> Status: Downloaded newer image for us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf
> Apr 24 09:39:19 Done pulling app container
> Apr 24 09:39:19 vm_runtime_init: Done start 'pull_app'.
> Apr 24 09:39:19 vm_runtime_init: start 'patch_dockerfile'.
> Apr 24 09:39:19 vm_runtime_init: 'patch-dockerfile': skipping for empty patch.
> Apr 24 09:39:19 vm_runtime_init: Done start 'patch_dockerfile'.
> Apr 24 09:39:19 vm_runtime_init: start 'build_app'.
> Apr 24 09:39:19 vm_runtime_init: 'build_app': skipping for gcloud based deployment.
> Apr 24 09:39:19 vm_runtime_init: Done start 'build_app'.
> Apr 24 09:39:19 vm_runtime_init: start 'api_verifier'.
> Apr 24 09:39:19 checking the status of api_verifier
> Apr 24 09:39:19 starting api_verifier
> API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances().
>
> *... about 100 more lines of this*
>
> CONTAINER ID   IMAGE                                     COMMAND                  CREATED         STATUS                     PORTS   NAMES
> d8962ad1c641   gcr.io/google_appengine/api-verifier      "./api_verifier.py"      2 minutes ago   Exited (1) 2 seconds ago           sad_ptolemy
> 8a7e9bf2f84f   gcr.io/google_appengine/fluentd-logger    "/opt/google-fluentd/"   2 minutes ago   Up 2 minutes                       desperate_bell
>
> Container: d8962ad1c641
> No handlers could be found for logger "google.appengine.ext.vmruntime.vmstub"
> API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances().
>
> *... about another 100 lines*
>
> API is not available yet because of exception: The remote RPC to the application server failed for call modules.GetNumInstances().
>
> Container: 8a7e9bf2f84f
>
> ========= rebooting. ========================
>
> INIT:
> INIT: Sending processes the TERM signal
>
> INIT: Sending processes the KILL signal
>
> Apr 24 09:41:27 gae-integration--track--gce-20160422t105255-o4qv init: Switching to runlevel: 1
> gcm-StatusUpdate:TIME=1461490887000;STATUS=COMMAND_FAILED;INVOCATION_ID=0
> [ info ] Using makefile-style concurrent boot in runlevel 1.
> Stopping supervisor: supervisord.
> udhcpd: Disabled. Edit /etc/default/udhcpd to enable it.
> Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv rpcbind: rpcbind terminating on signal. Restart with "rpcbind -w"
> Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv rpc.statd[1783]: Caught signal 15, un-registering and exiting
> [ ok ] Stopping rpcbind daemon...
> [ ok ] Unmounting iscsi-backed filesystems: Unmounting all devices marked _netdev.
> Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv google: shutdown script found in metadata.
> Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv shutdownscript: Running shutdown script /var/run/google.shutdown.script
> [ ok ] Stopping NFS common utilities: idmapd statd.
> Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv iscsid: iscsid shutting down.
> [ ok ] Unmounting iscsi-backed filesystems: Unmounting all devices marked _netdev.
> [ ok ] Disconnecting iSCSI targets: iscsiadm: No matching sessions found
> [ ok ] Stopping iSCSI initiator service:
> Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv shutdownscript: Finished running shutdown script /var/run/google.shutdown.script
> [ ok ] Stopping Docker: docker.
> [ ok ] Stopping The Kubernetes container manager: kubelet.
> [ ok ] Stopping enhanced syslogd: rsyslogd.
> [  187.624222] docker0: port 1(vethffde071) entered disabled state
> [  187.692308] docker0: port 1(vethffde071) entered disabled state
> [  187.699568] device vethffde071 left promiscuous mode
> [  187.705014] docker0: port 1(vethffde071) entered disabled state
> gcm-Heartbeat:1461490903000
> Terminating on signal number 15
> [ ok ] Asking all remaining processes to terminate... acpid: exiting done.
> [ ok ] All processes ended within 1 seconds.... done.
> [ info ]
> INIT:
> INIT: Sending processes the KILL signal
>
> sulogin: root account is locked, starting shell
>
> root@gae-integration--track--gce-20160422t105255-o4qv:~#
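
For reference, here is the sort of thing I mean by writing our own autoscaler: a rough, untested sketch of a cron-triggered handler that manages a manual-scaling module ourselves through the modules API. The module name, bounds, and desired_instances() policy are placeholders I made up; only the modules.get_num_instances()/set_num_instances() calls are the documented API, and they require the target module to use manual scaling.

    import logging
    import webapp2
    from google.appengine.api import modules

    TARGET_MODULE = 'integration-track-gce'  # placeholder module name
    MIN_INSTANCES = 2                        # placeholder bounds
    MAX_INSTANCES = 20

    def desired_instances():
        # Placeholder policy: a real version would look at QPS, queue
        # depth, or the CPU metrics pulled in the next sketch.
        return MIN_INSTANCES

    class ScaleHandler(webapp2.RequestHandler):
        def get(self):
            # Only valid for a module configured with manual scaling.
            current = modules.get_num_instances(module=TARGET_MODULE)
            target = max(MIN_INSTANCES,
                         min(MAX_INSTANCES, desired_instances()))
            if target != current:
                logging.info('scaling %s: %d -> %d',
                             TARGET_MODULE, current, target)
                modules.set_num_instances(target, module=TARGET_MODULE)

    app = webapp2.WSGIApplication([('/cron/scale', ScaleHandler)])

This would be wired to a cron.yaml entry hitting /cron/scale every minute or so, so we never rely on the built-in autoscaler creating and destroying VMs.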

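And for pulling the raw data behind that Stackdriver chart, this is roughly what I would try with google-api-python-client against the Monitoring v3 API: dump the per-instance CPU utilization points for a day. The filter string and the instance_name label are my best guess from the docs, so treat this as an untested sketch.

    from googleapiclient import discovery
    from oauth2client.client import GoogleCredentials

    PROJECT = 'triggeredmail'  # project id, from the image path above

    credentials = GoogleCredentials.get_application_default()
    monitoring = discovery.build('monitoring', 'v3', credentials=credentials)

    # One day of GCE CPU utilization samples, one time series per instance.
    # (A real version would follow nextPageToken to get all pages.)
    response = monitoring.projects().timeSeries().list(
        name='projects/%s' % PROJECT,
        filter='metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        interval_startTime='2016-04-24T00:00:00Z',
        interval_endTime='2016-04-25T00:00:00Z',
    ).execute()

    for series in response.get('timeSeries', []):
        instance = series['metric']['labels'].get('instance_name', '?')
        for point in series['points']:
            print('%s %s %.3f' % (instance,
                                  point['interval']['endTime'],
                                  point['value']['doubleValue']))

If that works, it should make it possible to measure exactly how long after a deploy the create/kill churn starts, even when the chart itself is illegible.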