[google-appengine] Re: Flexible environment auto-scaling: Instances stopped after ~2 minutes, before app even runs

Evan Jones Sun, 24 Apr 2016 09:45:06 -0700

After a day, the problem is still happening. If there are any workarounds, 
I'd love to hear it, because I think this is costing us real money. Might I 
avoid this issue by using manual scaling, and calling the 
modules.SetNumInstances() API myself?


I can see the problem very clearly on the chart of CPU usage of these 
instances over time, as charted by Stackdriver. I'm attaching a graph that 
shows many instances getting started then killed within a few minutes (see 
all the instances whose usage never goes over 50%, none of them ever 
started serving traffic). Unfortunately, the stackdriver chart is 
completely illegible with horizontal lines if I set the time period to ~1 
day, so I can't tell how long it took between the deploy and for this 
problem to start occurring. If it is useful, I'm happy to try pulling the 
raw data and looking at it.

<https://lh3.googleusercontent.com/-gH8C3K7WM1Y/VxztjaQQkjI/AAAAAAAAADA/F_jr1hDw2cYKYTO79br3B8RuVbxsu3GaQCLcB/s1600/Screen%2BShot%2B2016-04-24%2Bat%2B11.44.08%2BAM.png>


I checked the logs, and the symptoms are the same: The instance is started 
and deleted by the autoscaler very quickly. This does not seem to be how 
this should be working.

For one example instance, the first message in syslog from the kernel is 
timestamped "Apr 24 15:25:03". I can see that vm_runtime_init is making 
progress, then the instance is killed, 3 minutes after it starts. (The 
create API was called at 11:24:37.572 and the delete completed 
at 11:30:50.494, so it ran for less than 7 minutes, meaning we got charged 
for 10 minutes of work that was totally useless.)


Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu 
vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: Done start 'build_app'.
Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu 
vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: Done start 
'patch_dockerfile'.
Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu 
vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: Done start 'pull_app'.
Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu 
vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: start 'api_verifier'.
Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu 
vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: start 'build_app'.
Apr 24 15:25:54 gae-integration--track--gce-20160422t105255-j9wu 
vm_runtime_init: Apr 24 15:25:54 vm_runtime_init: start 'patch_dockerfile'.
Apr 24 15:26:09 gae-integration--track--gce-20160422t105255-j9wu kernel: [ 
  76.160053] docker0: port 2(veth9d7335e) entered forwarding state
Apr 24 15:27:56 gae-integration--track--gce-20160422t105255-j9wu kernel: [ 
 183.088252] docker0: port 2(veth9d7335e) entered disabled state
Apr 24 15:27:56 gae-integration--track--gce-20160422t105255-j9wu kernel: [ 
 183.184346] docker0: port 2(veth9d7335e) entered disabled state
Apr 24 15:27:56 gae-integration--track--gce-20160422t105255-j9wu kernel: [ 
 183.192162] device veth9d7335e left promiscuous mode
Apr 24 15:27:56 gae-integration--track--gce-20160422t105255-j9wu kernel: [ 
 183.197552] docker0: port 2(veth9d7335e) entered disabled state
Apr 24 15:28:00 gae-integration--track--gce-20160422t105255-j9wu init: 
Switching to runlevel: 1


the crash.log does have many messages like the following, so maybe there is 
a bug in the Appengine API proxy, and the health checking is correctly 
killing "bad" instances?

API is not available yet because of exception: The remote RPC to the 
application server failed for call modules.GetNumInstances().



I may have another clue: I found an instance stuck at 0% CPU consumption 
because it somehow got stuck while attempting to reboot. This means I could 
capture the entire serial console log. I have the entire file in case it is 
useful, but I *think* it tried to restart because of these API errors? It 
also entered single user mode for some reason instead of actually 
restarting, which seems like a separate bug:


===== Unexpected error during VM startup =====

=== Dump of VM runtime system logs follows ===

WARNING: HTTP 404 error while fetching metadata key 
gae_redirect_appengine_googleapis_com. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key 
gae_api_proxy_image_name. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key 
gae_app_container_memory_mb. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key 
gae_cloud_sql_instances. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key 
gae_cloud_sql_proxy_image_name. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_extra_nginx_confs. 
Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key 
gae_http_loadbalancer_enabled. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_loadbalancer. Will 
treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_loadbalancer_ip. 
Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_memcache_proxy. 
Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key 
gae_monitoring_image_name. Will treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key gae_use_api_proxy. Will 
treat it as an empty string.

WARNING: HTTP 404 error while fetching metadata key 
gae_use_cloud_monitoring. Will treat it as an empty string.

vm_runtime_init: Apr 24 09:38:43 Invoking all VM runtime components. 
/dev/fd/63

vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: start 'allow_ssh'.

vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: Done start 'allow_ssh'.

vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: start 'unlocker'.

vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: Done start 'unlocker'.

vm_runtime_init: Apr 24 09:38:43 vm_runtime_init: start 'fluentd_logger'.

vm_runtime_init: Apr 24 09:38:44 vm_runtime_init: Done start 
'fluentd_logger'.

vm_runtime_init: Apr 24 09:38:44 vm_runtime_init: start 'pull_app'.

vm_runtime_init: Apr 24 09:38:45 Pulling GAE_FULL_APP_CONTAINER: 
us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf

vm_runtime_init: Apr 24 09:39:19 Done pulling app container

vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: Done start 'pull_app'.

vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: start 'patch_dockerfile'.

vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: 'patch-dockerfile': 
skipping for empty patch.

vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: Done start 
'patch_dockerfile'.

vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: start 'build_app'.

vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: 'build_app': skipping for 
gcloud based deployment.

vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: Done start 'build_app'.

vm_runtime_init: Apr 24 09:39:19 vm_runtime_init: start 'api_verifier'.

vm_runtime_init: Apr 24 09:39:19 checking the status of api_verifier

vm_runtime_init: Apr 24 09:39:19 starting api_verifier

No handlers could be found for logger 
"google.appengine.ext.vmruntime.vmstub"

Apr 24 09:38:43 Invoking all VM runtime components. /dev/fd/63

Apr 24 09:38:43 vm_runtime_init: start 'allow_ssh'.

Apr 24 09:38:43 vm_runtime_init: Done start 'allow_ssh'.

Apr 24 09:38:43 vm_runtime_init: start 'unlocker'.

Apr 24 09:38:43 vm_runtime_init: Done start 'unlocker'.

Apr 24 09:38:43 vm_runtime_init: start 'fluentd_logger'.

8a7e9bf2f84fdf6a9381d00dc6766e2cf8766dc9c927952223bc3547ddd9913a

Apr 24 09:38:44 vm_runtime_init: Done start 'fluentd_logger'.

Apr 24 09:38:44 vm_runtime_init: start 'pull_app'.

Apr 24 09:38:45 Pulling GAE_FULL_APP_CONTAINER: 
us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf

sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf: 
Pulling from triggeredmail/appengine/integration-track-gce.20160422t105255


*  ... lots of docker checksum messages*


Digest: 
sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf

Status: Downloaded newer image for 
us.gcr.io/triggeredmail/appengine/integration-track-gce.20160422t105255@sha256:52dde4c6b3d053419c247afa51d4ec4093392bba5bd7f713639cbc92e561bccf

Apr 24 09:39:19 Done pulling app container

Apr 24 09:39:19 vm_runtime_init: Done start 'pull_app'.

Apr 24 09:39:19 vm_runtime_init: start 'patch_dockerfile'.

Apr 24 09:39:19 vm_runtime_init: 'patch-dockerfile': skipping for empty 
patch.

Apr 24 09:39:19 vm_runtime_init: Done start 'patch_dockerfile'.

Apr 24 09:39:19 vm_runtime_init: start 'build_app'.

Apr 24 09:39:19 vm_runtime_init: 'build_app': skipping for gcloud based 
deployment.

Apr 24 09:39:19 vm_runtime_init: Done start 'build_app'.

Apr 24 09:39:19 vm_runtime_init: start 'api_verifier'.

Apr 24 09:39:19 checking the status of api_verifier

Apr 24 09:39:19 starting api_verifier

API is not available yet because of exception: The remote RPC to the 
application server failed for call modules.GetNumInstances().


*... about 100 more lines of this*


CONTAINER ID        IMAGE                                    COMMAND       
           CREATED             STATUS                     PORTS             
  NAMES

d8962ad1c641        gcr.io/google_appengine/api-verifier     
"./api_verifier.py"      2 minutes ago       Exited (1) 2 seconds ago       
                sad_ptolemy

8a7e9bf2f84f        gcr.io/google_appengine/fluentd-logger   
"/opt/google-fluentd/"   2 minutes ago       Up 2 minutes                   
                desperate_bell

Container: d8962ad1c641

No handlers could be found for logger 
"google.appengine.ext.vmruntime.vmstub"

API is not available yet because of exception: The remote RPC to the 
application server failed for call modules.GetNumInstances().


*... about another 100 lines*


API is not available yet because of exception: The remote RPC to the 
application server failed for call modules.GetNumInstances().

Container: 8a7e9bf2f84f

========= rebooting. ========================


INIT: 

INIT: Sending processes the TERM signal



INIT: Sending processes the KILL signal


Apr 24 09:41:27 gae-integration--track--gce-20160422t105255-o4qv init: 
Switching to runlevel: 1

gcm-StatusUpdate:TIME=1461490887000;STATUS=COMMAND_FAILED;INVOCATION_ID=0

[ [36minfo [39;49m] Using makefile-style concurrent boot in runlevel 1.

Stopping supervisor: supervisord.

udhcpd: Disabled. Edit /etc/default/udhcpd to enable it.

Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv rpcbind: 
rpcbind terminating on signal. Restart with "rpcbind -w"

Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv 
rpc.statd[1783]: Caught signal 15, un-registering and exiting

[....] Stopping rpcbind daemon... [?25l [?1c 7 [1G[ [32m ok 
 [39;49m 8 [?25h [?0c.

[....] Unmounting iscsi-backed filesystems: Unmounting all devices marked 
_netdev [?25l [?1c 7 [1G[ [32m ok  [39;49m 8 [?25h [?0c.

Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv google: 
shutdown script found in metadata.

Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv 
shutdownscript: Running shutdown script /var/run/google.shutdown.script

[....] Stopping NFS common utilities: idmapd statd [?25l [?1c 7 [1G[ [32m 
ok  [39;49m 8 [?25h [?0c.

Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv iscsid: 
iscsid shutting down.

[....] Unmounting iscsi-backed filesystems: Unmounting all devices marked 
_netdev [?25l [?1c 7 [1G[ [32m ok  [39;49m 8 [?25h [?0c.

[....] Disconnecting iSCSI targets:iscsiadm: No matching sessions found

 [?25l [?1c 7 [1G[ [32m ok  [39;49m 8 [?25h [?0c.

[....] Stopping iSCSI initiator service: [?25l [?1c 7 [1G[ [32m ok 
 [39;49m 8 [?25h [?0c.

Apr 24 09:41:28 gae-integration--track--gce-20160422t105255-o4qv 
shutdownscript: Finished running shutdown script 
/var/run/google.shutdown.script

[....] Stopping Docker: docker [?25l [?1c 7 [1G[ [32m ok 
 [39;49m 8 [?25h [?0c.

[....] Stopping The Kubernetes container manager: 
kubelet [?25l [?1c 7 [1G[ [32m ok  [39;49m 8 [?25h [?0c.

[....] Stopping enhanced syslogd: rsyslogd [?25l [?1c 7 [1G[ [32m ok 
 [39;49m 8 [?25h [?0c.

[  187.624222] docker0: port 1(vethffde071) entered disabled state

[  187.692308] docker0: port 1(vethffde071) entered disabled state

[  187.699568] device vethffde071 left promiscuous mode

[  187.705014] docker0: port 1(vethffde071) entered disabled state

gcm-Heartbeat:1461490903000

Terminating on signal number 15

[....] Asking all remaining processes to terminate...acpid: exiting


 [?25l [?1c 7 [1G[ [32m ok  [39;49m 8 [?25h [?0cdone.

[....] All processes ended within 1 seconds.... [?25l [?1c 7 [1G[ [32m ok 
 [39;49m 8 [?25h [?0cdone.

[ [36minfo [39;4

INIT

INIT: Sending processes the KILL signal


sulogin: root account is locked, starting shell

root@gae-integration--track--gce-20160422t105255-o4qv:~# 


-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/google-appengine.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/google-appengine/40c898ad-3176-447e-b468-38c0d6d0764f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[google-appengine] Re: Flexible environment auto-scaling: Instances stopped after ~2 minutes, before app even runs

Reply via email to