Thanks for the response.

1. Yes, this is still happening, roughly 3-30 times per hour.
2. Currently we only have one "flexible" module in production, so it only happens on that one.
3. I have no health checking configured; I'm relying on the default configuration.
4. We have no custom code responding to health checks.
5. We are using the standard runtime:
runtime: python-compat
threadsafe: true
vm: true

One other fact that might matter: our application container is pretty large. Since the flexible environment doesn't seem to obey skip_files, it's ~100MB. Maybe it just takes a while to download.

I'm happy to file a support ticket with our Silver-level enterprise support if that is a better forum than these groups.

Thanks!

Evan

On Thursday, April 21, 2016 at 3:19:37 PM UTC-4, Nicholas (Google Cloud Support) wrote:
>
> Thanks for posting your issue here and including the various logs.
>
> Regarding your question about the autoscaler, I do not know whether the
> flexible runtime infrastructure uses a standard GCE autoscaler for managing
> VM instances.
>
> With the errors you're getting, we'd need to investigate this more deeply to
> understand what has been happening:
>
>    - Is this still happening?
>    - Is this limited to a specific module or all your flexible runtime
>    modules?
>    - What health checking configuration do you have set up in your yaml
>    file?
>    - Can you provide a snippet of the code responding to health checks?
>    - Are you using a standard runtime or a custom runtime? If custom, how
>    is your Dockerfile set up?
>
> Forgive me for asking for so much generally, but it's not entirely clear
> where this issue is arising. The information requested above may help us
> learn more.
>
> On Wednesday, April 20, 2016 at 6:14:46 PM UTC-4, Evan Jones wrote:
>>
>> *TL;DR*: I'm seeing the flexible environment autoscaler start a machine,
>> then stop it 2 minutes later, before it has even finished starting the
>> application. Is this a misconfiguration on my side or a bug? Could the
>> issue be that our application loads too slowly? Can I reconfigure this
>> somehow? We see lots of instances "fail to start" like this, and I can't
>> figure out why exactly.
>>
>> *Longer description*
>>
>> We've been experimenting with moving parts of our application to the new
>> "Flexible" environment.
>> So far it works well! However, I've noticed that the autoscaler is
>> extremely aggressive at stopping instances, which contradicts the
>> documentation. I'm assuming the Flexible environment is using the GCE
>> autoscaler. The documentation there states: "the 10 minute delay prevents
>> virtual machine instances from being terminated before the instance has
>> been running for 10 minutes. Cool down periods for new instances are
>> ignored when deciding whether to scale down a group":
>>
>> https://cloud.google.com/compute/docs/autoscaler/understanding-autoscaler-decisions
>>
>> However, I've been looking at when instances are created and stopped, and
>> I see many that get stopped after ~2-3 minutes, before the machine has even
>> booted! From the description above, it seems it should let my machine run
>> for at least 10 minutes before stopping it. Could this be getting killed by
>> some sort of health checking? This module is using the defaults, which look
>> like they should wait at least 5 minutes before declaring an instance
>> unhealthy.
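For reference, the defaults can be overridden with an explicit health_check stanza in app.yaml. A sketch with the fields spelled out; the field names follow the App Engine managed-VM health checking docs, and the values below are illustrative examples, not anything we actually run in production:

```yaml
# Illustrative only: explicit health check settings for a vm: true module.
# Field names follow the App Engine managed-VM docs; values are examples.
health_check:
  enable_health_check: True
  check_interval_sec: 5        # seconds between health checks
  timeout_sec: 4               # per-check timeout
  unhealthy_threshold: 2       # consecutive failures before marked unhealthy
  healthy_threshold: 2         # consecutive successes before marked healthy
  restart_threshold: 60        # failed checks before the VM is restarted
```

If the restart/unhealthy thresholds really are what drive the autoscaler's delete decision, loosening them might be one lever to test.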
>>
>> I've actually seen a number of other issues that seem to cause VMs to
>> fail to start:
>>
>> * Long sequence of "API is not available yet because of exception: The
>> remote RPC to the application server failed for call
>> modules.GetNumInstances()" errors, followed by shutdown after about
>>
>> *Technical details*
>>
>> In the Compute Engine API logs, I see create/delete records like the
>> following, showing an instance created then deleted within ~2 minutes:
>>
>> 17:11:15.264 compute.instances.insert type:GCE_API_CALL resource
>> id:"2166106899089084061"
>> 17:11:51.535 type:GCE_OPERATION_DONE (matching the previous call)
>> 17:13:40.427 compute.instances.delete type:GCE_API_CALL resource
>> id:"2166106899089084061"
>> 17:14:30.572 type:GCE_OPERATION_DONE
>>
>> When I search for logs from this instance itself, the last messages
>> across all logs are from syslog, showing it was still running the "startup"
>> process provided by Google when the shutdown signal from the delete
>> operation was executed. It seems from this log that it didn't even get to
>> our code; I think it was killed while running Google's code. I see similar
>> issues where instances run for ~5 minutes and have a sequence of errors
>> related to "api_verifier".
>>
>> Any ideas what is causing this, and what we can do to make it happen less
>> often?
>>
>> Thank you!
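On the "custom code responding to health checks" question: we have none, but for anyone comparing, such a responder is only a few lines. A minimal standard-library sketch of what one might look like; this is illustrative only, not our code, and `health_app` is a hypothetical name:

```python
# Illustrative sketch only: what a custom /_ah/health responder could look
# like. Our production app has no such handler; 'health_app' is hypothetical.

def health_app(environ, start_response):
    """WSGI app that answers App Engine's health check endpoint."""
    if environ.get('PATH_INFO') == '/_ah/health':
        # A 200 response tells the health checker the instance is healthy.
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [b'ok']
    start_response('404 Not Found', [('Content-Type', 'text/plain')])
    return [b'not found']

# To serve it locally: wsgiref.simple_server.make_server('', 8080, health_app)
```

Since our module has no handler like this, whatever is answering (or failing to answer) the health checks during startup is entirely Google's plumbing.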
>> >> >> >> >> Apr 20 21:11:50 gae-integration--track--gce-20160415t135311-cyl8 >> vm_runtime_init: Apr 20 21:11:50 Pulling GAE_FULL_APP_CONTAINER: >> us.gcr.io/TRUNCATED@sha256:f70b09a8217430bde2e45d6ccc868e3699b92f965eff12cc6ed19865ce131585 >> Apr 20 21:12:03 gae-integration--track--gce-20160415t135311-cyl8 kernel: >> [ 27.964060] docker0: port 1(veth75e5c89) entered forwarding state >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 kernel: >> [ 72.892834] aufs au_opts_verify:1570:docker[2262]: dirperm1 breaks the >> protection by the permission bits on the lower branch >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 >> vm_runtime_init: Apr 20 21:12:48 vm_runtime_init: 'build_app': skipping for >> gcloud based deployment. >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 >> vm_runtime_init: Apr 20 21:12:48 vm_runtime_init: 'patch-dockerfile': >> skipping for empty patch. >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 kernel: >> [ 73.034744] aufs au_opts_verify:1570:docker[2262]: dirperm1 breaks the >> protection by the permission bits on the lower branch >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 >> vm_runtime_init: Apr 20 21:12:48 vm_runtime_init: start 'patch_dockerfile'. >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 kernel: >> [ 73.348051] docker0: port 2(veth0dcc920) entered forwarding state >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 kernel: >> [ 73.076657] aufs au_opts_verify:1570:docker[2262]: dirperm1 breaks the >> protection by the permission bits on the lower branch >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 >> vm_runtime_init: Apr 20 21:12:48 vm_runtime_init: start 'api_verifier'. >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 >> vm_runtime_init: Apr 20 21:12:48 vm_runtime_init: Done start 'build_app'. 
>> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 kernel: >> [ 73.354433] docker0: port 2(veth0dcc920) entered forwarding state >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 >> vm_runtime_init: Apr 20 21:12:48 vm_runtime_init: start 'build_app'. >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 kernel: >> [ 73.094180] device veth0dcc920 entered promiscuous mode >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 >> vm_runtime_init: Apr 20 21:12:48 vm_runtime_init: Done start 'pull_app'. >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 >> vm_runtime_init: Apr 20 21:12:48 starting api_verifier >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 >> vm_runtime_init: Apr 20 21:12:48 checking the status of api_verifier >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 kernel: >> [ 73.106564] IPv6: ADDRCONF(NETDEV_UP): veth0dcc920: link is not ready >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 kernel: >> [ 73.340512] IPv6: ADDRCONF(NETDEV_CHANGE): veth0dcc920: link becomes >> ready >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 >> vm_runtime_init: Apr 20 21:12:48 Done pulling app container >> Apr 20 21:12:48 gae-integration--track--gce-20160415t135311-cyl8 >> vm_runtime_init: Apr 20 21:12:48 vm_runtime_init: Done start >> 'patch_dockerfile'. 
>> Apr 20 21:13:03 gae-integration--track--gce-20160415t135311-cyl8 kernel: >> [ 88.380041] docker0: port 2(veth0dcc920) entered forwarding state >> Apr 20 21:13:42 gae-integration--track--gce-20160415t135311-cyl8 init: >> Switching to runlevel: 0 >> Apr 20 21:13:42 gae-integration--track--gce-20160415t135311-cyl8 >> shutdown[3370]: shutting down for system halt >> Apr 20 21:13:42 gae-integration--track--gce-20160415t135311-cyl8 >> shutdown[3352]: shutting down for system halt >> Apr 20 21:13:43 gae-integration--track--gce-20160415t135311-cyl8 >> shutdownscript: Running shutdown script /var/run/google.shutdown.script >> Apr 20 21:13:43 gae-integration--track--gce-20160415t135311-cyl8 iscsid: >> iscsid shutting down. >> Apr 20 21:13:43 gae-integration--track--gce-20160415t135311-cyl8 >> shutdownscript: Finished running shutdown script >> /var/run/google.shutdown.script >> Apr 20 21:13:43 gae-integration--track--gce-20160415t135311-cyl8 google: >> shutdown script found in metadata. >> --------------------------------------------------------------------- >> --------------------------Tail of app logs--------------------------- >> " % Total % Received % Xferd Average Speed Time Time Time >> Current" >> --------------------------------------------------------------------- >> Current app health: 0 >> "---------------App was unhealthy, grabbing debug logs----------------" >> App logs not found. >> " Dload Upload Total Spent Left >> Speed" >> Sending SIGUSR1 to fluentd to trigger a log flush. >> App shutdown handlers done. >> Container CID file not found. >> --------------------------------------------------------------------- >> " >> 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: >> (7) couldn't connect to host" >> 47e7078ed32619023f23f15022d4965381f0abc634ea9e48d173e312beb43b51 >> Triggering app shutdown handlers. 
>> -------------------------VM shutdown initiated------------------------ >> >> >> >> >> >> -- You received this message because you are subscribed to the Google Groups "Google App Engine" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at https://groups.google.com/group/google-appengine. To view this discussion on the web visit https://groups.google.com/d/msgid/google-appengine/013ea976-add4-4af2-8a65-7c41c8b718f3%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
