Repository: incubator-brooklyn
Updated Branches:
  refs/heads/master 726cae34d -> a7b3d8e99


Adds docs ops/troubleshooting

- Moves troubleshooting-connectivity from dev/tips to ops
- Adds troubleshooting guides for:
  - runtime-errors
  - deployment
  - software process


Project: http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/commit/3d29c8b5
Tree: http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/tree/3d29c8b5
Diff: http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/diff/3d29c8b5

Branch: refs/heads/master
Commit: 3d29c8b5db0925149181853f5380ea00add57f82
Parents: 2013f2f
Author: Aled Sage <[email protected]>
Authored: Thu Jul 23 19:47:23 2015 -0700
Committer: Alex Heneveld <[email protected]>
Committed: Fri Jul 24 15:24:58 2015 +0100

----------------------------------------------------------------------
 docs/guide/dev/index.md                         |   2 -
 .../dev/tips/troubleshooting-connectivity.md    | 143 -------------------
 docs/guide/ops/index.md                         |   1 +
 .../images/failed-task-large.png                | Bin 0 -> 169079 bytes
 .../images/jmx-sensors-large.png                | Bin 0 -> 197177 bytes
 docs/guide/ops/troubleshooting/index.md         |  11 ++
 .../troubleshooting-connectivity.md             | 143 +++++++++++++++++++
 .../troubleshooting-deployment.md               |  88 ++++++++++++
 .../troubleshooting-runtime-errors.md           | 116 +++++++++++++++
 .../troubleshooting-softwareprocess.md          |  50 +++++++
 10 files changed, 409 insertions(+), 145 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/dev/index.md
----------------------------------------------------------------------
diff --git a/docs/guide/dev/index.md b/docs/guide/dev/index.md
index aa04ef9..0a7acfd 100644
--- a/docs/guide/dev/index.md
+++ b/docs/guide/dev/index.md
@@ -14,8 +14,6 @@ children:
 - tips/
 - tips/logging.md
 - tips/debugging-remote-brooklyn.md
-- tips/troubleshooting-exceptions.md
-- tips/troubleshooting-connectivity.md
 - rest/rest-api-doc.md
 ---

http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/dev/tips/troubleshooting-connectivity.md ---------------------------------------------------------------------- diff --git a/docs/guide/dev/tips/troubleshooting-connectivity.md b/docs/guide/dev/tips/troubleshooting-connectivity.md deleted file mode 100644 index 07874c0..0000000 --- a/docs/guide/dev/tips/troubleshooting-connectivity.md +++ /dev/null @@ -1,143 +0,0 @@ ---- -layout: website-normal -title: Troubleshooting Server Connectivity Issues in the Cloud -toc: /guide/toc.json ---- - -A common problem when setting up an application in the cloud is getting the basic connectivity right - how -do I get my service (e.g. a TCP host:port) publicly accessible over the internet? - -This varies a lot - e.g. Is the VM public or in a private network? Is the service only accessible through -a load balancer? Should the service be globally reachable or only to a particular CIDR? - -This guide gives some general tips for debugging connectivity issues, which are applicable to a -range of different service types. Choose those that are appropriate for your use-case. - -## VM reachable -If the VM is supposed to be accessible directly (e.g. from the public internet, or if in a private network -then from a jump host)... - -### ping -Can you `ping` the VM from the machine you are trying to reach it from? - -However, ping is over ICMP. If the VM is unreachable, it could be that the firewall forbids ICMP but still -lets TCP traffic through. - -### telnet to TCP port -You can check if a given TCP port is reachable and listening using `telnet <host> <port>`, such as -`telnet www.google.com 80`, which gives output like: - -``` - Trying 31.55.163.219... - Connected to www.google.com. - Escape character is '^]'. -``` - -If this is very slow to respond, it can be caused by a firewall blocking access. If it is fast, it could -be that the server is just not listening on that port. 
- -### DNS and routing -If using a hostname rather than IP, then is it resolving to a sensible IP? - -Is the route to the server sensible? (e.g. one can hit problems with proxy servers in a corporate -network, or ISPs returning a default result for unknown hosts). - -The following commands can be useful: - -* `host` is a DNS lookup utility. e.g. `host www.google.com`. -* `dig` stands for "domain information groper". e.g. `dig www.google.com`. -* `traceroute` prints the route that packets take to a network host. e.g. `traceroute www.google.com`. - -## Service is listening - -### Service responds -Try connecting to the service from the VM itself. For example, `curl http://localhost:8080` for a -web-service. - -On dev/test VMs, don't be afraid to install the utilities you need such as `curl`, `telnet`, `nc`, -etc. Cloud VMs often have a very cut-down set of packages installed. For example, execute -`sudo apt-get update; sudo apt-get install -y curl` or `sudo yum install -y curl`. - -### Listening on port -Check that the service is listening on the port, and on the correct NIC(s). - -Execute `netstat -antp` (or on OS X `netstat -antp TCP`) to list the TCP ports in use (or use -`-anup` for UDP). You should expect to see the something like the output below for a service. - -``` -Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name -tcp 0 0 :::8080 :::* LISTEN 8276/java -``` - -In this case a Java process with pid 8276 is listening on port 8080. The local address `:::8080` -format means all NICs (in IPv6 address format). You may also see `0.0.0.0:8080` for IPv4 format. -If it says 127.0.0.1:8080 then your service will most likely not be reachable externally. - -Use `ip addr show` (or the obsolete `ifconfig -a`) to see the network interfaces on your server. - -For `netstat`, run with `sudo` to see the pid for all listed ports. - -## Firewalls -On Linux, check if `iptables` is preventing the remote connection. On Windows, check the Windows Firewall. 
- -If it is acceptable (e.g. it is not a server in production), try turning off the firewall temporarily, -and testing connectivity again. Remember to re-enable it afterwards! On CentOS, this is `sudo service -iptables stop`. On Ubuntu, use `sudo ufw disable`. On Windows, press the Windows key and type 'Windows -Firewall with Advanced Security' to open the firewall tools, then click 'Windows Firewall Properties' -and set the firewall state to 'Off' in the Domain, Public and Private profiles. - -If you cannot temporarily turn off the firewall, then look carefully at the firewall settings. For -example, execute `sudo iptables -n --list` and `iptables -t nat -n --list`. - -## Cloud firewalls -Some clouds offer a firewall service, where ports need to be explicitly listed to be reachable. - -For example, [security groups for EC2-classic] -(http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#ec2-classic-security-groups) -have rules for the protocols and ports to be reachable from specific CIDRs. - -Check these settings via the cloud provider's web-console (or API). - -## Quick test of a listener port -It can be useful to start listening on a given port, and to then check if that port is reachable. -This is useful for testing basic connectivity when your service is not yet running, or to a -different port to compare behaviour, or to compare with another VM in the network. - -The `nc` netcat tool is useful for this. For example, `nc -l 0.0.0.0 8080` will listen on port -TCP 8080 on all network interfaces. On another server, you can then run `echo hello from client -| nc <hostname> 8080`. If all works well, this will send "hello from client" over the TCP port 8080, -which will be written out by the `nc -l` process before exiting. - -Similarly for UDP, you use `-lU`. - -You may first have to install `nc`, e.g. with `sudo yum install -y nc` or `sudo apt-get install netcat`. 
- -### Cloud load balancers -For some use-cases, it is good practice to use the load balancer service offered by the cloud provider -(e.g. [ELB in AWS](http://aws.amazon.com/elasticloadbalancing/) or the [Cloudstack Load Balancer] -(http://docs.cloudstack.apache.org/projects/cloudstack-installation/en/latest/network_setup.html#management-server-load-balancing)) - -The VMs can all be isolated within a private network, with access only through the load balancer service. - -Debugging techniques here include ensuring connectivity from another jump server within the private -network, and careful checking of the load-balancer configuration from the Cloud Provider's web-console. - -### DNAT -Use of DNAT is appropriate for some use-cases, where a particular port on a particular VM is to be -made available. - -Debugging connectivity issues here is similar to the steps for a cloud load balancer. Ensure -connectivity from another jump server within the private network. Carefully check the NAT rules from -the Cloud Provider's web-console. - -### Guest wifi -It is common for guest wifi to restrict access to only specific ports (e.g. 80 and 443, restricting -ssh over port 22 etc). - -Normally your best bet is then to abandon the guest wifi (e.g. to tether to a mobile phone instead). - -There are some unconventional workarounds such as [configuring sshd to listen on port 80 so you can -use an ssh tunnel](http://askubuntu.com/questions/107173/is-it-possible-to-ssh-through-port-80). -However, the firewall may well inspect traffic so sending non-http traffic over port 80 may still fail. 
- - http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/ops/index.md ---------------------------------------------------------------------- diff --git a/docs/guide/ops/index.md b/docs/guide/ops/index.md index dae3071..1cb28aa 100644 --- a/docs/guide/ops/index.md +++ b/docs/guide/ops/index.md @@ -11,6 +11,7 @@ children: - high-availability.md - catalog/ - logging.md +- troubleshooting/ --- {% include list-children.html %} \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/ops/troubleshooting/images/failed-task-large.png ---------------------------------------------------------------------- diff --git a/docs/guide/ops/troubleshooting/images/failed-task-large.png b/docs/guide/ops/troubleshooting/images/failed-task-large.png new file mode 100644 index 0000000..1c264c4 Binary files /dev/null and b/docs/guide/ops/troubleshooting/images/failed-task-large.png differ http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/ops/troubleshooting/images/jmx-sensors-large.png ---------------------------------------------------------------------- diff --git a/docs/guide/ops/troubleshooting/images/jmx-sensors-large.png b/docs/guide/ops/troubleshooting/images/jmx-sensors-large.png new file mode 100644 index 0000000..d9322c6 Binary files /dev/null and b/docs/guide/ops/troubleshooting/images/jmx-sensors-large.png differ http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/ops/troubleshooting/index.md ---------------------------------------------------------------------- diff --git a/docs/guide/ops/troubleshooting/index.md b/docs/guide/ops/troubleshooting/index.md new file mode 100644 index 0000000..ca0b8a9 --- /dev/null +++ b/docs/guide/ops/troubleshooting/index.md @@ -0,0 +1,11 @@ +--- +title: Troubleshooting +layout: website-normal +children: +- troubleshooting-runtime-errors.md +- troubleshooting-deployment.md +- 
troubleshooting-softwareprocess.md +- troubleshooting-connectivity.md +--- + +{% include list-children.html %} \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/ops/troubleshooting/troubleshooting-connectivity.md ---------------------------------------------------------------------- diff --git a/docs/guide/ops/troubleshooting/troubleshooting-connectivity.md b/docs/guide/ops/troubleshooting/troubleshooting-connectivity.md new file mode 100644 index 0000000..07874c0 --- /dev/null +++ b/docs/guide/ops/troubleshooting/troubleshooting-connectivity.md @@ -0,0 +1,143 @@ +--- +layout: website-normal +title: Troubleshooting Server Connectivity Issues in the Cloud +toc: /guide/toc.json +--- + +A common problem when setting up an application in the cloud is getting the basic connectivity right - how +do I get my service (e.g. a TCP host:port) publicly accessible over the internet? + +This varies a lot - e.g. Is the VM public or in a private network? Is the service only accessible through +a load balancer? Should the service be globally reachable or only to a particular CIDR? + +This guide gives some general tips for debugging connectivity issues, which are applicable to a +range of different service types. Choose those that are appropriate for your use-case. + +## VM reachable +If the VM is supposed to be accessible directly (e.g. from the public internet, or if in a private network +then from a jump host)... + +### ping +Can you `ping` the VM from the machine you are trying to reach it from? + +However, ping is over ICMP. If the VM is unreachable, it could be that the firewall forbids ICMP but still +lets TCP traffic through. + +### telnet to TCP port +You can check if a given TCP port is reachable and listening using `telnet <host> <port>`, such as +`telnet www.google.com 80`, which gives output like: + +``` + Trying 31.55.163.219... + Connected to www.google.com. + Escape character is '^]'. 
+```
+
+If this is very slow to respond, it can be caused by a firewall blocking access. If it is fast, it could
+be that the server is just not listening on that port.
+
+### DNS and routing
+If using a hostname rather than IP, then is it resolving to a sensible IP?
+
+Is the route to the server sensible? (e.g. one can hit problems with proxy servers in a corporate
+network, or ISPs returning a default result for unknown hosts).
+
+The following commands can be useful:
+
+* `host` is a DNS lookup utility. e.g. `host www.google.com`.
+* `dig` stands for "domain information groper". e.g. `dig www.google.com`.
+* `traceroute` prints the route that packets take to a network host. e.g. `traceroute www.google.com`.
+
+## Service is listening
+
+### Service responds
+Try connecting to the service from the VM itself. For example, `curl http://localhost:8080` for a
+web-service.
+
+On dev/test VMs, don't be afraid to install the utilities you need such as `curl`, `telnet`, `nc`,
+etc. Cloud VMs often have a very cut-down set of packages installed. For example, execute
+`sudo apt-get update; sudo apt-get install -y curl` or `sudo yum install -y curl`.
+
+### Listening on port
+Check that the service is listening on the port, and on the correct NIC(s).
+
+Execute `netstat -antp` (or on OS X `netstat -antp TCP`) to list the TCP ports in use (or use
+`-anup` for UDP). You should expect to see something like the output below for a service.
+
+```
+Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
+tcp        0      0 :::8080                     :::*                        LISTEN      8276/java
+```
+
+In this case a Java process with pid 8276 is listening on port 8080. The local address `:::8080`
+format means all NICs (in IPv6 address format). You may also see `0.0.0.0:8080` for IPv4 format.
+If it says 127.0.0.1:8080 then your service will most likely not be reachable externally.
+
+Use `ip addr show` (or the obsolete `ifconfig -a`) to see the network interfaces on your server.
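As a minimal sketch of the "Listening on port" check, the shell function below classifies a value from the `Local Address` column of `netstat` output. The function name `classify_listen_address` and its three labels are hypothetical, introduced only for illustration; it assumes the Linux-style `address:port` format shown in the sample output, and other platforms may print addresses differently.

```shell
# Hypothetical helper (not part of Brooklyn): classify a netstat "Local Address"
# value such as ":::8080", "0.0.0.0:8080" or "127.0.0.1:8080".
classify_listen_address() {
  # Strip the trailing ":port" so that only the address part remains.
  addr="${1%:*}"
  case "$addr" in
    "::"|"0.0.0.0") echo "all-interfaces" ;;      # reachable on every NIC
    "127."*|"::1")  echo "loopback-only" ;;       # most likely not reachable externally
    *)              echo "specific-interface" ;;  # bound to one particular address
  esac
}

# Example use (assumption: Linux net-tools netstat, where column 4 is the local address):
#   sudo netstat -antp | awk '$6 == "LISTEN" { print $4 }' |
#     while read -r a; do echo "$a -> $(classify_listen_address "$a")"; done
```

A result of `loopback-only` for the service port corresponds to the `127.0.0.1:8080` case described above.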
+ +For `netstat`, run with `sudo` to see the pid for all listed ports. + +## Firewalls +On Linux, check if `iptables` is preventing the remote connection. On Windows, check the Windows Firewall. + +If it is acceptable (e.g. it is not a server in production), try turning off the firewall temporarily, +and testing connectivity again. Remember to re-enable it afterwards! On CentOS, this is `sudo service +iptables stop`. On Ubuntu, use `sudo ufw disable`. On Windows, press the Windows key and type 'Windows +Firewall with Advanced Security' to open the firewall tools, then click 'Windows Firewall Properties' +and set the firewall state to 'Off' in the Domain, Public and Private profiles. + +If you cannot temporarily turn off the firewall, then look carefully at the firewall settings. For +example, execute `sudo iptables -n --list` and `iptables -t nat -n --list`. + +## Cloud firewalls +Some clouds offer a firewall service, where ports need to be explicitly listed to be reachable. + +For example, [security groups for EC2-classic] +(http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#ec2-classic-security-groups) +have rules for the protocols and ports to be reachable from specific CIDRs. + +Check these settings via the cloud provider's web-console (or API). + +## Quick test of a listener port +It can be useful to start listening on a given port, and to then check if that port is reachable. +This is useful for testing basic connectivity when your service is not yet running, or to a +different port to compare behaviour, or to compare with another VM in the network. + +The `nc` netcat tool is useful for this. For example, `nc -l 0.0.0.0 8080` will listen on port +TCP 8080 on all network interfaces. On another server, you can then run `echo hello from client +| nc <hostname> 8080`. If all works well, this will send "hello from client" over the TCP port 8080, +which will be written out by the `nc -l` process before exiting. 
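When `nc` is not available on the client side, a rough equivalent of the client half of this test can be sketched with bash's `/dev/tcp` redirection. This is an illustration under assumptions, not something Brooklyn provides: the function name `probe_tcp_port` and the 5-second limit are hypothetical, and it relies on `bash` built with network redirections and on a `timeout` command that exits with status 124 on expiry (as GNU coreutils `timeout` does).

```shell
# Hypothetical helper (not part of Brooklyn): probe a TCP port from the client side.
# Usage: probe_tcp_port <host> <port>
probe_tcp_port() {
  # Open (and immediately drop) a TCP connection via bash's /dev/tcp.
  # Host and port are passed as $0 and $1 of the inner shell to avoid quoting problems.
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
  case $? in
    0)   echo "open" ;;       # something accepted the connection
    124) echo "timed-out" ;;  # slow failure: often a firewall silently dropping packets
    *)   echo "closed" ;;     # fast failure: typically nothing listening (or host unresolved)
  esac
}
```

The `timed-out` and `closed` results mirror the slow and fast failure cases described for `telnet` earlier in this guide.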
+ +Similarly for UDP, you use `-lU`. + +You may first have to install `nc`, e.g. with `sudo yum install -y nc` or `sudo apt-get install netcat`. + +### Cloud load balancers +For some use-cases, it is good practice to use the load balancer service offered by the cloud provider +(e.g. [ELB in AWS](http://aws.amazon.com/elasticloadbalancing/) or the [Cloudstack Load Balancer] +(http://docs.cloudstack.apache.org/projects/cloudstack-installation/en/latest/network_setup.html#management-server-load-balancing)) + +The VMs can all be isolated within a private network, with access only through the load balancer service. + +Debugging techniques here include ensuring connectivity from another jump server within the private +network, and careful checking of the load-balancer configuration from the Cloud Provider's web-console. + +### DNAT +Use of DNAT is appropriate for some use-cases, where a particular port on a particular VM is to be +made available. + +Debugging connectivity issues here is similar to the steps for a cloud load balancer. Ensure +connectivity from another jump server within the private network. Carefully check the NAT rules from +the Cloud Provider's web-console. + +### Guest wifi +It is common for guest wifi to restrict access to only specific ports (e.g. 80 and 443, restricting +ssh over port 22 etc). + +Normally your best bet is then to abandon the guest wifi (e.g. to tether to a mobile phone instead). + +There are some unconventional workarounds such as [configuring sshd to listen on port 80 so you can +use an ssh tunnel](http://askubuntu.com/questions/107173/is-it-possible-to-ssh-through-port-80). +However, the firewall may well inspect traffic so sending non-http traffic over port 80 may still fail. 
+ + http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/ops/troubleshooting/troubleshooting-deployment.md ---------------------------------------------------------------------- diff --git a/docs/guide/ops/troubleshooting/troubleshooting-deployment.md b/docs/guide/ops/troubleshooting/troubleshooting-deployment.md new file mode 100644 index 0000000..c343762 --- /dev/null +++ b/docs/guide/ops/troubleshooting/troubleshooting-deployment.md @@ -0,0 +1,88 @@ +--- +layout: website-normal +title: Troubleshooting Deployment +toc: /guide/toc.json +--- + +This guide describes common problems encountered when deploying applications. + + +## YAML deployment errors + +The error `Invalid YAML: Plan not in acceptable format: Cannot convert ...` means that the text is not +valid YAML. Common reasons include that the indentation is incorrect, or that there are non-matching +brackets. + +The error `Unrecognized application blueprint format: no services defined` means that the `services:` +section is missing. + +An error like `Deployment plan item io.brooklyn.camp.spi.pdp.Service@23c159e2[name=<null>,description=<null>,serviceType=com.acme.Foo,characteristics=[],customAttributes={}] cannot be matched` means that the given entity type (in this case com.acme.Foo) is not in the catalog or on the classpath. + +An error like `Illegal parameter for 'location' (aws-ec3); not resolvable: java.util.NoSuchElementException: Unknown location 'aws-ec3': either this location is not recognised or there is a problem with location resolver configuration` means that the given location (in this case aws-ec3) +was unknown. This means it does not match any of the named locations in brooklyn.properties, nor any of the +clouds enabled in the jclouds support, nor any of the locations added dynamically through the catalog API. + + +## VM Provisioning Failures + +There are many stages at which VM provisioning can fail! 
An error `Failure running task provisioning`
+means there was some problem obtaining or connecting to the machine.
+
+An error like `... Not authorized to access cloud ...` usually means the wrong identity/credential was used.
+
+An error like `Unable to match required VM template constraints` means that a matching image (e.g. AMI in AWS terminology) could not be found. This
+could be because an incorrect explicit image id was supplied, or because the match-criteria could not
+be satisfied using the given images available in the given cloud. The first time this error is
+encountered, a listing of all images in that cloud/region will be written to the debug log.
+
+Failure to form an ssh connection to the newly provisioned VM can be reported in several different ways,
+depending on the nature of the error. This breaks down into failures at different points:
+
+* Failure to reach the ssh port (e.g. `... could not connect to any ip address port 22 on node ...`).
+* Failure to do the very initial ssh login (e.g. `... Exhausted available authentication methods ...`).
+* Failure to ssh using the newly created user.
+
+There are many possible reasons for this ssh failure, which include:
+
+* The VM was "dead on arrival" (DOA) - sometimes a cloud will return an unusable VM. One can work around
+  this using the `machineCreateAttempts` configuration option, to automatically retry with a new VM.
+* Local network restrictions. On some guest wifis, external access to port 22 is forbidden.
+  Check by manually trying to reach port 22 on a different machine that you have access to.
+* NAT rules not set up correctly. On some clouds that have only private IPs, Brooklyn can automatically
+  create NAT rules to provide access to port 22. If this NAT rule creation fails for some reason,
+  then Brooklyn will not be able to reach the VM. If NAT rules are being created for your cloud, then
+  check the logs for warnings or errors about the NAT rule creation.
+* ssh credentials incorrectly configured. The Brooklyn configuration is very flexible in how ssh + credentials can be configured. However, if a more advanced configuration is used incorrectly (e.g. + the wrong login user, or invalid ssh keys) then this will fail. +* Wrong login user. The initial login user to use when first logging into the new VM is inferred from + the metadata provided by the cloud provider about that image. This can sometimes be incomplete, so + the wrong user may be used. This can be explicitly set using the `loginUser` configuration option. + An example of this is with some Ubuntu VMs, where the "ubuntu" user should be used. However, on some clouds + it defaults to trying to ssh as "root". +* Bad choice of user. By default, Brooklyn will create a user with the same name as the user running the + Brooklyn process; the choice of user name is configurable. If this user already exists on the machine, + then the user setup will not behave as expected. Subsequent attempts to ssh using this user could then fail. +* Custom credentials on the VM. Most clouds will automatically set the ssh login details (e.g. in AWS using + the key-pair, or in CloudStack by auto-generating a password). However, with some custom images the VM + will have hard-coded credentials that must be used. If Brooklyn's configuration does not match that, + then it will fail. +* Guest customisation by the cloud. On some clouds (e.g. vCloud Air), the VM can be configured to do + guest customisation immediately after the VM starts. This can include changing the root password. + If Brooklyn is not configured with the expected changed password, then the VM provisioning may fail + (depending if Brooklyn connects before or after the password is changed!). + +A very useful debug configuration is to set `destroyOnFailure` to false. This will allow ssh failures to +be more easily investigated. 
+ + +## Timeout Waiting For Service-Up + +A common generic error message is that there was a timeout waiting for service-up. + +This just means that the entity did not get to service-up in the pre-defined time period (the default is +two minutes, and can be configured using the `start.timeout` config key; the timer begins after the +start tasks are completed). + +See the guide on [runtime errors](troubleshooting-runtime-errors.html) for where to find additional information, especially the section on +"Entity's Error Status". http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/ops/troubleshooting/troubleshooting-runtime-errors.md ---------------------------------------------------------------------- diff --git a/docs/guide/ops/troubleshooting/troubleshooting-runtime-errors.md b/docs/guide/ops/troubleshooting/troubleshooting-runtime-errors.md new file mode 100644 index 0000000..8b657fc --- /dev/null +++ b/docs/guide/ops/troubleshooting/troubleshooting-runtime-errors.md @@ -0,0 +1,116 @@ +--- +layout: website-normal +title: Troubleshooting Runtime Errors +toc: /guide/toc.json +--- + +This guide describes sources of information for runtime errors. + +Whether you're customizing out-of-the-box blueprints, or developing your own custom blueprints, you will +inevitably have to deal with entity failure. Thankfully Brooklyn provides plenty of information to help +you locate and resolve any issues you may encounter. + + +## Web-console Runtime Error Information + +### Entity Hierarchy + +The Brooklyn web-console includes a tree view of the entities within an application. Errors within the +application are represented visually, showing a "fire" image on the entity. + +When an error causes an entire application to be unexpectedly down, the error is generally propagated to the +top-level entity - i.e. marking it as "on fire". 
To find the underlying error, one should expand the entity
+hierarchy tree to find the specific entities that have actually failed.
+
+
+### Entity's Error Status
+
+Many entities have some common sensors (i.e. attributes) that give details of the error status:
+
+* `service.isUp` (often referred to as "service up") is a boolean, saying whether the service is up. For many
+  software processes, this is inferred from whether the "service.notUp.indicators" is empty. It is also
+  possible for some entities to set this attribute directly.
+* `service.notUp.indicators` is a map of errors. This often gives much more information than the single
+  `service.isUp` attribute. For example, there may be many health-check indicators for a component:
+  is the root URL reachable, is the management API reporting healthy, is the process running, etc.
+* `service.problems` is a map of namespaced indicators of problems with a service.
+* `service.state` is the actual state of the service - e.g. CREATED, STARTING, RUNNING, STOPPING, STOPPED,
+  DESTROYED and ON_FIRE.
+* `service.state.expected` indicates the state the service is expected to be in (and when it transitioned to that).
+  For example, is the service expected to be starting, running, stopping, etc.
+
+These sensor values are shown in the "sensors" tab - see below.
+
+
+### Sensors View
+
+The "Sensors" tab in the Brooklyn web-console shows the attribute values of a particular entity.
+This gives lots of runtime information, including about the health of the entity - the
+set of attributes will vary between different entity types.
+
+[](images/jmx-sensors-large.png)
+
+Note that null (or not set) sensors are hidden by default. You can click on the `Show/hide empty records`
+icon (highlighted in yellow above) to see these sensors as well.
+
+The sensors view is also tabulated. You can configure the number of sensors shown per page
+(at the bottom). There is also a search bar (at the top) to filter the sensors shown.
+ + +### Activity View + +The activity view shows the tasks executed by a given entity. The top-level tasks are the effectors +(i.e. operations) invoked on that entity. This view allows one to drill into the task, to +see details of errors. + +Select the entity, and then click on the `Activities` tab. + +In the table showing the tasks, each row is a link - clicking on the row will drill into the details of that task, +including sub-tasks: + +[](images/failed-task-large.png) + +For ssh tasks, this allows one to drill down to see the env, stdin, stdout and stderr. That is, you can see the +commands executed (stdin) and environment variables (env), and the output from executing that (stdout and stderr). + +For tasks that did not fail, one can still drill into the tasks to see what was done. + +It's always worth looking at the Detailed Status section as sometimes that will give you the information you need. +For example, it can show the exception stack trace in the thread that was executing the task that failed. + + +## Log Files + +Brooklyn's logging is configurable, for the files created, the logging levels, etc. +See [Logging docs](/guide/ops/logging.html). + +With out-of-the-box logging, `brooklyn.info.log` and `brooklyn.debug.log` files are created. These are by default +rolling log files: when the log reaches a given size, it is compressed and a new log file is started. +Therefore check the timestamps of the log files to ensure you are looking in the correct file for the +time of your error. + +With out-of-the-box logging, info, warnings and errors are written to the `brooklyn.info.log` file. This gives +a summary of the important actions and errors. However, it does not contain full stacktraces for errors. + +To find the exception, we'll need to look in Brooklyn's debug log file. By default, the debug log file +is named `brooklyn.debug.log`. You can use your favourite tools for viewing large text files. + +One possible tool is `less`, e.g. 
`less brooklyn.debug.log`. We can quickly find the last exception +by navigating to the end of the log file (using `Shift-G`), then performing a reverse-lookup by typing `?Exception` +and pressing `Enter`. Sometimes an error results in multiple exceptions being logged (e.g. first for the +entity, then for the cluster, then for the app). If you know the text of the error message (e.g. copy-pasted +from the Activities view of the web-console) then one can search explicitly for that text. + +The `grep` command is also extremely helpful. Useful things to grep for include: + +* The entity id (see the "summary" tab of the entity in the web-console for the id). +* The entity type name (if there are only a small number of entities of that type). +* The VM IP address. +* A particular error message (e.g. copy-pasted from the Activities view of the web-console). +* The word WARN etc, such as `grep -E "WARN|ERROR" brooklyn.info.log`. + +Grep'ing for particular log messages is also useful. Some examples are shown below: + +* INFO: "Started application", "Stopping application" and "Stopped application" +* INFO: "Creating VM " +* DEBUG: "Finished VM " http://git-wip-us.apache.org/repos/asf/incubator-brooklyn/blob/3d29c8b5/docs/guide/ops/troubleshooting/troubleshooting-softwareprocess.md ---------------------------------------------------------------------- diff --git a/docs/guide/ops/troubleshooting/troubleshooting-softwareprocess.md b/docs/guide/ops/troubleshooting/troubleshooting-softwareprocess.md new file mode 100644 index 0000000..a09f902 --- /dev/null +++ b/docs/guide/ops/troubleshooting/troubleshooting-softwareprocess.md @@ -0,0 +1,50 @@ +--- +layout: website-normal +title: Troubleshooting SoftwareProcess Entities +toc: /guide/toc.json +--- + +The [guide for troubleshooting runtime errors](troubleshooting-runtime-errors.html) in Brooklyn gives +information for how to find more information about errors. 
+
+If that doesn't give enough information to diagnose, fix or work around the problem, then it may be necessary
+to log in to the machine to investigate further. This guide applies to entities that are types
+of "SoftwareProcess" in Brooklyn, or that follow those conventions.
+
+
+## VM connection details
+
+The ssh connection details for an entity are published to a sensor `host.sshAddress`. The login
+credentials will depend on the Brooklyn configuration. The default is to use the `~/.ssh/id_rsa`
+or `~/.ssh/id_dsa` on the Brooklyn host (uploading the associated `~/.ssh/id_rsa.pub` to the machine's
+`authorized_keys`). However, this can be overridden (e.g. with specific passwords etc) in the
+location's configuration.
+
+For Windows, there is a similar sensor with the name `host.winrmAddress`. (TODO sensor for password?)
+
+
+## Install and Run Directories
+
+For ssh-based software processes, the install directory and the run directory are published as sensors
+`install.dir` and `run.dir` respectively.
+
+For some entities, files are unpacked into the install dir; configuration files are written to the
+run dir along with log files. For some other entities, these directories may be mostly empty -
+e.g. if installing RPMs, and that software writes its logs to a different standard location.
+
+Most entities have a sensor `log.location`. It is generally worth checking this, along with other files
+in the run directory (such as console output).
+
+
+## Process and OS Health
+
+It is worth checking that the process is running, e.g. using `ps aux` to look for the desired process.
+Some entities also write the pid of the process to `pid.txt` in the run directory.
+
+It is also worth checking if the required port is accessible. This is discussed in the guide
+"Troubleshooting Server Connectivity Issues in the Cloud", including listing the ports in use:
+execute `netstat -antp` (or on OS X `netstat -antp TCP`) to list the TCP ports in use (or use
+`-anup` for UDP).
+
+It is also worth checking the disk space on the server, e.g. using `df -m`, to check that there
+is sufficient space on each of the required partitions.
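The disk-space check can be scripted along the following lines. This is a sketch under stated assumptions rather than anything Brooklyn provides: the function name `low_space_mounts` and the 500 MB threshold in the usage comment are hypothetical, the `-P` flag is added here (it is not in the guide) to get the stable POSIX column layout, and the parsing assumes mount points contain no spaces.

```shell
# Hypothetical helper (not part of Brooklyn): read `df -Pm` output on stdin and
# print each mount point whose available space (column 4, in MB) is below the
# threshold given as the first argument.
# Usage: df -Pm | low_space_mounts 500
low_space_mounts() {
  # NR > 1 skips the header row; "+ 0" forces a numeric comparison.
  awk -v min="$1" 'NR > 1 && ($4 + 0) < (min + 0) { print $6 }'
}
```

Reading from stdin keeps the function easy to try against saved `df` output from a remote machine as well as against the local system.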
