Re: [openstack-dev] [new][cloudpulse] Announcing a project to HealthCheck OpenStack deployments

David Kranz Wed, 13 May 2015 08:34:28 -0700

On 05/13/2015 09:51 AM, Simon Pasquier wrote:

On Wed, May 13, 2015 at 3:27 PM, David Kranz <dkr...@redhat.com> wrote:


    On 05/13/2015 09:06 AM, Simon Pasquier wrote:

    Hello,

    Like many others who commented before, I don't quite understand how
    unique the CloudPulse use cases are.

    For operators, I got the feeling that existing solutions fit well:
    - Traditional monitoring tools (Nagios, Zabbix, ...) are
    necessary anyway for infrastructure monitoring (CPU, RAM, disks,
    operating system, RabbitMQ, databases and more) and diagnostic
    purposes. Adding OpenStack service checks is fairly easy if you
    already have the toolchain.

    Is it really so easy? RabbitMQ has an "aliveness" test that is
    easy to hook into. I don't know exactly what it does, other than
    what the doc says, but I should not have to. If I want my standard
    monitoring system to call into a cloud and ask "is nova healthy?",
    "is glance healthy?", etc., are there such calls?

Regarding the RabbitMQ aliveness test, it has its own limits (more on that later, I've got an "interesting" RabbitMQ outage that I'm going to discuss in a new thread) and it doesn't replicate exactly what the clients (e.g. OpenStack services) are doing.

I'm sure it has limits, but my point was that the developers of RabbitMQ understood that it would be difficult for users to know exactly what should be poked at inside to check health, so they provide a call to do it.
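
For readers who have not used it: the aliveness test mentioned above is served by the RabbitMQ management plugin at /api/aliveness-test/<vhost> and answers {"status": "ok"} when a test message can be published and consumed. The following is only a minimal sketch of how a monitoring script might call it, assuming the management plugin is enabled on its default port 15672. The host name and credentials are placeholders, and the function names are made up for this illustration.

```python
import base64
import json
import urllib.parse
import urllib.request


def aliveness_url(host, vhost="/", port=15672):
    """Build the management-API URL for the aliveness test of one vhost."""
    # The vhost is a single path segment, so "/" must be encoded as %2F.
    quoted = urllib.parse.quote(vhost, safe="")
    return "http://%s:%d/api/aliveness-test/%s" % (host, port, quoted)


def is_alive(body):
    """Return True only if the response body is JSON reporting status 'ok'."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False


def check_rabbitmq(host, user, password, vhost="/", timeout=5):
    """Call the aliveness test; connection failures and HTTP error
    statuses are reported as unhealthy rather than raised."""
    request = urllib.request.Request(aliveness_url(host, vhost))
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return is_alive(response.read())
    except OSError:
        return False
```

A Nagios or Zabbix plugin would call something like check_rabbitmq("mq.example.org", "monitor", "secret") and map the boolean to its own OK/CRITICAL states. As noted above, a passing result only shows that the broker can route one test message; it does not exercise what the OpenStack services themselves do.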

Regarding the service checks, there are already plenty of scripts that exist for Nagios, Collectd and so on. Some of them are listed in the Wiki [1].

I understand, and that is what I meant by "after-market". If someone puts a new feature in service X that requires some monitoring to be healthy, then all those different scripts need to chase after it to keep up to date. Poking at service internals to check the health of a service is an abstraction violation. As someone on this thread said, tempest/rally can be used to check a certain kind of health, but it is akin to black-box testing whereas health monitoring should be more akin to white-box testing.

    There are various sets of calls associated with Nagios, Zabbix,
    etc. but those seem like "after-market" parts for a car. Seems to
    me the services themselves would know best how to check if they
    are healthy, particularly as that could change version to version.
    Has there been discussion of adding a health-check (admin) API in
    each service? Lacking that, is there documentation from any
    OpenStack projects about "how to check the health of nova"? When I
    saw this thread start, that is what I thought it was going to be
    about.
Starting with Kilo, you can configure your OpenStack API services with the healthcheck middleware [2]. This was inspired by what Swift has been doing for some time now [3]. IIUC the default healthcheck is minimalist and doesn't check that dependent services (like RabbitMQ, database) are healthy, but the framework is extensible and more healthchecks can be added.
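
To make the pattern concrete, here is a minimal sketch of a healthcheck middleware written against plain WSGI. It is not the oslo.middleware implementation and does not use its API; the class name, the placeholder application and the database check are all hypothetical and exist only for illustration. It shows the two ideas from the paragraph above: the middleware answers a fixed path itself, and extra checks for dependent services can be plugged in.

```python
class HealthcheckMiddleware(object):
    """Answer requests for `path` directly; pass everything else through."""

    def __init__(self, app, path="/healthcheck", checks=None):
        self.app = app
        self.path = path
        # Each check is a callable returning a (healthy, reason) tuple.
        self.checks = list(checks or [])

    def __call__(self, environ, start_response):
        if environ.get("PATH_INFO") != self.path:
            return self.app(environ, start_response)
        # Collect the reason from every failing check.
        failures = []
        for check in self.checks:
            healthy, reason = check()
            if not healthy:
                failures.append(reason)
        if failures:
            status, body = "503 Service Unavailable", "; ".join(failures)
        else:
            status, body = "200 OK", "OK"
        payload = body.encode("utf-8")
        start_response(status, [("Content-Type", "text/plain"),
                                ("Content-Length", str(len(payload)))])
        return [payload]


def placeholder_api(environ, start_response):
    """Stand-in for a real API service (hypothetical)."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"api response"]


def database_check():
    """Hypothetical check that pretends the database is down, to show
    the failure path; a real check would try to open a connection."""
    return False, "database unreachable"
```

With no checks configured, HealthcheckMiddleware(placeholder_api) behaves like the minimalist default described above and returns 200 as long as the process can serve requests. Passing checks=[database_check] makes the same URL return 503 with the failure reason, which is the kind of extension the framework allows.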

I can see that, but the real value would be in abstracting the details of what it means for a service to be healthy inside the implementation and exporting an API. If that were present, the question of whether calling it used middleware or not would be secondary. I'm not sure what the value-add of middleware would be in this case.


 -David


     -David


BR,
Simon

[1] https://wiki.openstack.org/wiki/Operations/Tools#Monitoring_and_Trending
[2] http://docs.openstack.org/developer/oslo.middleware/api.html#oslo_middleware.Healthcheck
[3] http://docs.openstack.org/kilo/config-reference/content/object-storage-healthcheck.html

    - OpenStack projects like Rally or Tempest can generate synthetic
    loads and run end-to-end tests. Integrating them with a
    monitoring system isn't terribly difficult either.

    As far as Monitoring-as-a-service is concerned, do you have plans
    to integrate/leverage Ceilometer?

    BR,
    Simon

    On Tue, May 12, 2015 at 7:20 PM, Vinod Pandarinathan (vpandari)
    <vpand...@cisco.com> wrote:

        Hello,

          I'm pleased to announce the development of a new project
        called CloudPulse.  CloudPulse provides OpenStack
        health-checking services to operators, tenants, and
        applications. This project will begin as
        a StackForge project based upon an empty cookiecutter[1]
        repo. The repos to work in are:
        Server: https://github.com/stackforge/cloudpulse
        Client: https://github.com/stackforge/python-cloudpulseclient

        Please join us via IRC on #openstack-cloudpulse on freenode.

        I am holding a doodle poll to select times for our first
        meeting the week after summit.  This doodle poll will close
        May 24th and meeting times will be announced on the mailing
        list at that time.  At our first IRC meeting,
        we will draft additional core team members, so if you're
        interested in joining a fresh new development effort, please
        attend our first meeting.
        Please take a moment if you're interested in CloudPulse to fill
        out the doodle poll here:

        https://doodle.com/kcpvzy8kfrxe6rvb

        The initial core team is composed of
        Ajay Kalambur,
        Behzad Dastur, Ian Wells, Pradeep chandrasekhar, Steven
        Dake, and Vinod Pandarinathan.
        I expect more members to join during our initial meeting.

         A little bit about CloudPulse:
         Cloud operators need notification of OpenStack failures
        before a customer reports the failure. Cloud operators can
        then take timely corrective actions with minimal disruption
        to applications. Many cloud applications, including
        those I am interested in (NFV) have very stringent service
        level agreements.  Loss of service can trigger contractual
        costs associated with the service.  Application high
        availability requires an operational OpenStack Cloud, and the
        reality is that occasionally OpenStack clouds fail in some
        mysterious ways.  This project intends to identify when those
        failures occur so corrective actions may be taken by operators,
        tenants, and the applications themselves.

        OpenStack is considered healthy when OpenStack API services
        respond appropriately. Further, OpenStack is
        healthy when network traffic can be sent between the tenant
        networks and can reach the Internet.  Finally, OpenStack
        is healthy when all infrastructure cluster elements are in an
        operational state.

        For information about blueprints check out:
        https://blueprints.launchpad.net/cloudpulse
        https://blueprints.launchpad.net/python-cloudpulseclient

        For more details, check out our Wiki:
        https://wiki.openstack.org/wiki/Cloudpulse

        Please join the CloudPulse team in designing and implementing
        a world-class Carrier Grade system for checking
        the health of OpenStack clouds.  We look forward to seeing
        you on IRC on #openstack-cloudpulse.

        Regards,
        Vinod Pandarinathan
        [1] https://github.com/openstack-dev/cookiecutter


        
__________________________________________________________________________
        OpenStack Development Mailing List (not for usage questions)
        Unsubscribe:
        openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
        <http://openstack-dev-requ...@lists.openstack.org?subject:unsubscribe>
        http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



