As part of a technical paper we are writing on ganglia, we need to
quantify the measurements and experiences people have had with ganglia.
Anyone who provides these details will receive clear kudos in the
"Acknowledgments"
section of the paper.  Because of the time pressures involved in this
request, I will personally donate six hours of my time to build any custom
ganglia components you may need for your particular application if you
answer all of the following questions before this Sunday.  We know of
ganglia's successful use on large clusters, grids and planetary scale
systems but we need some hard numbers and information about experience.

As part of quantifying Ganglia's scalability and performance overheads,
what we'd like is to have measurements on three real distributed systems
at different points in the architectural design space (clusters, Grids,
planetary-scale systems).  Without a good evaluation and experience
section based on this data, I can see no way of this paper possibly
getting into HPDC.  What we need is the following:

  (1) Measurements of local overhead (i.e., within a node)

      Table 1: per-node overheads for three actual systems
      (Note that these are data points in scaling graphs in (3) below)

      Each row will correspond to a particular example of a different
      class of distributed system (cluster, grid, planetary-scale system)

      For each system, we need: CPU, physical and virtual memory
      footprints, and I/O overhead.

      For each system, we need its precise size and configuration
      (# nodes, node config, network config, etc.).
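If it saves anyone time, here's a rough sketch of how such a per-node
footprint could be sampled on Linux from /proc.  The pid of your local
gmond is an assumption on my part; any monitoring daemon works the same
way:

```python
import os

def proc_footprint(pid):
    """Return (vsize_kb, rss_kb, cpu_ticks) for one process.

    vsize_kb  - virtual memory footprint (VmSize)
    rss_kb    - physical memory footprint (VmRSS)
    cpu_ticks - user + system CPU time in clock ticks
    """
    status = {}
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            key, _, value = line.partition(":")
            status[key] = value.strip()
    vsize_kb = int(status["VmSize"].split()[0])
    rss_kb = int(status["VmRSS"].split()[0])
    # Fields 14 and 15 of /proc/[pid]/stat are utime and stime.
    # (Naive split; fine as long as the process name has no spaces.)
    with open("/proc/%d/stat" % pid) as f:
        fields = f.read().split()
    cpu_ticks = int(fields[13]) + int(fields[14])
    return vsize_kb, rss_kb, cpu_ticks

if __name__ == "__main__":
    # Demo on our own pid; point this at gmond's pid on a real node.
    print(proc_footprint(os.getpid()))
```

Sample it before and after enabling gmond and diff the numbers; I/O
overhead you'd pull from /proc/[pid]/io or iostat separately.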
      
  (2) Measurements of "global" overhead (i.e., between nodes)

      Table 2: between-node overheads for the same three systems
      (Note that these are data points in scaling graphs in (3) below)

      Each row will correspond to a particular example of a different
      class of distributed system (cluster, grid, planetary-scale system)

      For each system, we need: local-area BW consumed and wide-area
      BW consumed (the latter obviously doesn't apply to the
      single-cluster case)
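A sketch of one way to pull these BW numbers out of a tcpdump capture.
This assumes gmond's default UDP port 8649 and tcpdump's usual one-line
summary format ("IP a.b.c.d.8649 > w.x.y.z.8649: UDP, length N"); adjust
the port if your deployment differs:

```python
import re

# Matches the UDP payload length on any line involving port 8649.
GMOND_LINE = re.compile(r"\.8649[: ].*length (\d+)")

def gmond_bytes(tcpdump_lines):
    """Total gmond UDP payload bytes seen in a tcpdump capture."""
    return sum(int(m.group(1))
               for m in (GMOND_LINE.search(l) for l in tcpdump_lines)
               if m)

def bandwidth_kbps(total_bytes, seconds):
    """Average bandwidth in kilobits/sec over the capture interval."""
    return total_bytes * 8 / 1000.0 / seconds
```

Run the capture once on a LAN interface and once at the site's wide-area
egress to fill in both columns of the table.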

  (3) Measurements of local and "global" overhead as #nodes/#sites scales

      For the single cluster case, we need:

         Figures 1,2,3,4 (might be combined, although that could get
         too busy): CPU, physical and virtual memory footprints, and
         I/O overhead as a function of number of nodes.

         Figure 5: local-area BW consumed as a function of number of
         nodes.

      For the Grid and planetary-scale systems cases, we would like to
      have:

         Figure 1: local-area BW consumed as a function of number of
         nodes/sites.
         Figure 2: wide-area BW consumed as a function of number of
         nodes/sites.
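While generating these scaling graphs, one sanity check we've found
useful is a plain least-squares fit, to see whether BW really grows
linearly with node count or has a worse-than-linear component.  The
sample (nodes, KB/s) pairs below are made up; substitute your own
measurements:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = a*x + b; returns (a, b)."""
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Hypothetical data points: local-area KB/s at each cluster size.
nodes = [16, 32, 64, 128]
lan_kbps = [8.0, 16.0, 32.0, 64.0]
slope, intercept = linear_fit(nodes, lan_kbps)
```

A large positive residual at the big end of the graph is exactly the
kind of thing the paper's evaluation section should call out.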

For the experience section, what we need is the following for each of
the three systems:

   (1) Top three things that didn't work so well with Ganglia.

       Note this is not Ganglia bashing.  These are lessons learned
       from practical deployment and usage on real distributed
       systems.  It's clear Ganglia has evolved from its initial
       design point of monitoring within a single cluster to now
       support Grids and even planetary-scale systems like PlanetLab.
       However, it wasn't originally designed for that.  Hence, it's
       expected that we will need to revisit some original design
       issues and trade-offs made.

   (2) Top three things that worked well with Ganglia.

   (3) Top three major changes / feedback sent back in response 
       to (1) and what specifically was done about it. 

       Example: Ganglia's assumption of abundant wide-area network
       bandwidth between sites was challenged in PlanetLab. 1 GB per
       week per site across 42 sites is expensive over the public
       Internet. A PlanetLab site in the UK, for example, estimated
       that BW alone would cost them US$10,000 per year, and almost
       20% of that was due to Ganglia. In response to these BW issues,
       we've since added compression using zlib, which has resulted in
       approximately a 10x reduction in wide-area BW consumed.
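For anyone curious why the win is that large, here's a minimal sketch.
The <METRIC> fragment below is made up, but Ganglia's real XML stream is
similarly repetitive, which is why deflate does so well on it:

```python
import zlib

# A made-up gmond-style metric record, repeated per host as the
# real XML stream roughly is.
record = ('<METRIC NAME="load_one" VAL="0.12" TYPE="float" '
          'UNITS="" TN="12" TMAX="70" DMAX="0"/>\n')
hosts = ("<HOST NAME='node%03d'>" % i + record * 30 + "</HOST>\n"
         for i in range(100))
raw = "".join(hosts).encode("ascii")

packed = zlib.compress(raw)
ratio = len(raw) / float(len(packed))  # compression ratio on this sample
```

On this synthetic sample the ratio comes out well above 10x; real
captures vary with how many distinct metric values are in flight.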

Thanks in advance for your time!
-matt

