Hi Kai,

Kai Backman wrote:
Hi everyone,

We are working on designing a build cluster for the OpenOffice
BuildBot. Our goal is to have a farm of machines that community
contributors can use to quickly do (distributed) builds of OpenOffice.

I've heard that Sun Release Engineering has a build cluster in use.
We would love to know more about the cluster to help us design the
BuildBot one.

So here is a bunch of questions:
- How many machines are there in the cluster?

About 16 machines, with 2 to 8 processors each, plus a number of special-purpose machines (for example, old machines for building very old code lines), monitoring machines, test clients, etc.

- What hardware/OS are they running?

Fileserver: Sun-Fire 890 with 4 double core 1500 MHz Ultra-Sparc IV
            processors, 16 GBytes RAM, Solaris 10

Storage:    2 x Sun 3511 storage arrays with a total raw capacity
            of 10.8 TBytes, plus some older storage arrays

Build clients (nodes):
Solaris Sparc: Sun-Fire 880 with 8 x 900 MHz Sparc processors,
               16 GBytes RAM,
               Solaris 8
Solaris Intel: 2 Sun v60x (2 x 3.06 GHz Xeons)
               Solaris 9
Linux:         2 Sun v60x (2 x 3.06 GHz Xeons),
               2 dual-processor machines (2 x 2.8 GHz Xeons)
               SuSE 7.3
Windows:       6 Sun v60x (2 x 3.06 GHz Xeons),
               2 Sun v20z (2 x 1.6 GHz Opterons),
               Windows XP

We do product and non-product builds for most milestones and most platforms. The OS versions reflect the baseline for our builds; we need to use old versions of the OS to guarantee a broad set of suitable target platforms. The high number of Windows clients reflects the high pain of doing Windows builds :-).

- How does the network infrastructure work? What is the design and capacity?

Mixed Gigabit/100 MBit network, nothing special. The build clients and the file server are just a normal part of our network.

- How is the shared disk space handled? What type of server/software
are you using?

See above ... we currently use Sun QFS and plan to migrate to ZFS (included in Solaris 10 update 2). The shares are exported via NFSv4 and Samba. We prefer to use NFS on the Windows clients, too, because NFS yields better performance than Samba for our kind of load.

- How do you monitor the cluster? What loads (disk, CPU, network) are
you measuring? How do you measure them?

With our custom distribution software and standard tools. The load on the fileserver is low, the Sun-Fire 890 has ample power.

- What is the bottleneck? Is the cluster CPU, disk, RAM or network bound?

Building on Solaris and Linux is CPU bound (and quite fast); building on Windows is network bound and relatively slow. Tasks like copying back build milestones are, obviously, disk bound.

- How does the task distribution software work?

Every build machine hosts 4-16 so-called "build clients" (the number depends on processors, RAM, etc.), which are a kind of daemon. Each daemon accepts a job from the "build master"; a job consists of building one directory by spawning dmake and returning the results. The "build master" maintains the queues, determines which directories can be built now, distributes the jobs, and accepts the results. The current build can be viewed and controlled via a nice GUI from either the "build master" or a "build slave". "Build slaves" are subordinate copies of the "build master", so that several release engineers can watch the build concurrently.
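The scheduling logic of such a master can be sketched as follows. This is only a toy illustration (the real build master and its protocol are not public): directory names and the `num_clients` parameter are made up, and "sending a job to a client" is simulated by popping it straight from the in-flight set.

```python
import collections

def run_build(deps, num_clients):
    """Simulate a build master dispatching directory builds to clients.

    deps maps each directory to the set of directories it depends on.
    Returns the order in which directories finished building.
    """
    remaining = {d: set(p) for d, p in deps.items()}
    # directories whose prerequisites are already satisfied
    ready = collections.deque(d for d, p in remaining.items() if not p)
    in_flight = set()   # jobs currently handed out to clients (simulated)
    finished = []

    while ready or in_flight:
        # hand out jobs while idle clients are available
        while ready and len(in_flight) < num_clients:
            in_flight.add(ready.popleft())
        # simulate one in-flight job returning its result
        done = in_flight.pop()
        finished.append(done)
        # unlock directories whose prerequisites are now all built
        for d, prereqs in remaining.items():
            if done in prereqs:
                prereqs.discard(done)
                if not prereqs:
                    ready.append(d)
    return finished
```

Any order this sketch produces respects the dependency graph; the real master additionally tracks per-platform queues and client capabilities.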

- How does the cluster handle nodes dying during a build?

The job will be redistributed to another client if no response arrives within a certain time frame.
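This timeout-and-redistribute behavior can be illustrated with a small tracker. Again a hedged sketch, not the actual implementation: the class name, the job/client identifiers, and the fixed deadline model are all assumptions.

```python
import heapq
import time

class JobTracker:
    """Flag jobs for redistribution when a client fails to answer in time."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.deadlines = []      # (deadline, job) min-heap
        self.outstanding = {}    # job -> client it was dispatched to

    def dispatch(self, job, client, now=None):
        now = time.monotonic() if now is None else now
        self.outstanding[job] = client
        heapq.heappush(self.deadlines, (now + self.timeout, job))

    def complete(self, job):
        # a result arrived in time; forget the job
        self.outstanding.pop(job, None)

    def expired(self, now=None):
        """Return jobs whose deadline passed without a result."""
        now = time.monotonic() if now is None else now
        late = []
        while self.deadlines and self.deadlines[0][0] <= now:
            _, job = heapq.heappop(self.deadlines)
            if job in self.outstanding:    # still unanswered -> redistribute
                del self.outstanding[job]
                late.append(job)
        return late
```

The master would periodically call `expired()` and push the returned jobs back onto the ready queue for another client.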

- How many nodes can the build be parallelized on? 50? 100?

A typical number of nodes is 68 for 7 platforms (4 product builds and 3 non-product builds), which are built concurrently. The system is able to accommodate more nodes if more build machines are added.

- What is the utilization of the cluster? I.e., how much parallelism are
you able to extract from the build? 75%? 80%?

At the beginning of a build the parallelism is rather limited because the prerequisites need to be built first. Later the parallelism is pretty good; most of the time all clients (nodes) are doing something. When creating the package sets (a significant part of the total build time) the parallelism is perfect.
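The early-build restriction follows directly from the dependency graph: the achievable parallelism in each "wave" is the number of directories whose prerequisites are all done. A toy calculation (module names invented for illustration) makes this concrete:

```python
def parallelism_profile(deps):
    """Return the number of concurrently buildable directories per wave.

    deps maps each directory to the set of directories it depends on.
    """
    remaining = {d: set(p) for d, p in deps.items()}
    waves = []
    while remaining:
        # everything whose prerequisites are all built can start now
        wave = [d for d, p in remaining.items() if not p]
        if not wave:
            raise ValueError("dependency cycle")
        waves.append(len(wave))
        for d in wave:
            del remaining[d]
        for p in remaining.values():
            p.difference_update(wave)
    return waves
```

A deep chain at the start of the graph yields small early waves (few busy clients), while a wide fan-out later keeps every node occupied.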

How does the parallelism
scale? What is the optimum number of machines for a build?

Hard to tell; the above-mentioned number of clients is not yet enough to "saturate" the system. Please note that we do 7 different builds in parallel, and of course the Solaris/Linux/Windows product and non-product builds can share their clients: if there is nothing in the queue for a Linux product build, the client will happily work on Linux non-product builds.


Is there anything I'm not asking about that I should?

Thanks for the answers and helping out with this!

Hope this helps,
  Heiner

--
Jens-Heiner Rechtien
[EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
