On Fri, Mar 7, 2008 at 7:41 AM, Jon Blower <[EMAIL PROTECTED]> wrote:
[...]
>  Thanks very much to everyone for very helpful responses on this.
>  Perhaps I should go into a bit more detail about our application.  We
>  are writing an application for climate scientists that allows them to
>  run climate simulation codes on remote compute clusters.  The codes
>  produce large amounts of data (100s of gigabytes as a typical example)
>  and we want the client to be able to download the output files from
>  the cluster as the simulation progresses (so that the user can monitor
>  what's going on and also reduce the disk footprint on the remote
>  cluster).  The size of each file is of the order of gigabytes.
>
>  A client will typically be downloading tens of output files
>  simultaneously, maybe more.  We do not expect more than a handful of
>  users to be connected to our server at any one time.  Nevertheless we
>  don't want to spawn a new thread for each file that is downloaded (we
>  could end up with hundreds of threads), which is essentially what we
>  are forced to do in our current servlet-based implementation.  Another
>  disadvantage of our current system is that if we exhaust the thread
>  pool, new clients won't get any data at all until a thread is
>  released.  I would rather have every client see a slow trickle than
>  have a single client monopolise the server.
>
>  There will be minimal re-use of files (if all goes well a given file
>  will be downloaded exactly once) so caching won't help unfortunately.
>  We do have control over the clients generally but part of the point of
>  our design is that people can use their browser to download files if
>  they wish so we can't assume that this is always true.
>
>  We can't simply use a straight web server (e.g. Apache) for this
>  because there is some other logic that goes along with the downloading
>  of files.  For example, the files are generally append-only which
>  means that we can start the process of downloading an output file
>  before the file is completely written by the simulation code on the
>  cluster.  The logic on the server side detects when a file is finished
>  and hence we can control when the client sees EOF.  Apart from this
>  there isn't much state associated with the downloading of each file.

Cool.  Let me restate to make sure we're on the same page with the details...

* The simulations run for non-trivial amounts of time (minutes, hours,
days) and generate very large data sets over that period of time.

* The simulation data may come in spurts (i.e., you cannot guarantee a
tight time bound).

* The server will be doing some amount of processing of that data on
its way from the simulation to the client. But this is
stream-oriented processing (rather than big-bang processing of an
entire data set).

* There will be a very small number of clients.

* Each client will be slurping down multiple data sets at any given time.

* Total number of connections between clients and the server will be
on the order of hundreds at any given time.

* Nominally [see the first question below], there's only a single
client sucking down any given data set.

* You do not control the clients. I.e., you can only assume a basic
browser client.  But you can potentially provide additional
features/etc. to people using a client that you do control.

* Graceful degradation of service matters to you.

* You've had a bad experience with a servlet-based solution.

* You can run up against the limits of fully saturating the network
with the data transfers.


Questions:

* Since you ask about resuming downloads... what does that imply about
the back-end storage of the simulation data?  I.e., do the
simulators actually save the data in long-term, persistent storage,
or is it treated as primarily transient (with perhaps a sliding
window of persistence before being purged, or purged only after a
client has acknowledged complete receipt of each data set)?
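(For what it's worth, if the data does survive on disk, resume support for
browser clients mostly reduces to honouring a Range header before you start
streaming. A minimal sketch of the parsing side, handling only the
"bytes=N-" form -- the class and method names are my own, not anything from
your code:)

```java
// Sketch: resuming a download from a byte offset.  A browser that lost its
// connection re-requests with "Range: bytes=N-"; if the simulator output is
// still on disk, you can seek to N and stream from there.
// Only the "bytes=N-" form is handled; suffix and multi-range specs fall
// back to offset 0 (i.e., a full re-download).
public class RangeResume {

    /** Returns the requested start offset, or 0 for a missing/unsupported header. */
    public static long parseRangeStart(String rangeHeader) {
        if (rangeHeader == null || !rangeHeader.startsWith("bytes=")) return 0;
        String spec = rangeHeader.substring("bytes=".length());
        int dash = spec.indexOf('-');
        if (dash <= 0) return 0;  // suffix ranges ("bytes=-500") not handled
        try {
            return Long.parseLong(spec.substring(0, dash));
        } catch (NumberFormatException e) {
            return 0;             // malformed: serve from the beginning
        }
    }
}
```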

* What kind of server hardware are you running?  For the sake of the
rest of this post, I'll assume something "average" these days (2-4GB
RAM, (dual-) dual-core at > 2GHz, plenty of local disk for any
potential transient storage, if necessary).

* Similarly, what OS platform are you running on?  Any modern release
of Linux 2.6 is a no-brainer choice for supporting what you need
here.

* What problems, exactly, did you run into with your servlet solution?
You've mentioned the thread exhaustion problem twice at only a few
hundred threads (which is pathetically low for what you're talking
about needing to do).  I.e., did you do any tuning of the servlet
container to provide for more threads (including reducing the default
stack size)?
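(On the stack-size point: besides the JVM-wide -Xss flag, the stack size can
be requested per-thread through the four-argument Thread constructor. A
minimal sketch -- the 128 KB figure is an assumption for illustration, not a
recommendation; the JVM is free to treat it as a hint:)

```java
// Sketch: spawning download-handler threads with a reduced stack size so
// that ~1,000 threads don't eat the address space.  Defaults are often
// 512KB-1MB per thread; shallow I/O-pumping handlers need far less.
public class SmallStackThreads {
    static final long STACK_SIZE = 128 * 1024; // bytes -- assumed figure, measure your own

    public static Thread spawnHandler(Runnable handler, String name) {
        // Thread(ThreadGroup, Runnable, name, stackSize): the stackSize is a
        // request the JVM may round up or ignore entirely.
        return new Thread(null, handler, name, STACK_SIZE);
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = spawnHandler(
                () -> System.out.println("handler running"), "download-1");
        t.start();
        t.join();
    }
}
```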


Well, since the lowest common denominator is a plain HTTP browser
client, let's set aside some of the fancy tricks for now.

So, my big unknown is why, given that you're talking about only
needing, say, 1000 threads to comfortably support all of your users,
won't a simple thread-based approach work for you (whether or not
you're using servlets, restlets, or whatever).  It sounds like your
server code really only has to be smart enough to deal with the spurts
of data from the simulators, perhaps do some minor amount of
processing on the stream, and send the data to the client without
getting caught by things like connection timeouts and inadvertent
EOFs.  That means that you can easily run >1,000 threads to handle all
of your clients with no problems.  Most of the hassle at this level is
just learning what to tune at the various levels (file descriptors,
ulimits, etc. at the OS level, stack and heap sizes at the JRE level,
number of threads at the container level, etc.).  That's actually
going to be the simplest solution for your use case, AFAICS.
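To make that concrete, the per-download handler thread can be sketched
roughly as below. `DownloadState.isComplete()` stands in for your
server-side "file is finished" detection (which I don't know the details
of), and the 250 ms poll interval is an arbitrary assumption -- the point is
just that one blocking thread per download, tailing the append-only file, is
all the machinery you need:

```java
// Sketch of one download-handler thread: follow an append-only file,
// pushing newly written bytes to the client, and emit EOF to the client
// only when the server-side logic says the simulation is done with it.
import java.io.*;

public class AppendOnlyStreamer {

    public interface DownloadState {
        boolean isComplete();  // hypothetical hook: true once the simulator has finished the file
    }

    /** Streams src to client until logical EOF; returns the byte count sent. */
    public static long stream(File src, OutputStream client, DownloadState state)
            throws IOException, InterruptedException {
        byte[] buf = new byte[64 * 1024];
        long sent = 0;
        RandomAccessFile in = new RandomAccessFile(src, "r");
        try {
            while (true) {
                long avail = in.length() - sent;   // bytes appended since last pass
                if (avail > 0) {
                    in.seek(sent);
                    int n = in.read(buf, 0, (int) Math.min(buf.length, avail));
                    if (n > 0) {
                        client.write(buf, 0, n);   // blocking write: one thread per download
                        sent += n;
                        continue;
                    }
                }
                if (state.isComplete()) break;     // logical EOF, decided server-side
                Thread.sleep(250);                 // data arrives in spurts; just poll
            }
        } finally {
            in.close();
        }
        client.flush();
        return sent;
    }
}
```

The sleep-and-poll is deliberately dumb; it keeps each thread cheap while it
waits out the spurts, and it never hands the client a premature EOF just
because the file stopped growing for a while.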

Hope this helps,
John
