Hi John (et al),

Thanks very much to everyone for the helpful responses on this.
Perhaps I should go into a bit more detail about our application.  We
are writing an application for climate scientists that allows them to
run climate simulation codes on remote compute clusters.  The codes
produce large amounts of data (100s of gigabytes as a typical example)
and we want the client to be able to download the output files from
the cluster as the simulation progresses (so that the user can monitor
what's going on and also reduce the disk footprint on the remote
cluster).  The size of each file is of the order of gigabytes.

A client will typically be downloading tens of output files
simultaneously, maybe more.  We do not expect more than a handful of
users to be connected to our server at any one time.  Nevertheless we
don't want to spawn a new thread for each file that is downloaded (we
could end up with hundreds of threads), which is essentially what we
are forced to do in our current servlet-based implementation.  Another
disadvantage of our current system is that if we exhaust the thread
pool, new clients won't get any data at all until a thread is
released.  I would rather have every client see a slow trickle than
have a single client monopolise the server.

There will be minimal re-use of files (if all goes well a given file
will be downloaded exactly once) so caching won't help unfortunately.
We generally have control over the clients, but part of the point of
our design is that people can use their browser to download files if
they wish, so we can't assume that this is always true.

We can't simply use a straight web server (e.g. Apache) for this
because there is some other logic that goes along with the downloading
of files.  For example, the files are generally append-only which
means that we can start the process of downloading an output file
before the file is completely written by the simulation code on the
cluster.  The logic on the server side detects when a file is finished
and hence we can control when the client sees EOF.  Apart from this
there isn't much state associated with the downloading of each file.

I'm thinking of implementing this by writing some simple wrapping code
around an NIO FileChannel object that defers to the underlying channel
for most operations, with the wrapping code controlling the detection
of EOF.
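To make that concrete, here is a minimal sketch of what I have in mind.
The class name, the markFinished() method, and the "finished" flag are
all placeholders of my own invention: the flag stands in for whatever
server-side logic actually detects that the simulation has closed the
file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;

// Sketch: a channel over a file that is still being appended to.
// Reports EOF (-1) only once the writer has been declared finished.
public class GrowingFileChannel implements ReadableByteChannel {
    private final FileChannel delegate;
    private long position = 0;
    private volatile boolean finished = false; // set externally when the writer is done

    public GrowingFileChannel(String path) throws IOException {
        this.delegate = new RandomAccessFile(path, "r").getChannel();
    }

    /** Called by the monitoring logic once the simulation has finished writing. */
    public void markFinished() {
        finished = true;
    }

    public int read(ByteBuffer dst) throws IOException {
        // Positional read, so it is safe even while the file grows underneath us.
        int n = delegate.read(dst, position);
        if (n > 0) {
            position += n;
            return n;
        }
        // No new bytes yet: report EOF only if the writer has declared the
        // file complete; otherwise return 0 ("nothing now, try again later").
        return finished ? -1 : 0;
    }

    public boolean isOpen() {
        return delegate.isOpen();
    }

    public void close() throws IOException {
        delegate.close();
    }
}
```

A non-blocking connector could poll such a channel without tying up a
thread per transfer: a read of 0 just means "no data yet", not EOF.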

A related question: can I support HTTP range headers in Restlet?  If
so then we can support resumable downloads and also HTTP download
accelerators that open multiple streams and download different blocks
of data (the latter would of course increase the number of concurrent
connections).
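For reference, the exchange we would need to support looks like the
following (the path and byte values are purely illustrative; note that
HTTP allows "*" in Content-Range when the total length is not yet
known, which fits our still-growing files):

```
GET /output/run42/temperature.nc HTTP/1.1
Range: bytes=1048576-2097151

HTTP/1.1 206 Partial Content
Content-Range: bytes 1048576-2097151/*
Content-Length: 1048576
```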

Thanks, Jon

On Fri, Mar 7, 2008 at 2:07 PM, John D. Mitchell <[EMAIL PROTECTED]> wrote:
> On Thu, Mar 6, 2008 at 7:14 AM, Jon Blower <[EMAIL PROTECTED]> wrote:
>  [...]
>
> >  We have an existing RESTful web application that involves clients
> >  downloading multiple streams of data simultaneously.  Our current
> >  implementation is based on servlets and we are experiencing
> >  scalability problems with the number of threads involved in serving
> >  multiple large data streams simultaneously.  I recently came across
> >  Restlet and was attracted by the potential to use NIO under the hood
> >  to enable more scalable large file transfers.
>
>  Cool.
>
>
>  >  In our case we are not necessarily serving large files that already
>  >  exist on disk: we are essentially creating the files ourselves on the
>  >  fly (so they are of unknown length when the file transfer starts).  I
>  >  was wondering if anyone could offer advice on how to support the
>  >  serving of such data streams through Restlet in a scalable manner
>  >  (ideally without creating a new thread on the server for each file
>  >  transfer)?
>
>  What do you mean by "large files"?  I.e., are you talking about generating
>  content that is merely large relative to a web page (i.e., measured in
>  megabytes) or are you talking about something like complete hi-def
>  video (GBs in size) or something both large and nominally endless like
>  live video streams?
>
>  For the first case, if they are small enough I'd start by just fully
>  rendering the contents to a Representation as usual and profile how
>  well you can use the existing Jetty connector (with tuning, etc.).  As
>  you add more simultaneous clients, add more servers.  Also, run your
>  experiments with the new Grizzly connector and track that as it and
>  v1.1+ stabilizes.
>
>  For the second case (or where you have content sizes in the first case
>  but lots of slow clients), I'd actually have that part of my origin
>  servers either be fronted by a reverse-caching-proxy (e.g., squid) or
>  generate and dump the contents from the origin server into a local
>  file and redirect the client to get that content from e.g., lighttpd
>  (+mod_secdownload).  Depending on the nature of your client
>  applications, the potential reuse of the generated content, etc. you
>  can tune how you clean up the caches.
>
>  For the last case, if I controlled the clients then I'd probably have
>  the clients request good-sized chunks of the data in a loop and
>  devolve to the appropriate combination of the first two approaches. Of
>  course, that's more or less presuming that you can generate those
>  chunks more or less independently (i.e., with minimal state
>  information needed to keep the continuity from chunk to chunk).  If
>  you have heavy amounts of state and/or if you don't control the
>  clients then I'd want to know a good bit more before making any
>  recommendation.
>
>  Hope this helps,
>  John
>



-- 
--------------------------------------------------------------
Dr Jon Blower              Tel: +44 118 378 5213 (direct line)
Technical Director         Tel: +44 118 378 8741 (ESSC)
Reading e-Science Centre   Fax: +44 118 378 6413
ESSC                       Email: [EMAIL PROTECTED]
University of Reading
3 Earley Gate
Reading RG6 6AL, UK
--------------------------------------------------------------
