On Fri, Mar 7, 2008 at 7:41 AM, Jon Blower <[EMAIL PROTECTED]> wrote:
[...]
> Thanks very much to everyone for very helpful responses on this.
> Perhaps I should go into a bit more detail about our application. We
> are writing an application for climate scientists that allows them to
> run climate simulation codes on remote compute clusters. The codes
> produce large amounts of data (100s of gigabytes as a typical example)
> and we want the client to be able to download the output files from
> the cluster as the simulation progresses (so that the user can monitor
> what's going on and also reduce the disk footprint on the remote
> cluster). The size of each file is of the order of gigabytes.
>
> A client will typically be downloading tens of output files
> simultaneously, maybe more. We do not expect more than a handful of
> users to be connected to our server at any one time. Nevertheless we
> don't want to spawn a new thread for each file that is downloaded (we
> could end up with hundreds of threads), which is essentially what we
> are forced to do in our current servlet-based implementation. Another
> disadvantage of our current system is that if we exhaust the thread
> pool, new clients won't get any data at all until a thread is
> released. I would rather have every client see a slow trickle than
> have a single client monopolise the server.
>
> There will be minimal re-use of files (if all goes well a given file
> will be downloaded exactly once) so caching won't help unfortunately.
> We do have control over the clients generally but part of the point of
> our design is that people can use their browser to download files if
> they wish so we can't assume that this is always true.
>
> We can't simply use a straight web server (e.g. Apache) for this
> because there is some other logic that goes along with the downloading
> of files.
> For example, the files are generally append-only which
> means that we can start the process of downloading an output file
> before the file is completely written by the simulation code on the
> cluster. The logic on the server side detects when a file is finished
> and hence we can control when the client sees EOF. Apart from this
> there isn't much state associated with the downloading of each file.
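That "server decides when the client sees EOF" pattern can be sketched as a tailing copy loop. This is only an illustrative sketch, not the poster's actual code; the class and parameter names (`GrowingFileStreamer`, the `finished` flag) are assumptions, as is the 250 ms poll interval:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.concurrent.atomic.AtomicBoolean;

public class GrowingFileStreamer {
    /**
     * Copies bytes from a file that may still be growing. Returns only
     * when {@code finished} is set AND all bytes written so far have been
     * sent -- i.e., the server logic, not the filesystem, decides when
     * the client sees EOF.
     */
    public static void stream(File src, OutputStream out, AtomicBoolean finished)
            throws IOException, InterruptedException {
        byte[] buf = new byte[64 * 1024];
        try (InputStream in = new FileInputStream(src)) {
            while (true) {
                int n = in.read(buf);
                if (n > 0) {
                    out.write(buf, 0, n);   // forward new data immediately
                    continue;
                }
                // n == -1: we have caught up with the simulation's writer.
                if (finished.get()) {
                    // Drain anything appended just before the flag flipped.
                    n = in.read(buf);
                    if (n <= 0) {
                        break;              // simulation done: real EOF
                    }
                    out.write(buf, 0, n);
                } else {
                    Thread.sleep(250);      // data comes in spurts; poll again
                }
            }
        }
        out.flush();
    }
}
```

One thread running this loop per in-progress file matches the thread-per-download model discussed below; the sleep keeps an idle download cheap while the simulator is between spurts.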
Cool. Let me restate to make sure we're on the same page with the details...

* The simulations run for non-trivial amounts of time (minutes, hours, days) and generate very large data sets over that period of time.
* The simulation data may come in spurts (i.e., you cannot guarantee a tight time bound).
* The server will be doing some amount of processing of that data on its way from the simulation to the client. But this is stream-oriented processing (rather than big-bang processing of an entire data set).
* There will be a very small number of clients.
* Each client will be slurping down multiple data sets at any given time.
* The total number of connections between clients and the server will be on the order of hundreds at any given time.
* Nominally [see the first question below], there's only a single client sucking down any given data set.
* You do not control the clients; i.e., you can only assume a basic browser client. But you can potentially provide additional features/etc. to people using a client that you do control.
* Graceful degradation of service matters to you.
* You've had a bad experience with a servlet-based solution.
* You can run up against the limits of fully saturating the network with the data transfers.

Questions:

* Since you ask about resuming downloads... what does that imply about the back-end storage of the simulation data? I.e., do the simulators actually save the data in long-term, persistent storage, or is it treated as primarily transient data (with perhaps some sliding window of persistence before being purged, or purged only after a client has acknowledged complete receipt of each data set)?
* What kind of server hardware are you running? For the sake of the rest of this post, I'll assume something "average" these days (2-4GB RAM, (dual-) dual-core at > 2GHz, plenty of local disk for any potential transient storage, if necessary).
* Similarly, what OS platform are you running on?
Any modern release of Linux v2.6 is a simple no-brainer to support what you need for this.

* What problems, exactly, did you run into with your servlet solution? You've mentioned the thread-exhaustion problem twice at only a few hundred threads (which is pathetically low for what you're talking about needing to do). I.e., did you do any tuning of the servlet container to provide for more threads (including reducing the default stack size)?

Well, since the lowest common denominator is a plain HTTP browser client, let's set aside some of the fancy tricks for now.

So, my big unknown is this: given that you're talking about only needing, say, 1000 threads to comfortably support all of your users, why won't a simple thread-based approach work for you (whether you're using servlets, restlets, or whatever)? It sounds like your server code really only has to be smart enough to deal with the spurts of data from the simulators, perhaps do some minor amount of processing on the stream, and send the data to the client without getting caught by things like connection timeouts and inadvertent EOFs.

That means that you can easily run >1,000 threads to handle all of your clients with no problems. Most of the hassle at this level is just learning what to tune at the various layers (file descriptors, ulimits, etc. at the OS level; stack and heap sizes at the JRE level; number of threads at the container level; etc.). That's actually going to be the simplest solution for your use case, AFAICS.

Hope this helps,
John
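The stack-size tuning mentioned above can be sketched with the four-argument `Thread` constructor, which lets you request a smaller-than-default per-thread stack (the JVM treats the value as a hint). This is only a sketch; the pool size, thread names, and the 256KB figure are assumptions for illustration, not recommendations from the thread:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

public class DownloadThreads {
    // Default thread stacks are often 512KB-1MB. A loop that just shuttles
    // buffers between a file and a socket needs far less, so shrinking the
    // stack lets >1,000 threads fit comfortably in a modest heap + native
    // memory budget. 256KB is an assumed value; measure before relying on it.
    private static final long STACK_SIZE = 256 * 1024; // bytes

    public static ExecutorService newDownloadPool(int nThreads) {
        ThreadFactory smallStack = new ThreadFactory() {
            private int count = 0;
            public synchronized Thread newThread(Runnable r) {
                // Thread(ThreadGroup, Runnable, String, long stackSize):
                // the stackSize argument requests a reduced stack per worker.
                return new Thread(null, r, "download-" + (count++), STACK_SIZE);
            }
        };
        return Executors.newFixedThreadPool(nThreads, smallStack);
    }
}
```

Alternatively, `-Xss` on the JRE command line lowers the default stack size for every thread at once. Either way, the OS-level limits (`ulimit -n` for file descriptors, max processes/threads per user) still need raising to match.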

