Hello all. I'm looking for some advice on whether Coda would be
appropriate for my situation. I have read the FAQ and docs, but it
seems that most people are using Coda for a different application and
all the docs have that in mind. It may be that nobody uses Coda the way
I'm thinking of because it's a stupid idea, so that's why I'm asking!
At present we have a small server farm which processes web server logs
for a large number of websites (in the tens of thousands). The way this
works is we have seven processing servers which do nothing but parse
logs and generate reports. The reports, once created, are saved forever
to a storage server (via NFS) which is directly attached to a big
(~1.9TB) array.
There are two problems here. First, if the storage server goes offline
for whatever reason (like if NFS decides to flake out for a few
seconds), all the processing servers hang on the mount and have to be
power cycled. This is Very Bad. And second, the processing servers all
need to fetch the actual logs and store them on a local filesystem
(because NFS is far too slow). These logs can get very large, so if a
bunch of requests arrive all at once - like when the monthly reports are
automatically generated - the servers tend to run out of disk space and die.
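(For what it's worth, the hang-until-power-cycle behavior is what hard
NFS mounts do by design: the client blocks retrying until the server
comes back. As a stopgap while we evaluate alternatives, a soft,
interruptible mount at least lets hung processes be killed. A
hypothetical /etc/fstab line - the server name and timeouts here are
just an illustration, not our actual config - would look like:

```
# soft: fail I/O after the retry budget instead of blocking forever
# intr: allow signals to interrupt blocked NFS operations
# timeo=30 is in tenths of a second, so ~3s per try, retrans=3 tries
storage:/reports  /mnt/reports  nfs  soft,intr,timeo=30,retrans=3  0  0
```

Note that soft mounts can silently fail writes that time out, so this
trades one failure mode for another; it's a band-aid, not a fix.)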
I am thinking that Coda might be able to provide a solution to these
problems - a better one than the one we are using now, anyway, which is
"throw more hideously expensive servers at the problem and hope it goes
away."
My rough-sketch thought process is that, naturally, the storage server
would still provide all the disk space. Each Coda client (the
processing servers) would make a cache about the size of the partition
it's using now for its temporary files (anywhere from 50-140GB).
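If I understand the client setup right, the cache size is set in
venus.conf; something like the following (the exact option name and
units should be double-checked against the Coda docs - I believe
cacheblocks is counted in 1KB blocks, so ~100GB would be on the order
of 100 million blocks, and whether venus actually copes with a cache
that large is exactly the kind of thing I'm asking about):

```
# /etc/coda/venus.conf (hypothetical values)
# cacheblocks is (I believe) in 1KB blocks; ~100GB cache:
cacheblocks=100000000
```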
The first problem would be solved, or at least mitigated, because the
Coda clients could keep doing their thing in disconnected mode if the
storage server crapped out for a while. Most of the time the processing
servers run at about 40% disk capacity, so that should leave enough
space for at least an hour or two of disconnected operation. If the
array itself fails, we're pretty screwed anyway, but at least we could
keep processing requests for the time it takes to reinitialize the thing.
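(The "hour or two" figure is just back-of-the-envelope; here is the
arithmetic, where the 30GB/hour write rate is a made-up placeholder
and not a measured number:

```python
def headroom_hours(cache_gb, used_fraction, write_gb_per_hour):
    """Rough hours of disconnected operation before the client cache fills."""
    free_gb = cache_gb * (1.0 - used_fraction)
    return free_gb / write_gb_per_hour

# e.g. a 100GB cache at 40% full, writing 30GB/hour of temporary files
print(headroom_hours(100, 0.40, 30))  # 2.0
```

If the real write rate during the monthly crunch is much higher, the
window shrinks accordingly.)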
The second problem would be, again, solved or at least mitigated,
because the Coda clients would have "emergency backup" storage. The
temporary files would be written to /coda and land in the cache. Since
those files only live for a few minutes and are never requested by other
clients in the farm, they should never need to go over the network
unless the cache fills up. If the cache does fill up, which it will at
least once a month, performance will degrade (significantly) but at
least the requests will still be processed. Once the backlog gets
handled (takes about a day) the caches will clear out and everything
will go back to normal. If the storage array fills up to the point
where we're running out of disk space again, we can add another (much
smaller) one just for this temporary storage. This way we could avoid
dropping huge sums on a quad-CPU box, which is what we're doing now,
just to add storage capacity.
Now after that lengthy story and rationale, my first question is
obvious: Is my reasoning correct? Is this something Coda can do, even
though most people aren't using it quite like I want to?
Second question should also be pretty obvious and it's a FAQ: Is Coda
reliable enough to be used for this? I know the FAQ says "no" but our
current solution is already terribly unreliable (we lose a ton of
reports every month due to the disk space problems alone; I have no hard
figures but I'd estimate up to an eighth of the monthly requests are
lost, forcing our customers to request their reports manually a day or
week later). As long as Coda can more-or-less guarantee that the
archived reports on the storage array won't get trashed, we can live
with some flakiness on the client side. If the major reliability concern
is having to restart Coda processes occasionally, we can do that.
Third question is related to the first two: If Coda is not, for whatever
reason, appropriate for this, is there something similar which is? I've
looked at Lustre, and probably will continue looking at it, but it seems
geared towards much larger clusters than ours. It may also be much less
of a "drop-in" replacement than Coda, which is by now a standard part of
the free Unix-like OSes.
Thanks in advance for any tips or pointers, and to anyone who actually
read this whole thing, I admire your stamina!