> That only works if you have the bandwidth to actually get the data to the
> validator. The problem at the moment in SETI is the last mile of internet
> connection. There are several possible solutions, but having the upload
> servers elsewhere does not really help that much. The uploaded data still
> has to go through that last mile.
>
> jm7
I'd forgotten this message when I woke up this morning and posted what I believe to be a possible interim solution, specifically for SETI:

http://setiathome.berkeley.edu/forum_thread.php?id=54631#918399

My idea: put a stripped-down upload server at the head end of the 1 Gbit Hurricane Electric link - co-location at PAIX, Campus, or wherever. That server would have the minimum possible BOINC functionality - basically, just the CGI upload handler. It would perform one function only: handle the million-plus upload connections per day, and accept and store the files. Periodically, it would zip the files into an archive and make ONE file available to SSL - push or pull, your choice. The data gets up the hill, but the million-plus connections (or ten-million-plus connection attempts) don't.

The figures: we reckon that you'd get 10,000 upload files, zipped to about 45 megabytes, every seven or eight minutes. That averages (and please check all these figures) to about 1 megabit/sec - 45 megabytes every ~450 seconds is roughly 0.8 Mbit/s. It might even be possible to negotiate with Campus to utilise their network path, and avoid the Hurricane tunnel entirely.

I would anticipate:

- Create a folder for uploaded files
- Accept data until the five-minute / 10,000-file limit is reached
- Create a new folder, and rotate incoming files to it
- Once all connections to the first folder are complete or timed out, zip the folder and signal availability
- On confirmation of transfer and receipt, delete the folder and the zip
- Rinse and repeat

John raised four issues on the message boards, which I'll summarise (John, feel free to amplify if you think I've misrepresented you).

1) SETI data is nearly incompressible - zipping won't help

True, but that applies to the raw data files sent FROM Berkeley TO volunteers. In SETI's case, the return data consists of small text/XML files, which do compress. But John's point means that my suggestion can't be generalised to all BOINC projects - those with larger upload files, probably compressed already (like Einstein and CPDN), wouldn't benefit.
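The folder-rotation cycle I anticipate above can be sketched as a small cron-style script. This is only a minimal illustration of the idea, not BOINC code: the function name, paths, and grace period are my own assumptions, and the "signal availability to SSL" step is left as a comment.

```python
import os
import time
import zipfile

def rotate_and_zip(incoming: str, grace_seconds: int = 60) -> str:
    """Seal the current upload folder, zip it, and return the archive path.

    Meant to run every few minutes, or when the 10,000-file limit is hit.
    All names here are illustrative assumptions, not actual BOINC components.
    """
    # Rotate: rename the live folder aside and recreate it, so the CGI
    # upload handler keeps accepting new files without interruption.
    batch = f"{incoming}.{int(time.time())}"
    os.rename(incoming, batch)          # atomic on the same filesystem
    os.makedirs(incoming)

    # Let in-flight connections to the old folder complete or time out.
    time.sleep(grace_seconds)

    # Zip the sealed batch into ONE archive for transfer up the hill.
    archive = batch + ".zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(batch)):
            zf.write(os.path.join(batch, name), arcname=name)

    # Signal availability to SSL here (push or pull, your choice); only on
    # confirmed receipt would the batch folder and the zip be deleted.
    return archive
```

Run from cron every five minutes, that gives the rinse-and-repeat cycle; the delete-on-confirmation step deliberately stays out of band, so nothing is lost if a transfer fails.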
2) Even unzipping the archive on receipt requires scarce CPU power

Someone at Berkeley will have to do the maths, but I think offloading those million connections to a different server should release some spare CPU cycles. And is unzip a particularly costly process?

3) The zip file still has to get up the hill

And could suffer packet loss. But is packet loss/retry more or less costly than connection loss/resend? Maybe Lynn can help with that one.

4) Reports are asynchronous and can occur at any time after the file is uploaded

This is the tricky one, and it would require one minor BOINC server change. Strictly speaking, it doesn't matter whether reporting is asynchronous or synchronous: the critical path in the current server process is that the file must be uploaded before the validator runs. That is enforced by two separate sequential rules:

- The validator runs after the result is reported (enforced by the server)
- The result is reported after the file is uploaded (enforced by the client)

But if we could relax the critical path, the upload/report sequence becomes asynchronous. And we can relax it simply by saying that the validator outcome "file not present" transitions to backoff/retry, instead of immediate failure.

To summarise -

Advantages
------------
- Relatively simple server requirements for the 'data concentrator' - just CGI, and some filesystem-level cron scripting
- Much cheaper than $80,000 for 'fibre up the hill'
- Quicker to implement than more esoteric suggestions - I don't think there's anything above that's more complicated than the staff regularly achieve in their sleep!
- Scalable - multiple concentrators could be set up, on different continents if desired
- Reversible - just switch the upload DNS to point back to Bruno, and it'll work as before

Disadvantages
---------------
- Another server to buy/scrounge, configure and manage
- At a remote location
- Requires a change to validator logic, plus safeties to scavenge tasks which go into infinite backoff
- Adds latency to the report/validate cycle, hence an increase in temporary storage/database use
- Delayed user gratification (well, for half the users, anyway - the half who report second)

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and (near bottom of page) enter your email address.
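P.S. For issue 4, the validator change amounts to treating a missing upload file as "not yet" rather than "never". A hypothetical sketch of that one decision, with invented outcome names and backoff constants (the real BOINC validator is C++ and database-driven; this only illustrates the state logic, including the scavenging safety):

```python
import os

# Invented outcome labels - not actual BOINC validator codes.
RUN_VALIDATION = "run_validation"   # file present: validate as today
BACKOFF_RETRY = "backoff_retry"     # file still in transit up the hill
PERMANENT_FAIL = "permanent_fail"   # scavenged after too many retries

INITIAL_BACKOFF_S = 300             # assumed: first retry after 5 minutes
MAX_BACKOFF_S = 4 * 3600            # assumed safety cap against infinite backoff

def classify_reported_result(upload_path: str, backoff_s: int):
    """Return (outcome, next_backoff_seconds) for a reported result.

    The one change from the current rule: "file not present" becomes a
    backoff/retry transition instead of an immediate failure.
    """
    if os.path.exists(upload_path):
        return RUN_VALIDATION, 0
    if backoff_s >= MAX_BACKOFF_S:
        # The 'safeties to scavenge tasks which go into infinite backoff'.
        return PERMANENT_FAIL, 0
    # File not yet concentrated up the hill: wait and retry, doubling each time.
    next_backoff = INITIAL_BACKOFF_S if backoff_s == 0 else backoff_s * 2
    return BACKOFF_RETRY, next_backoff
```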
