Christian,

All we need for the VM name is something unique; we could change the
scheme to boinc_slot_x, where x is the slot ID, or something like that.
I originally went with the result name so that volunteers could match
up the VM with the task that was executing.

Do you happen to have any more information about the premature shutdown?
Are you using a time limit for your results?  IIRC, the way we
originally envisioned things is that when vboxwrapper detects that the
VM has shut down without the wrapper itself telling it to shut down, it
assumes the result is complete.

If you are seeing the 'premature' text, it implies that a timeout value
has been specified.
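
In very rough terms the decision looks something like the sketch below.
The names are made up for illustration; this is not the actual
vboxwrapper source:

    #include <cstdio>

    // Illustrative only: hypothetical state the wrapper keeps around.
    static bool wrapper_requested_shutdown = false; // set when the wrapper stops the VM itself
    static double job_duration = 0;                 // optional time limit, 0 = none configured

    void handle_vm_stopped(double elapsed_time) {
        if (wrapper_requested_shutdown) {
            return;  // we asked for the shutdown, nothing to report
        }
        if (job_duration > 0 && elapsed_time < job_duration) {
            // A time limit was configured and has not been reached yet:
            // this is the case reported as a premature shutdown.
            std::fprintf(stderr, "VM shut down prematurely\n");
        } else {
            // No time limit configured (or it was reached): assume the job
            // inside the VM finished on its own and treat the result as done.
            std::fprintf(stderr, "VM job completed, result assumed finished\n");
        }
    }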

----- Rom


-----Original Message-----
From: boinc_dev [mailto:[email protected]] On Behalf Of
Christian Beer
Sent: Tuesday, September 17, 2013 2:21 PM
To: [email protected]; BOINCDev Mailing List
Subject: [boinc_dev] preliminary findings from RNA World VirtualBox
Betatest

Hello,

During the last 7 days I ran a beta test on RNA World with the new
VirtualBox feature of BOINC. These are some preliminary findings that I
would like to present to the BOINC developers and project communities.
For those who find it TL;DR, let me summarize: I present some problems
we ran into using the vboxwrapper, and possible solutions that I want
to discuss at the upcoming BOINC workshop in Grenoble.

First let me distinguish our usage of the vboxwrapper from existing
projects that are using the same technology. Test4Theory is using the
VirtualBox feature to distribute CERN's own VM to volunteer computers,
but is not sending work via BOINC; they use some other work
distribution technology outside of BOINC. I couldn't find out how they
grant credit.
Climate@home is also using the vboxwrapper with a rather large VM and
uploads a temporary result every hour. It seems they grant a fixed
amount of credit per workunit.

The main reason for RNA World to adopt the vboxwrapper was the lack of
native checkpointing in our generic application. Since none of the
other projects seem to use this feature, it is crucial to us that it
works. At first I found that if we run the scientific application,
controlled by a bash script, in the shared/ directory on the host, the
bash script ends with an error when the VM gets restored from a
snapshot. The solution is to move the computation inside the VM and
only copy the output files into the shared/ directory when finished.
This works very well for us.

Another thing I noticed in the beta test is a path length issue on
Windows systems. At the moment the vboxwrapper creates a VM for every
task, using the task name as identifier. As RNA World uses rather long
task names, this soon triggered the 256-character path length limit on
Windows. I worked around this by shortening the task name, but we lose
some information that is useful for volunteers, because they want to
know which bacterium or virus their computer is working on. I don't
consider this a real solution at the moment.

The next thing I noticed during the test was that under some unknown
circumstances the vboxwrapper detects a premature shutdown of the VM
during the normal shutdown at the end of the task. This gets reported
with exit_status=194 and is considered an error. Nevertheless, the
output files are already present and get uploaded to the project
server. For now I manually reset the task's outcome to success, but I
would like not to have to do this in the future. Whatever happens after
the vboxwrapper detects a normal shutdown of the VM should not be
considered an error per se, as the output files may be present and
valid (which is checked later on the server if outcome=success).

The test was focused on 64-bit machines with more than 4 GB of RAM, so
I set memory_size_mb in vbox_job.xml to 4096 and rsc_memory_bound to
4.2e9 in the input template. As it turned out, this was recognized by
the scheduler and only machines with more than 4 GB of RAM got work.
Nevertheless, the client didn't take this into account: I had four RNA
World VMs running at the same time on my Intel Core i5 laptop with only
5 GB of RAM. When all VMs were powering up at the same time my computer
got very laggy (which was also reported by other users in the forums).
In the end three of the VMs errored out with different messages and
only one is still crunching. I can only presume that the lack of memory
is causing this, because I also have a desktop machine with 16 GB of
RAM that shows no errors and no lag. A solution for us would be for the
client not to start any VBox task if there is not enough memory
available. If this is supposed to be the current behaviour, we may have
to run a more extensive test on this issue.
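
To make the idea concrete, here is a rough sketch of the check I would
like the client to make before starting a VBox task. The struct and the
function names are made up for illustration; this is not actual client
code:

    // Hypothetical check before starting a VBox task.
    struct Task {
        double rsc_memory_bound;  // from the workunit template, e.g. 4.2e9 in our case
    };

    // Return true only if the VM's memory claim still fits into the RAM
    // that is actually free, counting VMs the client has already started.
    bool can_start_vbox_task(const Task& task,
                             double free_ram_bytes,
                             double ram_committed_to_running_vms) {
        return task.rsc_memory_bound + ram_committed_to_running_vms
               <= free_ram_bytes;
    }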

I also noticed that credit seems to be calculated using cpu_time rather
than elapsed_time (which is much higher); is this intentional? We only
recently switched to CreditNew and are also granting credit for our
non-VM application, which reports a high cpu_time and thus gets more
credit compared to the VM app. This is not critical at the moment, but
it will be as soon as we go into production with this app.

Another point that I would like to see addressed in the future is the
inability of the science app to communicate with the vboxwrapper. At
the moment only the fraction_done value is read from a file that I
specify in vbox_job.xml, and init_data.xml can be shared with the VM
using a floppy disk image. What I want is to trigger a specific abort
(error code) from within the VM. As our application is generic and we
can't say in advance how much RAM a task needs, we may have to
reschedule the task with a higher rsc_memory_bound and memory_size_mb.
At the moment I am trying to change our vbox_control script to detect
this case, shut down the VM, and somehow tweak the output files so that
our validator can recognize this. What I would like to have instead is
another file in shared/ that I can fill with an error code, with the
vboxwrapper transmitting this value as the exit_status. The scheduler
could then handle this and send a new result with a higher
rsc_memory_bound.
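
A rough sketch of the wrapper side of what I have in mind; the file
name shared/error_code and the function are made up, just to illustrate
the idea:

    #include <cstdio>

    // Read a (hypothetical) shared/error_code file written by the science
    // app inside the VM. Returns 0 if the file is absent or unreadable
    // (normal finish), otherwise the error code the app wrote.
    int read_guest_error_code(const char* path /* e.g. "shared/error_code" */) {
        std::FILE* f = std::fopen(path, "r");
        if (!f) return 0;
        int code = 0;
        if (std::fscanf(f, "%d", &code) != 1) code = 0;
        std::fclose(f);
        return code;
    }

    // The wrapper would then report a non-zero code as the task's
    // exit_status (e.g. by passing it to boinc_finish()), so the scheduler
    // could resend the work with a higher rsc_memory_bound and memory_size_mb.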

That's it for this first beta test.

Regards
Christian
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.