Okay, I've updated vboxwrapper so that future VM names will be based on slot directory IDs. I have also set things up so that, on newer versions of VirtualBox, the VM's description field will contain the task name.
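In wrapper-side terms that amounts to something like the sketch below. The helper, the example slot path and the placeholder task name are illustrative only, not the committed vboxwrapper code, and the --description switch needs a reasonably recent VBoxManage:

    #include <cstdio>
    #include <string>

    // Illustrative sketch: build the VM name from the slot directory number
    // instead of the (possibly very long) task name, and put the task name
    // into the VM description instead.
    std::string vm_name_from_slot(const std::string& slot_dir) {
        // BOINC slot directories look like ".../slots/3"; use the trailing number.
        std::string::size_type pos = slot_dir.find_last_of("/\\");
        return "boinc_" + slot_dir.substr(pos + 1);        // e.g. "boinc_3"
    }

    int main() {
        std::string vm_name   = vm_name_from_slot("/var/lib/boinc/slots/3");
        std::string task_name = "example_task_name_0";     // placeholder, not a real task

        // "VBoxManage modifyvm <vm> --description <text>" stores free-form text
        // that shows up in the VirtualBox GUI; older releases may not support it.
        std::string cmd = "VBoxManage modifyvm " + vm_name +
                          " --description \"" + task_name + "\"";
        std::printf("%s\n", cmd.c_str());   // a real wrapper would run this command
        return 0;
    }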
I do not see any obvious reason why the task was being considered a crash. I've changed the wrapper so that if it detects a VM state we know is not a crash, it resets the crash flag to false.
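Roughly, the new check amounts to this (a simplified sketch; the state and function names are illustrative, not the exact vboxwrapper source):

    #include <cstdio>

    // Only treat the VM as crashed if it is in a state we cannot account for.
    enum VMState {
        VM_RUNNING,
        VM_PAUSED,
        VM_SAVED,        // the wrapper itself asked for a save-state shutdown
        VM_POWERED_OFF,  // clean power-off
        VM_ABORTED,      // VirtualBox reports the VM process died unexpectedly
        VM_UNKNOWN
    };

    bool vm_crashed(VMState state) {
        switch (state) {
        case VM_RUNNING:
        case VM_PAUSED:
        case VM_SAVED:
        case VM_POWERED_OFF:
            // A state we know is not a crash: clear the flag instead of
            // reporting exit_status 194 back to the project.
            return false;
        default:
            return true;
        }
    }

    int main() {
        std::printf("powered off -> crashed? %s\n", vm_crashed(VM_POWERED_OFF) ? "yes" : "no");
        std::printf("aborted     -> crashed? %s\n", vm_crashed(VM_ABORTED) ? "yes" : "no");
        return 0;
    }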
x86: http://boinc.berkeley.edu/dl/vboxwrapper_26026_windows_intelx86.zip
x64: http://boinc.berkeley.edu/dl/vboxwrapper_26026_windows_x86_64.zip

----- Rom

-----Original Message-----
From: Christian Beer [mailto:[email protected]]
Sent: Wednesday, September 18, 2013 2:57 AM
To: Rom Walton
Cc: [email protected]; BOINCDev Mailing List
Subject: Re: [boinc_dev] preliminary findings from RNA World VirtualBox Betatest

Hi Rom,

I'm all for shortening the VM name. Maybe you could use something like projectname_x, where x is the slot ID. This way the volunteers can still see which project is running the VM. But this may still be problematic with long project names, so I would rather vote for the boinc_slot_x name.

As for the premature shutdown, I attached some vboxwrapper log extracts that show the problematic and successful results (I have 15 results with error 194 and pretty long stderr_out; I can send them to you if needed). I've seen the problems with 7.2.5, 7.0.64 and 7.0.28 (all Windows). I'm running vboxwrapper_26024 and I don't specify any timeouts for the vboxwrapper. My vbox_job.xml is:

<vbox_job>
    <os_name>Debian_64</os_name>
    <memory_size_mb>4096</memory_size_mb>
    <enable_shared_directory/>
    <fraction_done_filename>shared/progress.txt</fraction_done_filename>
</vbox_job>

Regards
Christian

On 17.09.2013 23:54, Rom Walton wrote:
> Christian,
>
> All we need for the VM name is something unique; we could change the scheme to boinc_slot_x, where x is the slot ID, or something like that. I originally went with the result name so that volunteers could match up the VM with the task that was executing.
>
> Do you happen to have any more information about the premature shutdown? Are you using a time limit for your results? IIRC, the way we originally envisioned things is that when vboxwrapper detects that the VM shut down without the wrapper itself telling it to shut down, it assumes the result is complete.
>
> If you are seeing the 'premature' text, it implies that a timeout value has been specified.
>
> ----- Rom
>
> -----Original Message-----
> From: boinc_dev [mailto:[email protected]] On Behalf Of Christian Beer
> Sent: Tuesday, September 17, 2013 2:21 PM
> To: [email protected]; BOINCDev Mailing List
> Subject: [boinc_dev] preliminary findings from RNA World VirtualBox Betatest
>
> Hello,
>
> During the last 7 days I ran a Betatest on RNA World with the new VirtualBox feature of BOINC. These are some preliminary findings that I would like to present to the BOINC developers and project communities. For those for whom it is TL;DR, let me summarize: I present some problems we ran into using the vboxwrapper, and possible solutions that I want to discuss at the upcoming BOINC workshop in Grenoble.
>
> First let me distinguish our usage of the vboxwrapper from existing projects that are using the same technology. Test4Theory is using the VirtualBox feature to distribute a CERN-own VM to volunteer computers but is not sending work via BOINC; they use some other work distribution technology that is outside of BOINC. I couldn't find out how they grant credit. Climate@home is also using vboxwrapper with a rather large VM and uploads a temporary result every hour. It seems they grant a fixed amount of credit per workunit.
>
> The main purpose for RNA World to implement the vboxwrapper was the lack of native checkpointing in our generic application. As none of the other projects seem to use this feature, it is crucial to us that it works. At first I found out that if we run the scientific application, controlled by a bash script, in the shared/ directory on the host, the bash script ends with an error if the VM gets restored from a snapshot. The solution is to move the computation inside the VM and just copy the output files into the shared/ directory when finished. This works very well for us.
>
> Another thing I noticed in the Betatest is a path length issue on Windows systems. At the moment the vboxwrapper creates a VM for every task using the task name as identifier. As RNA World uses rather long task names, this quickly triggered the 256-character path length limit on Windows. I circumvented this by shortening the task name, but we lose some information that is useful for volunteers, because they want to know which bacterium or virus their computer is working on. I don't consider this a real solution at the moment.
>
> The next thing that I noticed during the test was that under some unknown circumstances the vboxwrapper detects a premature shutdown of the VM during the normal shutdown at the end of the task. This gets reported with exit_status=194 and considered an error. Nevertheless, the output files are already present and get uploaded to the project server. For now I manually reset the task's outcome to success, but I would like to not do this in the future. So whatever happens after the vboxwrapper detects a normal shutdown of the VM should not be considered an error per se, as the output files may be present and valid (which is later checked on the server if outcome=success).
>
> The test was focused on 64-bit machines with more than 4 GB of RAM, so I set memory_size_mb in vbox_job.xml to 4096 and rsc_memory_bound to 4.2e9 in the input template. As it turned out, this was recognized by the scheduler and only machines with more than 4 GB of RAM got work. But the client didn't take this into account: I had four RNA World VMs running at the same time on my Intel Core i5 laptop with only 5 GB of RAM. When all VMs were powering up at the same time my computer got very laggy (which was also reported by other users in the forums). In the end three of the VMs errored out with different messages and only one is still crunching. I can only presume that it is the lack of memory that is causing this, because I also have a desktop machine with 16 GB of RAM and no errors and no lags. So what would be a solution for us is that the client does not start any VirtualBox task if there is not enough memory available. If this should already be the current behaviour, we may have to run a more extensive test on this issue.
>
> I also noticed that credit seems to be calculated using cpu_time rather than elapsed_time (which is much higher); is this intentional? We just recently switched to CreditNew and are also granting credit for our non-VM application, which reports a high cpu_time and thus gets more credit compared to the VM app. This is not critical at the moment, but as soon as we go into production with this app it is.
>
> Another point that I would like to have fixed in the future is the inability of the science app to communicate with the vboxwrapper. At the moment only the fraction_done value is read from a file that I specify in the vbox_job.xml, and the init_data.xml can be shared with the VM using a floppy disk image. What I want is to trigger a specific abort (error code) from within the VM. As our application is generic and we can't say in advance how much RAM a task needs, we may have to reschedule the task with a higher rsc_memory_bound and memory_size_mb. At the moment I am trying to change our vbox_control script to detect this case, shut down the VM, and somehow tweak the output files so that our validator can recognize it. What I would like to have is another file in shared/ that I can fill with an error code, and the vboxwrapper transmits this value as exit_status. The scheduler can then handle this and send a new result with a higher rsc_memory_bound.
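(In wrapper-side terms, the proposed hand-off would look roughly like the sketch below; the shared/exit_code file name, the helper, and the boinc_finish() call mentioned in the comments are illustrative assumptions, not existing vboxwrapper behaviour.)

    #include <cstdio>

    // Illustrative sketch only: the science app writes an integer into a file
    // in the shared/ directory and the wrapper reports it as the task's
    // exit_status.  "shared/exit_code" is an assumed name for the example.
    bool read_requested_exit_code(int& code) {
        FILE* f = std::fopen("shared/exit_code", "r");
        if (!f) return false;
        bool ok = (std::fscanf(f, "%d", &code) == 1);
        std::fclose(f);
        return ok;
    }

    int main() {
        int code = 0;
        if (read_requested_exit_code(code) && code != 0) {
            // A real wrapper would call boinc_finish(code) here so the scheduler
            // sees exit_status == code and can resend with a higher rsc_memory_bound.
            std::printf("science app requested exit_status %d\n", code);
        } else {
            std::printf("no error code requested; treat as normal completion\n");
        }
        return 0;
    }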
> That's it for this first Betatest.
>
> Regards
> Christian

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and (near bottom of page) enter your email address.
