Hi Rom,
I'm all for shortening the VM name. Maybe you could use something like
projectname_x, where x is the slot ID. That way the volunteers can still
see which project is running the VM. But this may still be a problem
with long project names, so I would rather vote for the boinc_slot_x name.
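
To make the two naming schemes concrete, here is a minimal sketch. The helper vm_name and the cap of 32 characters are assumptions for illustration only; neither is part of vboxwrapper.

```python
def vm_name(slot_id: int, project: str = "", max_len: int = 32) -> str:
    """Build a short VM name that is unique per BOINC slot.

    Hypothetical helper: with a project name the result is
    '<project>_<slot_id>', truncated so the whole name fits into
    max_len characters. Without one it is 'boinc_slot_<slot_id>'.
    """
    suffix = f"_{slot_id}"
    if not project:
        return f"boinc_slot{suffix}"
    # Replace anything that is not a letter or digit, so the name is
    # safe to use as a directory name.
    safe = "".join(c if c.isalnum() else "-" for c in project)
    # Truncate the project part only; the slot ID must always survive.
    return safe[: max_len - len(suffix)] + suffix


print(vm_name(3))
print(vm_name(3, "RNA World"))
```

Because the slot ID is appended after truncation, two tasks of the same project in different slots never end up with the same name.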
As for the premature shutdown, I have attached some vboxwrapper log
extracts that show a problematic and a successful result. (I have 15
results with error 194 and a pretty long stderr_out; I can send them to
you if needed.) I've seen the problem with 7.2.5, 7.0.64 and 7.0.28 (all
Windows). I'm running vboxwrapper_26024 and I don't specify any timeouts
for the vboxwrapper.
My vbox_job.xml is:
<vbox_job>
<os_name>Debian_64</os_name>
<memory_size_mb>4096</memory_size_mb>
<enable_shared_directory/>
<fraction_done_filename>shared/progress.txt</fraction_done_filename>
</vbox_job>
Regards
Christian
On 17.09.2013 23:54, Rom Walton wrote:
> Christian,
>
> All we need for the VM name is something that is unique, we could change
> the scheme to boinc_slot_x where x is the slot ID or something like
> that. I originally went with the result name so that volunteers could
> match up the VM with the task that was executing.
>
> Do you happen to have any more information about the premature shutdown?
> Are you using a time limit for your results? IIRC, the way we
> originally envisioned things is that when vboxwrapper detects that the
> VM shut down without the wrapper itself telling the VM to shut down, it
> assumes the result is complete.
>
> If you are seeing the 'premature' text, it implies that a timeout value
> has been specified.
>
> ----- Rom
>
>
> -----Original Message-----
> From: boinc_dev [mailto:[email protected]] On Behalf Of
> Christian Beer
> Sent: Tuesday, September 17, 2013 2:21 PM
> To: [email protected]; BOINCDev Mailing List
> Subject: [boinc_dev] preliminary findings from RNA World VirtualBox
> Betatest
>
> Hello,
>
> during the last 7 days I ran a beta test on RNA World with the new
> VirtualBox feature of BOINC. These are some preliminary findings that I
> would like to present to the BOINC developers and project communities.
> For those who find this TL;DR, let me summarize: I present some problems
> we ran into using the vboxwrapper, and possible solutions that I want to
> discuss at the upcoming BOINC workshop in Grenoble.
>
> First let me distinguish our usage of the vboxwrapper from that of
> existing projects that use the same technology. Test4Theory uses the
> VirtualBox feature to distribute CERN's own VM to volunteer computers
> but does not send work via BOINC. They use some other work distribution
> technology that is outside of BOINC. I couldn't find out how they grant
> credit.
> Climate@home also uses vboxwrapper, with a rather large VM, and
> uploads a temporary result every hour. It seems they grant a fixed
> amount of credit per workunit.
>
> The main reason for RNA World to adopt the vboxwrapper was the lack
> of native checkpointing in our generic application. As none of the other
> projects seem to use this feature, it is crucial to us that it works. At
> first I found that if we run the scientific application, controlled
> by a bash script, in the shared/ directory on the host, the bash
> script ends with an error when the VM gets restored from a snapshot.
> The solution is to move the computation inside the VM and just copy the
> output files into the shared/ directory when finished. This works very
> well for us.
>
> Another thing I noticed in the beta test is a path length issue on
> Windows systems. At the moment the vboxwrapper creates a VM for every
> task, using the task name as identifier. As RNA World uses rather long
> task names, this quickly hit the 260 character path length limit
> (MAX_PATH) on Windows. I worked around this by shortening the task name,
> but we lose some information that is useful for volunteers, because they
> want to know what bacterium or virus their computer is working on. I
> don't consider this a real solution at the moment.
>
> The next thing I noticed during the test was that under some
> unknown circumstances the vboxwrapper detects a premature shutdown of
> the VM during the normal shutdown at the end of the task. This gets
> reported with exit_status=194 and is considered an error. Nevertheless,
> the output files are already present and get uploaded to the project
> server. For now I manually reset the task's outcome to success, but I
> would like to avoid this in the future. So whatever happens after the
> vboxwrapper detects a normal shutdown of the VM should not be considered
> an error per se, as the output files may be present and valid (which is
> checked later on the server if outcome=success).
>
> The test was focused on 64-bit machines with more than 4 GB of RAM, so I
> set memory_size_mb in vbox_job.xml to 4096 and rsc_memory_bound to 4.2e9
> in the input template. As it turned out, this was recognized by the
> scheduler and only machines that had more than 4 GB of RAM got work. But
> the client didn't take this into account, as I had four RNA World VMs
> running at the same time on my Intel Core i5 laptop with only 5 GB of
> RAM. When all VMs were powering up at the same time my computer
> got very laggy (which was also reported by other users in the forums).
> In the end three of the VMs errored out with different messages and only
> one is still crunching. I can only presume that the lack of memory is
> causing this, because I also have a desktop machine with 16 GB of RAM
> that shows no errors and no lags. So a solution for us would be that
> the client does not start any VBox task if there is not enough memory
> available. If this is already the intended behaviour, we may have to run
> a more extensive test on this issue.
>
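
The arithmetic behind the memory problem above can be sketched as follows. This only illustrates the kind of check requested and is not how the BOINC client actually schedules tasks; the function name and the 1024 MB reserved for the host are assumptions.

```python
def max_concurrent_vms(host_ram_mb: int, vm_memory_mb: int,
                       reserve_mb: int = 1024) -> int:
    """Return how many VMs of vm_memory_mb fit into the host's RAM.

    reserve_mb is a hypothetical amount kept free for the host OS and
    other programs; the real client may use a different policy.
    """
    usable = host_ram_mb - reserve_mb
    if usable <= 0:
        return 0
    return usable // vm_memory_mb


# Laptop with 5 GB of RAM and memory_size_mb = 4096
print(max_concurrent_vms(5 * 1024, 4096))
# Desktop with 16 GB of RAM
print(max_concurrent_vms(16 * 1024, 4096))
```

Under these assumptions the 5 GB laptop has room for a single 4096 MB VM, not the four that were started.
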
> I also noticed that credit seems to be calculated using cpu_time rather
> than elapsed_time (which is way higher). Is this intentional? We just
> recently switched to CreditNew and are also granting credit for our
> non-VM application, which reports a high cpu_time and thus gets more
> credit than the VM app. This is not critical at the moment, but it will
> be as soon as we go into production with this app.
>
> Another point that I would like to see fixed in the future is the
> inability of the science app to communicate with the vboxwrapper. At
> the moment only the fraction_done value is read from a file that I
> specify in vbox_job.xml, and init_data.xml can be shared with the
> VM using a floppy disk image. What I want is to trigger a specific abort
> (error code) from within the VM. As our application is generic and we
> can't say in advance how much RAM a task needs, we may have to
> reschedule the task with a higher rsc_memory_bound and memory_size_mb.
> At the moment I am trying to change our vbox_control script to detect
> this case, shut down the VM and somehow tweak the output files so that
> our validator can recognize it. What I would like to have is another
> file in shared/ that I can fill with an error_code, which the
> vboxwrapper then transmits as exit_status. The scheduler could then
> handle this and send a new result with a higher rsc_memory_bound.
>
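
A rough sketch of how the wrapper side of such an error-code file could look. Everything here is hypothetical: vboxwrapper has no such feature today, and the file name error_code.txt and the accepted range 0..255 are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional


def read_exit_status(shared_dir: str,
                     filename: str = "error_code.txt") -> Optional[int]:
    """Read an exit status that the science app wrote into shared/.

    Returns None if the file is missing, unreadable, not an integer,
    or outside the assumed range 0..255, so the caller can fall back
    to its normal behaviour.
    """
    path = Path(shared_dir) / filename
    try:
        text = path.read_text().strip()
    except OSError:
        return None
    try:
        code = int(text)
    except ValueError:
        return None
    return code if 0 <= code <= 255 else None
```

Inside the VM the science app would only have to write one number into that file before shutting down, for example a project-defined code meaning "out of memory".
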
> That's it for this first beta test.
>
> Regards
> Christian
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.
This is a successful result that returned with exit_status=0 from the same
workunit as the problematic result:
Operating System: Windows 7 Ultimate x64 SP1
<core_client_version>7.0.64</core_client_version>
<![CDATA[
<stderr_txt>
2013-09-12 20:45:49 (4528): Detected: VirtualBox 4.2.16r86992
2013-09-12 20:45:52 (4528): Registering VM.
(boinc_cmsvm_GA-p[BR-Lin64f]_1_Bacillus-amyloliquefaciens-FZB42_1379006666_9_0)
2013-09-12 20:45:52 (4528): Setting CPU Count for VM. (1)
2013-09-12 20:45:53 (4528): Setting Memory Size for VM. (4096MB)
2013-09-12 20:45:54 (4528): Setting Chipset Options for VM.
2013-09-12 20:45:54 (4528): Setting Boot Options for VM.
2013-09-12 20:45:55 (4528): Setting Network Configuration for VM.
2013-09-12 20:45:55 (4528): Disabling USB Support for VM.
2013-09-12 20:45:55 (4528): Disabling COM Port Support for VM.
2013-09-12 20:45:56 (4528): Disabling LPT Port Support for VM.
2013-09-12 20:45:56 (4528): Disabling Audio Support for VM.
2013-09-12 20:45:56 (4528): Disabling Clipboard Support for VM.
2013-09-12 20:45:57 (4528): Disabling Drag and Drop Support for VM.
2013-09-12 20:45:57 (4528): Adding storage controller to VM.
2013-09-12 20:45:58 (4528): Adding virtual disk drive to VM.
2013-09-12 20:45:58 (4528): Enabling shared directory for VM.
2013-09-12 20:45:59 (4528): Starting VM.
2013-09-12 20:46:01 (4528): Successfully started VM.
2013-09-12 20:46:01 (4528): Setting cpu throttle for VM. (100%)
....
2013-09-13 06:07:39 (4528): Powering off VM.
2013-09-13 06:07:40 (4528): Successfully powered off VM.
2013-09-13 23:36:47 (1788): Detected: VirtualBox 4.2.16r86992
2013-09-13 23:36:50 (1788): Restore from previously saved snapshot.
2013-09-13 23:36:50 (1788): Restore completed.
2013-09-13 23:36:50 (1788): Starting VM.
2013-09-13 23:37:07 (1788): Successfully started VM.
2013-09-13 23:37:07 (1788): Setting cpu throttle for VM. (100%)
....
2013-09-14 08:13:38 (1788): Powering off VM.
2013-09-14 08:13:39 (1788): Successfully powered off VM.
2013-09-14 09:43:18 (8640): Detected: VirtualBox 4.2.16r86992
2013-09-14 09:43:23 (8640): Restore from previously saved snapshot.
2013-09-14 09:43:23 (8640): Restore completed.
2013-09-14 09:43:23 (8640): Starting VM.
2013-09-14 09:43:59 (8640): Successfully started VM.
2013-09-14 09:43:59 (8640): Setting cpu throttle for VM. (100%)
....
2013-09-14 14:38:22 (8640): Creating new snapshot for VM.
2013-09-14 14:38:30 (8640): Deleting stale snapshot.
2013-09-14 14:38:34 (8640): Checkpoint completed.
2013-09-14 14:42:19 (8640): VM is no longer is a running state. It is in
'poweroff'.
2013-09-14 14:42:19 (8640): Powering off VM.
2013-09-14 14:42:19 (8640): Deregistering VM.
2013-09-14 14:42:20 (8640): Deleting stale snapshot.
2013-09-14 14:42:20 (8640): Removing storage controller(s) from VM.
2013-09-14 14:42:20 (8640): Removing VM from VirtualBox.
2013-09-14 14:42:20 (8640): Removing virtual disk drive from VirtualBox.
2013-09-14 14:42:26 (8640): Virtual machine exited.
14:42:26 (8640): called boinc_finish
This is a problematic result that returned with exit_status=194:
Operating System: Windows 7 Home Premium x64 SP1
<core_client_version>7.2.5</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 194 (0xc2)
</message>
<stderr_txt>
2013-09-13 21:52:17 (16108): Detected: VirtualBox 4.2.16r86992
2013-09-13 21:52:18 (16108): Registering VM.
(boinc_cmsvm_GA-p[BR-Lin64f]_1_Bacillus-amyloliquefaciens-FZB42_1379006666_9_2)
2013-09-13 21:52:19 (16108): Setting CPU Count for VM. (1)
2013-09-13 21:52:19 (16108): Setting Memory Size for VM. (4096MB)
2013-09-13 21:52:19 (16108): Setting Chipset Options for VM.
2013-09-13 21:52:20 (16108): Setting Boot Options for VM.
2013-09-13 21:52:20 (16108): Setting Network Configuration for VM.
2013-09-13 21:52:21 (16108): Disabling USB Support for VM.
2013-09-13 21:52:21 (16108): Disabling COM Port Support for VM.
2013-09-13 21:52:21 (16108): Disabling LPT Port Support for VM.
2013-09-13 21:52:21 (16108): Disabling Audio Support for VM.
2013-09-13 21:52:22 (16108): Disabling Clipboard Support for VM.
2013-09-13 21:52:22 (16108): Disabling Drag and Drop Support for VM.
2013-09-13 21:52:22 (16108): Adding storage controller to VM.
2013-09-13 21:52:22 (16108): Adding virtual disk drive to VM.
2013-09-13 21:52:23 (16108): Enabling shared directory for VM.
2013-09-13 21:52:23 (16108): Starting VM.
2013-09-13 21:52:24 (16108): Successfully started VM.
2013-09-13 21:52:24 (16108): Setting cpu throttle for VM. (100%)
....
2013-09-13 22:52:52 (16108): Powering off VM.
2013-09-13 22:52:52 (16108): Successfully powered off VM.
2013-09-13 22:54:23 (15600): Detected: VirtualBox 4.2.16r86992
2013-09-13 22:54:23 (15600): Restore from previously saved snapshot.
2013-09-13 22:54:24 (15600): Restore completed.
2013-09-13 22:54:24 (15600): Starting VM.
2013-09-13 22:54:25 (15600): Successfully started VM.
2013-09-13 22:54:25 (15600): Setting cpu throttle for VM. (100%)
....
2013-09-14 00:14:28 (15600): Creating new snapshot for VM.
2013-09-14 00:14:30 (15600): Deleting stale snapshot.
2013-09-14 00:14:31 (15600): Checkpoint completed.
2013-09-14 00:16:20 (14856): Detected: VirtualBox 4.2.16r86992
2013-09-14 00:16:20 (14856): Powering off VM.
2013-09-14 00:16:21 (14856): Successfully powered off VM.
2013-09-14 00:16:21 (14856): Restore from previously saved snapshot.
2013-09-14 00:16:21 (14856): Restore completed.
2013-09-14 00:16:21 (14856): Starting VM.
2013-09-14 00:16:23 (14856): Successfully started VM.
2013-09-14 00:16:23 (14856): Setting cpu throttle for VM. (100%)
2013-09-14 00:27:09 (5720): Detected: VirtualBox 4.2.16r86992
2013-09-14 00:27:10 (5720): Restore from previously saved snapshot.
2013-09-14 00:27:10 (5720): Restore completed.
2013-09-14 00:27:10 (5720): Starting VM.
2013-09-14 00:27:16 (5720): Successfully started VM.
2013-09-14 00:27:16 (5720): Setting cpu throttle for VM. (100%)
....
2013-09-14 13:37:30 (5720): Creating new snapshot for VM.
2013-09-14 13:37:32 (5720): Deleting stale snapshot.
2013-09-14 13:37:33 (5720): Checkpoint completed.
2013-09-14 13:47:30 (5720): Creating new snapshot for VM.
2013-09-14 13:47:32 (5720): Deleting stale snapshot.
2013-09-14 13:47:32 (5720): Checkpoint completed.
2013-09-14 13:49:58 (5720): VM is no longer is a running state. It is in
'poweroff'.
2013-09-14 13:50:43 (5720): Powering off VM.
2013-09-14 13:50:43 (5720): Deregistering VM.
2013-09-14 13:50:43 (5720): Deleting stale snapshot.
2013-09-14 13:50:43 (5720): Removing storage controller(s) from VM.
2013-09-14 13:50:44 (5720): Removing VM from VirtualBox.
2013-09-14 13:50:44 (5720): Removing virtual disk drive from VirtualBox.
2013-09-14 13:50:49 (5720): VM Premature Shutdown Detected.
Hypervisor System Log:
evice attached to device slot 1 on port 1 of controller 'Hard Disk
Controller'}, preserve=false
....
13:22:46.576786 Changing the VM state from 'DESTROYING' to 'TERMINATED'.
VM Exit Code: 0 (0x0)
13:50:49 (5720): called boinc_finish
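
One difference is visible in the two extracts above: in the successful result the wrapper logs "Powering off VM." in the same second in which it notices the 'poweroff' state (14:42:19), while in the failing result 45 seconds pass (13:49:58 to 13:50:43). The logs do not show whether this delay has anything to do with error 194. The following sketch only measures the gap, so that the other results can be compared; the function name and the regular expression are assumptions based on the line format shown here.

```python
import re
from datetime import datetime
from typing import Optional

# Matches lines like '2013-09-14 13:49:58 (5720): some message'
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): (.*)$")


def shutdown_gap_seconds(log_text: str) -> Optional[float]:
    """Seconds between the wrapper noticing that the VM stopped and
    the following 'Powering off VM.' line, or None if not found."""
    stopped_at = None
    for raw in log_text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # wrapped or hypervisor lines carry no full timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(2)
        if message.startswith("VM is no longer is a running state"):
            stopped_at = stamp
        elif stopped_at is not None and message.startswith("Powering off VM"):
            return (stamp - stopped_at).total_seconds()
    return None
```

Earlier "Powering off VM." lines in a log are ignored, because the gap is only measured after the wrapper has reported that the VM left the running state.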