Second try to get this mail to the mailing lists. I think my webmailer was sending an HTML mail that got scrubbed by the list.

Hi Rom,

thanks for the quick reaction. I started some more test units in the meantime and already have some new errors. Could you please take a look at these three tasks that are not recovering correctly from a snapshot?

http://www.rnaworld.de/rnaworld/result.php?resultid=14919688
http://www.rnaworld.de/rnaworld/result.php?resultid=14919685
http://www.rnaworld.de/rnaworld/result.php?resultid=14919684
I excluded VirtualBox 4.2.18 via plan_class; are you aware of the snapshot problem there? We also have some logs from our first round of tests if you need them.

On another side note: I am getting requests to publish the vboxwrapper for Mac OS. I read on T4T that there were problems, but that they now have a fixed vboxwrapper for Mac. What is the version number of this fixed version?

Another thing that comes to mind right now is the trickle issue. You can disregard my mail about problems with the Windows version: those hosts are sending trickle messages to the server, but don't say so in the startup messages from vboxwrapper. It would be great if the Windows and Linux versions of vboxwrapper produced the same output.

Regards
Christian

Sent: Wednesday, 18 September 2013, 22:41
From: "Rom Walton" <[email protected]>
To: "Christian Beer" <[email protected]>
Cc: [email protected], "BOINCDev Mailing List" <[email protected]>
Subject: RE: [boinc_dev] preliminary findings from RNA World VirtualBox Betatest

Okay, I've updated vboxwrapper so that future VM names will be based off of slot directory IDs. I have also set things up so that the description field for the VM should contain the task name on newer versions of VirtualBox.

I do not see any obvious reason why the task was being considered a crash. I've changed the wrapper so that if it detects a VM state we know isn't a crash, we reset the crash variable to false.
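A rough sketch of the slot-based naming scheme discussed here; the slot path and the exact name format are illustrative assumptions, not the actual vboxwrapper code:

```shell
# Derive a short, unique VM name from the slot directory instead of
# the (possibly very long) task name. Path and prefix are examples only.
slot_dir="/var/lib/boinc-client/slots/3"
slot_id=$(basename "$slot_dir")
vm_name="boinc_slot_${slot_id}"
echo "$vm_name"
```

Because slot IDs are small integers, a name like this stays well under the Windows 256-character path limit regardless of the task name.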
x86: http://boinc.berkeley.edu/dl/vboxwrapper_26026_windows_intelx86.zip
x64: http://boinc.berkeley.edu/dl/vboxwrapper_26026_windows_x86_64.zip

----- Rom

-----Original Message-----
From: Christian Beer [mailto:[email protected]]
Sent: Wednesday, September 18, 2013 2:57 AM
To: Rom Walton
Cc: [email protected]; BOINCDev Mailing List
Subject: Re: [boinc_dev] preliminary findings from RNA World VirtualBox Betatest

Hi Rom,

I'm all for shortening the VM name. Maybe you could use something like projectname_x, where x is the slot ID; that way volunteers can still see which project is running the VM. But this may still be problematic with long project names, so I would rather vote for the boinc_slot_x name.

As for the premature shutdown, I attached some vboxwrapper log extracts that show the problematic and the successful results (I have 15 results with error 194 and a pretty long stderr_out; I can send them to you if needed). I've seen the problems with 7.2.5, 7.0.64 and 7.0.28 (all Windows). I'm running vboxwrapper_26024 and don't specify any timeouts for the vboxwrapper. My vbox_job.xml is:

<vbox_job>
    <os_name>Debian_64</os_name>
    <memory_size_mb>4096</memory_size_mb>
    <enable_shared_directory/>
    <fraction_done_filename>shared/progress.txt</fraction_done_filename>
</vbox_job>

Regards
Christian

On 17.09.2013 23:54, Rom Walton wrote:
> Christian,
>
> All we need for the VM name is something that is unique; we could
> change the scheme to boinc_slot_x where x is the slot ID, or something
> like that. I originally went with the result name so that volunteers
> could match up the VM with the task that was executing.
>
> Do you happen to have any more information about the premature shutdown?
> Are you using a time limit for your results?
> IIRC, the way we originally envisioned things is that when vboxwrapper
> detects that the VM shut down without the wrapper itself telling the VM
> to shut down, it assumes the result is complete.
>
> If you are seeing the 'premature' text, it implies that a timeout
> value has been specified.
>
> ----- Rom
>
> -----Original Message-----
> From: boinc_dev [mailto:[email protected]] On Behalf
> Of Christian Beer
> Sent: Tuesday, September 17, 2013 2:21 PM
> To: [email protected]; BOINCDev Mailing List
> Subject: [boinc_dev] preliminary findings from RNA World VirtualBox
> Betatest
>
> Hello,
>
> during the last 7 days I ran a beta test on RNA World with the new
> VirtualBox feature of BOINC. These are some preliminary findings that
> I would like to present to the BOINC developers and project communities.
> For those for whom it is TL;DR, let me summarize: I present some
> problems we ran into using the vboxwrapper, and possible solutions
> that I want to discuss at the upcoming BOINC workshop in Grenoble.
>
> First, let me distinguish our usage of the vboxwrapper from existing
> projects that use the same technology. Test4Theory uses the
> VirtualBox feature to distribute a CERN-built VM to volunteer
> computers, but does not send work via BOINC; they use some other work
> distribution technology outside of BOINC. I couldn't find out how they
> grant credit.
> Climate@home also uses vboxwrapper, with a rather large VM, and
> uploads a temporary result every hour. It seems they grant a fixed
> amount of credit per workunit.
>
> The main reason for RNA World to adopt the vboxwrapper was the
> lack of native checkpointing in our generic application. As none of
> the other projects seem to use this feature, it is crucial to us that
> it works. At first I found that if we run the scientific application,
> controlled by a bash script, in the shared/ directory on the host,
> the bash script ends with an error if the VM gets restored from a snapshot.
> The solution is to move the computation inside the VM and just copy
> the output files into the shared/ directory when finished. This works
> very well for us.
>
> Another thing I noticed in the beta test is a path-length issue on
> Windows systems. At the moment, the vboxwrapper creates a VM for every
> task, using the task name as identifier. As RNA World uses rather long
> task names, this quickly hit the 256-character path length limit
> on Windows. I worked around this by shortening the task name, but we
> lose some information that is useful for volunteers, because they
> want to know which bacterium or virus their computer is working on. I
> don't consider this a real solution at the moment.
>
> The next thing I noticed during the test was that under some
> unknown circumstances the vboxwrapper detects a premature shutdown of
> the VM during the normal shutdown at the end of the task. This gets
> reported with exit_status=194 and is considered an error. Nevertheless,
> the output files are already present and get uploaded to the project
> server. For now I manually reset the task's outcome to success, but I
> would like not to have to do this in the future. So whatever happens
> after the vboxwrapper detects a normal shutdown of the VM, it should
> not be considered an error per se, as the output files may be present
> and valid (which is later checked on the server if outcome=success).
>
> The test focused on 64-bit machines with more than 4 GB of RAM, so I
> set memory_size_mb in vbox_job.xml to 4096 and rsc_memory_bound to
> 4.2e9 in the input template. As it turned out, this was recognized by
> the scheduler, and only machines with more than 4 GB of RAM got work.
> Nevertheless, the client didn't take this into account, as I had
> four RNA World VMs running at the same time on my Intel Core i5 laptop
> with only 5 GB of RAM. When all VMs were powering up at the same time,
> my computer got very laggy (which was also reported by other users in the forums).
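[Editor's aside: a minimal sketch of the kind of pre-start memory gate that would avoid this pile-up, assuming the client knows the task's memory_size_mb and the host's available RAM; the function name and numbers are illustrative, not existing client code.]

```shell
# Decide whether to start another VBox task: compare available host RAM
# (in kB) against the task's memory_size_mb. Purely illustrative logic.
can_start_vm() {
    avail_kb=$1
    required_mb=$2
    [ $(( avail_kb / 1024 )) -ge "$required_mb" ]
}

# e.g. 5 GB available, task wants 4096 MB -> ok to start this one,
# but a second concurrent 4096 MB VM would be deferred.
if can_start_vm 5242880 4096; then decision=start; else decision=defer; fi
echo "$decision"
```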
> In the end, three of the VMs errored out with different messages and
> only one is still crunching. I can only presume that it is the lack of
> memory that caused this, because I also have a desktop machine with
> 16 GB of RAM that shows no errors and no lag. So a solution for us
> would be for the client not to start any VirtualBox task if there is
> not enough memory available. If this is supposed to be the current
> behaviour, we may have to run a more extensive test on this issue.
>
> I also noticed that credit seems to be calculated using cpu_time
> rather than elapsed_time (which is far higher); is this intentional?
> We only recently switched to CreditNew and also grant credit for our
> non-VM application, which reports a high cpu_time and thus earns more
> credit than the VM app. This is not critical at the moment, but it
> will be as soon as we go into production with this app.
>
> Another point that I would like to see fixed in the future is the
> inability of the science app to communicate with the vboxwrapper. At
> the moment, only the fraction_done value is read from a file that I
> specify in vbox_job.xml, and init_data.xml can be shared with the VM
> using a floppy disk image. What I want is to trigger a specific abort
> (error code) from within the VM. As our application is generic and we
> can't say in advance how much RAM a task needs, we may have to
> reschedule the task with a higher rsc_memory_bound and memory_size_mb.
> At the moment I am trying to change our vbox_control script to detect
> this case, shut down the VM, and somehow tweak the output files so
> that our validator can recognize it. What I would like instead is
> another file in shared/ that I can fill with an error code, which the
> vboxwrapper then transmits as exit_status. The scheduler could then
> handle this and send a new result with a higher rsc_memory_bound.
>
> That's it for this first beta test.
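[Editor's aside: a sketch of the shared-file exit-code idea described above, with both sides written in shell for illustration. The file name exit_code, the code value, and the whole protocol are assumptions; this is not an existing vboxwrapper feature.]

```shell
# Stand-in for the shared/ directory on the host.
SHARED="${SHARED:-$(mktemp -d)}"

# Guest side: before shutting down, write a numeric code meaning e.g.
# "needs more RAM" (the value 42 is arbitrary here).
echo 42 > "$SHARED/exit_code"

# Wrapper side: after the VM stops, report the code as exit_status if
# present; otherwise treat the shutdown as a normal completion.
if [ -f "$SHARED/exit_code" ]; then
    exit_status=$(cat "$SHARED/exit_code")
else
    exit_status=0
fi
echo "exit_status=$exit_status"
```

With something like this, the scheduler could react to a specific exit_status by reissuing the result with a higher rsc_memory_bound.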
>
> Regards
> Christian
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and (near bottom of page) enter
> your email address.
