Okay, I've updated vboxwrapper so that future VM names will be based on slot directory IDs. I have also set things up so that, on newer versions of VirtualBox, the VM's description field will contain the task name.
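In wrapper-side terms that amounts to something like the sketch below. The helper, the example slot path and the placeholder task name are illustrative only, not the committed vboxwrapper code, and the --description switch needs a reasonably recent VBoxManage:

    #include <cstdio>
    #include <string>

    // Illustrative sketch: build the VM name from the slot directory number
    // instead of the (possibly very long) task name, and put the task name
    // into the VM description instead.
    std::string vm_name_from_slot(const std::string& slot_dir) {
        // BOINC slot directories look like ".../slots/3"; use the trailing number.
        std::string::size_type pos = slot_dir.find_last_of("/\\");
        return "boinc_" + slot_dir.substr(pos + 1);        // e.g. "boinc_3"
    }

    int main() {
        std::string vm_name   = vm_name_from_slot("/var/lib/boinc/slots/3");
        std::string task_name = "example_task_name_0";     // placeholder, not a real task

        // "VBoxManage modifyvm <vm> --description <text>" stores free-form text
        // that shows up in the VirtualBox GUI; older releases may not support it.
        std::string cmd = "VBoxManage modifyvm " + vm_name +
                          " --description \"" + task_name + "\"";
        std::printf("%s\n", cmd.c_str());   // a real wrapper would run this command
        return 0;
    }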
I do not see any obvious reason why the task was being considered a crash. I've changed the wrapper so that if it detects a VM state we know is not a crash, it resets the crash flag to false.
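Roughly, the new check amounts to this (a simplified sketch; the state and function names are illustrative, not the exact vboxwrapper source):

    #include <cstdio>

    // Only treat the VM as crashed if it is in a state we cannot account for.
    enum VMState {
        VM_RUNNING,
        VM_PAUSED,
        VM_SAVED,        // the wrapper itself asked for a save-state shutdown
        VM_POWERED_OFF,  // clean power-off
        VM_ABORTED,      // VirtualBox reports the VM process died unexpectedly
        VM_UNKNOWN
    };

    bool vm_crashed(VMState state) {
        switch (state) {
        case VM_RUNNING:
        case VM_PAUSED:
        case VM_SAVED:
        case VM_POWERED_OFF:
            // A state we know is not a crash: clear the flag instead of
            // reporting exit_status 194 back to the project.
            return false;
        default:
            return true;
        }
    }

    int main() {
        std::printf("powered off -> crashed? %s\n", vm_crashed(VM_POWERED_OFF) ? "yes" : "no");
        std::printf("aborted     -> crashed? %s\n", vm_crashed(VM_ABORTED) ? "yes" : "no");
        return 0;
    }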
x86: http://boinc.berkeley.edu/dl/vboxwrapper_26026_windows_intelx86.zip
x64: http://boinc.berkeley.edu/dl/vboxwrapper_26026_windows_x86_64.zip

----- Rom

-----Original Message-----
From: Christian Beer [mailto:[email protected]]
Sent: Wednesday, September 18, 2013 2:57 AM
To: Rom Walton
Cc: [email protected]; BOINCDev Mailing List
Subject: Re: [boinc_dev] preliminary findings from RNA World VirtualBox Betatest

Hi Rom,

I'm all for shortening the VM name. Maybe you could use something like projectname_x, where x is the slot ID. This way the volunteers can still see which project is running the VM. But this may still be problematic with long project names, so I would rather vote for the boinc_slot_x name.

As for the premature shutdown, I attached some vboxwrapper log extracts that show the problematic and successful results (I have 15 results with error 194 and pretty long stderr_out; I can send them to you if needed). I've seen the problems with 7.2.5, 7.0.64 and 7.0.28 (all Windows). I'm running vboxwrapper_26024 and I don't specify any timeouts for the vboxwrapper. My vbox_job.xml is:

<vbox_job>
    <os_name>Debian_64</os_name>
    <memory_size_mb>4096</memory_size_mb>
    <enable_shared_directory/>
    <fraction_done_filename>shared/progress.txt</fraction_done_filename>
</vbox_job>

Regards
Christian

On 17.09.2013 23:54, Rom Walton wrote:
> Christian,
>
> All we need for the VM name is something unique; we could change the scheme to boinc_slot_x, where x is the slot ID, or something like that. I originally went with the result name so that volunteers could match up the VM with the task that was executing.
>
> Do you happen to have any more information about the premature shutdown? Are you using a time limit for your results? IIRC, the way we originally envisioned things is that when vboxwrapper detects that the VM shut down without the wrapper itself telling it to shut down, it assumes the result is complete.
>
> If you are seeing the 'premature' text, it implies that a timeout value has been specified.
>
> ----- Rom
>
> -----Original Message-----
> From: boinc_dev [mailto:[email protected]] On Behalf Of Christian Beer
> Sent: Tuesday, September 17, 2013 2:21 PM
> To: [email protected]; BOINCDev Mailing List
> Subject: [boinc_dev] preliminary findings from RNA World VirtualBox Betatest
>
> Hello,
>
> During the last 7 days I ran a Betatest on RNA World with the new VirtualBox feature of BOINC. These are some preliminary findings that I would like to present to the BOINC developers and project communities. For those for whom it is TL;DR, let me summarize: I present some problems we ran into using the vboxwrapper, and possible solutions that I want to discuss at the upcoming BOINC workshop in Grenoble.
>
> First let me distinguish our usage of the vboxwrapper from existing projects that are using the same technology. Test4Theory is using the VirtualBox feature to distribute a CERN-own VM to volunteer computers but is not sending work via BOINC; they use some other work distribution technology that is outside of BOINC. I couldn't find out how they grant credit. Climate@home is also using vboxwrapper with a rather large VM and uploads a temporary result every hour. It seems they grant a fixed amount of credit per workunit.
>
> The main purpose for RNA World to implement the vboxwrapper was the lack of native checkpointing in our generic application. As none of the other projects seem to use this feature, it is crucial to us that it works. At first I found out that if we run the scientific application, controlled by a bash script, in the shared/ directory on the host, the bash script ends with an error if the VM gets restored from a snapshot. The solution is to move the computation inside the VM and just copy the output files into the shared/ directory when finished. This works very well for us.
>
> Another thing I noticed in the Betatest is a path length issue on Windows systems. At the moment the vboxwrapper creates a VM for every task using the task name as identifier. As RNA World uses rather long task names, this quickly triggered the 256-character path length limit on Windows. I circumvented this by shortening the task name, but we lose some information that is useful for volunteers, because they want to know which bacterium or virus their computer is working on. I don't consider this a real solution at the moment.
>
> The next thing that I noticed during the test was that under some unknown circumstances the vboxwrapper detects a premature shutdown of the VM during the normal shutdown at the end of the task. This gets reported with exit_status=194 and considered an error. Nevertheless, the output files are already present and get uploaded to the project server. For now I manually reset the task's outcome to success, but I would like to not do this in the future. So whatever happens after the vboxwrapper detects a normal shutdown of the VM should not be considered an error per se, as the output files may be present and valid (which is later checked on the server if outcome=success).
>
> The test was focused on 64-bit machines with more than 4 GB of RAM, so I set memory_size_mb in vbox_job.xml to 4096 and rsc_memory_bound to 4.2e9 in the input template. As it turned out, this was recognized by the scheduler and only machines with more than 4 GB of RAM got work. But the client didn't take this into account: I had four RNA World VMs running at the same time on my Intel Core i5 laptop with only 5 GB of RAM. When all VMs were powering up at the same time my computer got very laggy (which was also reported by other users in the forums). In the end three of the VMs errored out with different messages and only one is still crunching. I can only presume that it is the lack of memory that is causing this, because I also have a desktop machine with 16 GB of RAM and no errors and no lags. So what would be a solution for us is that the client does not start any VirtualBox task if there is not enough memory available. If this should already be the current behaviour, we may have to run a more extensive test on this issue.
>
> I also noticed that credit seems to be calculated using cpu_time rather than elapsed_time (which is much higher); is this intentional? We just recently switched to CreditNew and are also granting credit for our non-VM application, which reports a high cpu_time and thus gets more credit compared to the VM app. This is not critical at the moment, but as soon as we go into production with this app it is.
>
> Another point that I would like to have fixed in the future is the inability of the science app to communicate with the vboxwrapper. At the moment only the fraction_done value is read from a file that I specify in the vbox_job.xml, and the init_data.xml can be shared with the VM using a floppy disk image. What I want is to trigger a specific abort (error code) from within the VM. As our application is generic and we can't say in advance how much RAM a task needs, we may have to reschedule the task with a higher rsc_memory_bound and memory_size_mb. At the moment I am trying to change our vbox_control script to detect this case, shut down the VM, and somehow tweak the output files so that our validator can recognize it. What I would like to have is another file in shared/ that I can fill with an error code, and the vboxwrapper transmits this value as exit_status. The scheduler can then handle this and send a new result with a higher rsc_memory_bound.
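(In wrapper-side terms, the proposed hand-off would look roughly like the sketch below; the shared/exit_code file name, the helper, and the boinc_finish() call mentioned in the comments are illustrative assumptions, not existing vboxwrapper behaviour.)

    #include <cstdio>

    // Illustrative sketch only: the science app writes an integer into a file
    // in the shared/ directory and the wrapper reports it as the task's
    // exit_status.  "shared/exit_code" is an assumed name for the example.
    bool read_requested_exit_code(int& code) {
        FILE* f = std::fopen("shared/exit_code", "r");
        if (!f) return false;
        bool ok = (std::fscanf(f, "%d", &code) == 1);
        std::fclose(f);
        return ok;
    }

    int main() {
        int code = 0;
        if (read_requested_exit_code(code) && code != 0) {
            // A real wrapper would call boinc_finish(code) here so the scheduler
            // sees exit_status == code and can resend with a higher rsc_memory_bound.
            std::printf("science app requested exit_status %d\n", code);
        } else {
            std::printf("no error code requested; treat as normal completion\n");
        }
        return 0;
    }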
> That's it for this first Betatest.
>
> Regards
> Christian

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and (near bottom of page) enter your email address.
