Christian,
26032 is up on http://boinc.berkeley.edu/dl/. It includes:

VBOX: Add logging in case of a trickle-up failure.
VBOX: Adjust the VM process priority right before a suspend command to speed up how quickly the VM is suspended.

----- Rom

From: Christian Beer
Sent: Thursday, November 07, 2013 6:45 AM
To: Rom Walton
Cc: BOINCDev Mailing List; David Anderson (BOINC); Lammert van der Veen
Subject: Re: ongoing problems with vboxwrapper

Hi Rom,

can you please move the new trickle status printf down after boinc_send_trickle_up() and check its return value? That is more helpful: if a trickle doesn't get sent, the reason will be in the log. I will then deploy the new version on RNAWorld and create some long-running tasks.

@David: Is there an easy way for me to assign these results to specific users? I want to focus on users I can contact in our forum so they can check stderr.txt during runtime.

Regards
Christian

On 07.11.2013 00:53, Rom Walton wrote:

I've posted 26031 to http://boinc.berkeley.edu/dl/. It contains the following changes:

VBOX: Use the same technique for calculating when to report a trickle as we use for performing checkpoints.
VBOX: Add a trickle-up status report entry to stderr.txt every time we send a trickle event.
VBOX: Add VirtualBox 4.3.0 to bad builds list.
VBOX: We only need to filter the vboxmanage output in one place.
VBOX: Add additional check to determine if the get VM log command really failed.

----- Rom

From: Christian Beer
Sent: Wednesday, November 06, 2013 11:29 AM
To: Rom Walton
Cc: BOINCDev Mailing List; David Anderson (BOINC); Lammert van der Veen
Subject: Re: ongoing problems with vboxwrapper

Possible explanation: the current control script is buggy and does not redirect the scientific app's stdout and stderr to files, so that output ends up in the VM log. But this is happening on all tasks, not just on some.
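Christian's request of 07 November asks for two things: log the trickle status only after boinc_send_trickle_up() returns, and include its return value. The following is a minimal Python illustration of that ordering. It is not vboxwrapper's actual C++ code. send_trickle_up_stub() is a hypothetical stand-in for BOINC's boinc_send_trickle_up(), assumed to return 0 on success and a nonzero error code otherwise. The log wording, variety name and payload are also made up.

```python
import io
import sys
import time


def send_trickle_up_stub(variety: str, text: str) -> int:
    """Hypothetical stand-in for BOINC's boinc_send_trickle_up().

    Assumed convention: 0 on success, nonzero error code on failure. An empty
    payload is treated as a failure here so both branches can be exercised
    without a running BOINC client.
    """
    return -1 if not text else 0


def send_and_log_trickle(variety: str, text: str, log=sys.stderr) -> int:
    """Send one trickle-up, then log the outcome including the return value."""
    retval = send_trickle_up_stub(variety, text)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    if retval:
        # The error code ends up in stderr.txt, so a failed send is diagnosable.
        log.write(f"{stamp}: Sending trickle-up message failed "
                  f"(variety '{variety}', error {retval}).\n")
    else:
        log.write(f"{stamp}: Trickle-up message sent (variety '{variety}').\n")
    return retval


if __name__ == "__main__":
    buf = io.StringIO()
    send_and_log_trickle("runtime", "<elapsed>43200</elapsed>", log=buf)
    send_and_log_trickle("runtime", "", log=buf)
    print(buf.getvalue(), end="")
```

Logging after the call, rather than before it, changes what a failure looks like in stderr.txt. A failed send leaves its error code there, instead of a status line that only records that a send was attempted.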
Regards
Christian

On 06.11.2013 17:24, Rom Walton wrote:

Ah, so we are tripping up on the new code to check for EXIT_OUT_OF_MEMORY. Fun fun fun.

Okay, I'll commit a change for this. I'm not sure why vboxmanage would be returning a non-zero exit status in this situation.

----- Rom

From: Christian Beer
Sent: Wednesday, November 06, 2013 11:11 AM
To: Rom Walton
Cc: BOINCDev Mailing List; David Anderson (BOINC); Lammert van der Veen
Subject: Re: ongoing problems with vboxwrapper

The general pattern seems to be the following (see the user's post: https://www.rechenkraft.net/forum/viewtopic.php?f=76&t=13059&start=180#p143176):

2013-10-17 11:21:40 (816): Creating new snapshot for VM.
2013-10-17 11:21:48 (816): Deleting stale snapshot.
2013-10-17 11:21:49 (816): Checkpoint completed.
2013-10-17 11:26:42 (816): Error in get vm log for VM: 3
Arguments:
showvminfo "boinc_fb76a72fc6655131" --log 0
Output:
VirtualBox VM 4.2.16 r86992 win.amd64 (Jul 4 2013 15:51:44) release log
00:00:00.042649 Log opened 2013-10-16T18:39:34.043413700Z

This is followed by the actual VM log (rather long) and then by this, over and over:

05:44:49.794013 ********************* End of CFGM dump **********************
05:44:50.074438 Changing the VM state from 'SUSPENDED' to 'RESUMING'.
05:44:50.074495 Changing the VM state from 'RESUMING' to 'RUNNING'.
05:44:55.318224 Changing the VM state from 'RUNNING' to 'SUSPENDING'.
05:44:55.774983 PDMR3Suspend: 456 736 124 ns run time
05:44:55.775003 Changing the VM state from 'SUSPENDING' to 'SUSPENDED'.
05:44:55.775847 DrvBlock: Flushes will be ignored
05:44:55.775855 DrvBlock: Async flushes will be passed to the disk
05:44:55.776116 VD: Opening the disk took 236410 ns
05:44:55.776131 PIIX3 ATA: LUN#0: disk, PCHS=4161/16/63, total number of sectors 4194304
05:44:55.776139 ************************* CFGM dump *************************
05:44:55.776140 [/Devices/piix3ide/0/] (level 0)
05:44:55.776142   PCIBusNo <integer> = 0x0000000000000000 (0)
05:44:55.776145   PCIDeviceNo <integer> = 0x0000000000000001 (1)
05:44:55.776146   PCIFunctionNo <integer> = 0x0000000000000001 (1)
05:44:55.776147   Trusted <integer> = 0x0000000000000001 (1)
05:44:55.776148
05:44:55.776149 [/Devices/piix3ide/0/Config/] (level 1) (restricted root)
05:44:55.776151   Type <string> = "PIIX4" (cb=6)
05:44:55.776152
05:44:55.776153 [/Devices/piix3ide/0/Config/PrimaryMaster/] (level 2)
05:44:55.776155   NonRotationalMedium <integer> = 0x0000000000000000 (0)
05:44:55.776156
05:44:55.776156 [/Devices/piix3ide/0/LUN#0/] (level 1)
05:44:55.776158   Driver <string> = "Block" (cb=6)
05:44:55.776159
05:44:55.776159 [/Devices/piix3ide/0/LUN#0/AttachedDriver/] (level 2)
05:44:55.776161   Driver <string> = "VD" (cb=3)
05:44:55.776162
05:44:55.776163 [/Devices/piix3ide/0/LUN#0/AttachedDriver/Config/] (level 3) (restricted root)
05:44:55.776165   Format <string> = "VDI" (cb=4)
05:44:55.776166   Path <string> = "D:\ProgramData\BOINC\slots\9\boinc_fb76a72fc6655131\Snapshots\{650bac36-f84e-43c7-b30c-c8a078244a51}.vdi" (cb=105)
05:44:55.776167   SetupMerge <integer> = 0x0000000000000001 (1)
05:44:55.776168   Type <string> = "HardDisk" (cb=9)
05:44:55.776169
05:44:55.776170 [/Devices/piix3ide/0/LUN#0/AttachedDriver/Config/Parent/] (level 4)
05:44:55.776172   Format <string> = "VDI" (cb=4)
05:44:55.776173   MergeSource <integer> = 0x0000000000000001 (1)
05:44:55.776174   Path <string> = "D:\ProgramData\BOINC\slots\9\boinc_fb76a72fc6655131\Snapshots\{ed98fbf7-ab54-4f4c-97d9-ec954d59d419}.vdi" (cb=105)
05:44:55.776176
05:44:55.776176 [/Devices/piix3ide/0/LUN#0/AttachedDriver/Config/Parent/Parent/] (level 5)
05:44:55.776178   Format <string> = "VDI" (cb=4)
05:44:55.776179   MergeTarget <integer> = 0x0000000000000001 (1)
05:44:55.776180   Path <string> = "D:\ProgramData\BOINC\slots\9\vm_image.vdi" (cb=42)
05:44:55.776182
05:44:55.776182 [/Devices/piix3ide/0/LUN#0/Config/] (level 2) (restricted root)
05:44:55.776184   Mountable <integer> = 0x0000000000000000 (0)
05:44:55.776185   Type <string> = "HardDisk" (cb=9)
05:44:55.776186
05:44:55.776187 [/Devices/piix3ide/0/LUN#999/] (level 1)
05:44:55.776188   Driver <string> = "MainStatus" (cb=11)
05:44:55.776189
05:44:55.776190 [/Devices/piix3ide/0/LUN#999/Config/] (level 2) (restricted root)
05:44:55.776192   DeviceInstance <string> = "piix3ide/0" (cb=11)
05:44:55.776193   First <integer> = 0x0000000000000000 (0)
05:44:55.776194   Last <integer> = 0x0000000000000003 (3)
05:44:55.776196   pConsole <integer> = 0x0000000001cc8280 (30 179 968)
05:44:55.776198   papLeds <integer> = 0x0000000001cc8598 (30 180 760)
05:44:55.776200   pmapMediumAttachments <integer> = 0x0000000001cc88a0 (30 181 536)
05:44:55.776201
05:44:55.776202 ********************* End of CFGM dump **********************
05:44:55.776213 Changing the VM state from 'SUSPENDED' to 'RESUMING'.

Another user reported that after a host restart, stderr.txt grew at a normal rate again.

Regards
Christian

On 06.11.2013 16:59, Rom Walton wrote:

Lammert, what did you discover?

Christian, do you happen to know what kind of messages stderr.txt was filled with?

Vboxwrapper uses wall clock time internally. I'll see what I can find about the trickle messages.

----- Rom

From: Christian Beer
Sent: Wednesday, November 06, 2013 10:30 AM
To: BOINCDev Mailing List
Cc: Rom Walton; David Anderson (BOINC); Lammert van der Veen
Subject: ongoing problems with vboxwrapper

Hello,

we have been running version 26028 of the vboxwrapper for some time now, and I want to update you on some ongoing problems.
Some users reported that stderr.txt fills up with lots of error messages and grows to several GB. The user truncated the file, and I didn't see any unusual disk_size_limit_reached errors. So either this was an isolated incident or the file size doesn't matter.

Many users reported that the VM is still running after the BOINC client has been shut down. Lammert van der Veen did some research into the cause. I hope this can be fixed by limiting each host to one concurrent VM and by the 26031 wrapper, as soon as I upgrade our application.

Trickle messages worked fine with short tasks. Now that we have some longer tasks, the trickle-up messages have stopped; we haven't received any in over a month. I think I have an explanation for this. In the vboxwrapper, trickles are generated every X seconds of cpu_time rather than wall_clock_time. Because the vboxwrapper itself does very little work, its cpu_time increases very slowly. What I want is a trickle message every X hours of VM runtime! Please look into this asap, because without this feature I have to monitor the deadlines and extend them by hand.

In case it helps: a recent long-running task reported back with cpu_time=793517.4 and elapsed_time=839064.946827. I also have a task with cpu_time=4280.246 and elapsed_time=378984.324896. I can't see any trickle messages for either of them.

first: http://www.rnaworld.de/rnaworld/result.php?resultid=14920843 (Job Duration is always 0, Elapsed time is increasing)
second: http://www.rnaworld.de/rnaworld/result.php?resultid=14921349 (can't see anything in stderr)

Speaking of deadlines, it would also be great to be able to update the deadline on the client. I know of one user who edits his client_state.xml by hand to keep the client from going into high-priority mode for RNAWorld when there is no need.
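The scheduling Christian asks for sends one trickle every X hours of VM runtime, independent of the wrapper's own CPU time. It can be sketched as an elapsed-time check of the same shape as a periodic checkpoint timer. This is a hypothetical Python illustration, not vboxwrapper's code. The class and function names, the 60-second polling interval and the 12-hour period are all assumptions.

```python
class TrickleScheduler:
    """Decides when the next trickle is due, based on elapsed wall-clock time.

    Hypothetical sketch: period_s is the desired trickle interval and the
    caller passes the VM's elapsed wall-clock runtime, which keeps growing
    even while the wrapper process itself uses almost no CPU.
    """

    def __init__(self, period_s: float) -> None:
        if period_s <= 0:
            raise ValueError("period_s must be positive")
        self.period_s = period_s
        self.last_trickle_s = 0.0  # elapsed time at which the last trickle was sent

    def due(self, elapsed_s: float) -> bool:
        # Same shape as a periodic checkpoint check: has a full period
        # passed since the last event?
        return elapsed_s - self.last_trickle_s >= self.period_s

    def mark_sent(self, elapsed_s: float) -> None:
        self.last_trickle_s = elapsed_s


def count_trickles(total_s: float, period_s: float, poll_s: float = 60.0) -> int:
    """Simulate a polling loop that checks the scheduler every poll_s seconds
    over total_s seconds of runtime; return how many trickles it would send."""
    sched = TrickleScheduler(period_s)
    sent = 0
    for i in range(int(total_s // poll_s) + 1):
        now = i * poll_s  # integer multiples avoid accumulating float error
        if sched.due(now):
            sent += 1
            sched.mark_sent(now)
    return sent


if __name__ == "__main__":
    period = 12 * 3600  # hypothetical trickle period: 12 hours
    # Figures from the second task reported above.
    print("driven by elapsed time:", count_trickles(378984.324896, period))
    print("driven by CPU time:", count_trickles(4280.246, period))
```

The second task's figures show the difference. A 12-hour period measured in wall-clock time yields 8 trickles over elapsed_time=378984.324896 s. A check driven by cpu_time=4280.246 s never reaches the first 43200 s threshold. The first task's cpu_time is close to its elapsed_time, so CPU-time scheduling alone does not explain why that task sent no trickles either.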
Regards
Christian

_______________________________________________
boinc_dev mailing list
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
