Hi Rom,

can you please move the new trickle status printf down to after the boinc_send_trickle_up() call and check that call's return value? That is more helpful: if the trickle didn't get sent, the reason will be in the log. I will then deploy this new version on RNAWorld and create some long-running tasks.
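Roughly what I have in mind, as a minimal sketch only: `send_and_log_trickle` and `TrickleSender` are hypothetical names, the log wording is made up, and the `send` argument merely stands in for boinc_send_trickle_up(), which I am assuming returns 0 on success. It is not the actual vboxwrapper code.

```cpp
#include <cassert>
#include <cstdio>
#include <functional>
#include <string>

// Stand-in for the real call (assumption: in vboxwrapper this would be
// boinc_send_trickle_up(), returning 0 on success).
using TrickleSender = std::function<int(const std::string& variety,
                                        const std::string& text)>;

// Hypothetical helper: send first, then log the outcome together with
// the return value, so a failed trickle shows up in stderr.txt.
int send_and_log_trickle(const TrickleSender& send,
                         const std::string& variety,
                         const std::string& text) {
    assert(send);  // a sender must be provided
    int retval = send(variety, text);
    if (retval) {
        std::fprintf(stderr, "Trickle-up NOT sent (variety %s, retval %d)\n",
                     variety.c_str(), retval);
    } else {
        std::fprintf(stderr, "Trickle-up sent (variety %s)\n",
                     variety.c_str());
    }
    return retval;
}
```

The point is only the ordering: the status line is written after the send, so it reflects what really happened instead of what was attempted.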
@David: Is there an easy way for me to assign these results to specific users? I want to focus on users that I can contact in our forum to check the stderr.txt during runtime.

Regards
Christian

On 07.11.2013 00:53, Rom Walton wrote:
> I've posted 26031 to http://boinc.berkeley.edu/dl/.
>
> It contains the following changes:
>
> VBOX: Use the same technique for calculating when to report a trickle as we use for performing checkpoints.
>
> VBOX: Add a trickle-up status report entry to stderr.txt every time we send a trickle event.
>
> VBOX: Add VirtualBox 4.3.0 to bad builds list.
>
> VBOX: We only need to filter the vboxmanage output in one place.
>
> VBOX: Add additional check to determine if the get VM log command really failed.
>
> ----- Rom
>
> *From:* Christian Beer [mailto:[email protected]]
> *Sent:* Wednesday, November 06, 2013 11:29 AM
> *To:* Rom Walton
> *Cc:* BOINCDev Mailing List; David Anderson (BOINC); Lammert van der Veen
> *Subject:* Re: ongoing problems with vboxwrapper
>
> Possible explanation: the current control script is buggy and does not redirect the scientific app's stdout and stderr to files, so the output ends up in the VM log. But this is happening on all tasks, not just on some.
>
> Regards
> Christian
>
> On 06.11.2013 17:24, Rom Walton wrote:
>
> Ah, so we are tripping up on the new code to check for EXIT_OUT_OF_MEMORY. Fun fun fun.
>
> Okay, I'll commit a change for this.
>
> I'm not sure why vboxmanage would be returning a non-zero exit status in this situation.
>
> ----- Rom
>
> *From:* Christian Beer [mailto:[email protected]]
> *Sent:* Wednesday, November 06, 2013 11:11 AM
> *To:* Rom Walton
> *Cc:* BOINCDev Mailing List; David Anderson (BOINC); Lammert van der Veen
> *Subject:* Re: ongoing problems with vboxwrapper
>
> The general pattern seems to be this (see the user's post:
> https://www.rechenkraft.net/forum/viewtopic.php?f=76&t=13059&start=180#p143176):
>
> 2013-10-17 11:21:40 (816): Creating new snapshot for VM.
> 2013-10-17 11:21:48 (816): Deleting stale snapshot.
> 2013-10-17 11:21:49 (816): Checkpoint completed.
> 2013-10-17 11:26:42 (816): Error in get vm log for VM: 3
> Arguments:
> showvminfo "boinc_fb76a72fc6655131" --log 0
> Output:
> VirtualBox VM 4.2.16 r86992 win.amd64 (Jul 4 2013 15:51:44) release log
> 00:00:00.042649 Log opened 2013-10-16T18:39:34.043413700Z
>
> followed by the actual VM log (rather long) and this over and over:
>
> 05:44:49.794013 ********************* End of CFGM dump **********************
> 05:44:50.074438 Changing the VM state from 'SUSPENDED' to 'RESUMING'.
> 05:44:50.074495 Changing the VM state from 'RESUMING' to 'RUNNING'.
> 05:44:55.318224 Changing the VM state from 'RUNNING' to 'SUSPENDING'.
> 05:44:55.774983 PDMR3Suspend: 456 736 124 ns run time
> 05:44:55.775003 Changing the VM state from 'SUSPENDING' to 'SUSPENDED'.
> 05:44:55.775847 DrvBlock: Flushes will be ignored
> 05:44:55.775855 DrvBlock: Async flushes will be passed to the disk
> 05:44:55.776116 VD: Opening the disk took 236410 ns
> 05:44:55.776131 PIIX3 ATA: LUN#0: disk, PCHS=4161/16/63, total number of sectors 4194304
> 05:44:55.776139 ************************* CFGM dump *************************
> 05:44:55.776140 [/Devices/piix3ide/0/] (level 0)
> 05:44:55.776142 PCIBusNo <integer> = 0x0000000000000000 (0)
> 05:44:55.776145 PCIDeviceNo <integer> = 0x0000000000000001 (1)
> 05:44:55.776146 PCIFunctionNo <integer> = 0x0000000000000001 (1)
> 05:44:55.776147 Trusted <integer> = 0x0000000000000001 (1)
> 05:44:55.776148
> 05:44:55.776149 [/Devices/piix3ide/0/Config/] (level 1) (restricted root)
> 05:44:55.776151 Type <string> = "PIIX4" (cb=6)
> 05:44:55.776152
> 05:44:55.776153 [/Devices/piix3ide/0/Config/PrimaryMaster/] (level 2)
> 05:44:55.776155 NonRotationalMedium <integer> = 0x0000000000000000 (0)
> 05:44:55.776156
> 05:44:55.776156 [/Devices/piix3ide/0/LUN#0/] (level 1)
> 05:44:55.776158 Driver <string> = "Block" (cb=6)
> 05:44:55.776159
> 05:44:55.776159 [/Devices/piix3ide/0/LUN#0/AttachedDriver/] (level 2)
> 05:44:55.776161 Driver <string> = "VD" (cb=3)
> 05:44:55.776162
> 05:44:55.776163 [/Devices/piix3ide/0/LUN#0/AttachedDriver/Config/] (level 3) (restricted root)
> 05:44:55.776165 Format <string> = "VDI" (cb=4)
> 05:44:55.776166 Path <string> = "D:\ProgramData\BOINC\slots\9\boinc_fb76a72fc6655131\Snapshots\{650bac36-f84e-43c7-b30c-c8a078244a51}.vdi" (cb=105)
> 05:44:55.776167 SetupMerge <integer> = 0x0000000000000001 (1)
> 05:44:55.776168 Type <string> = "HardDisk" (cb=9)
> 05:44:55.776169
> 05:44:55.776170 [/Devices/piix3ide/0/LUN#0/AttachedDriver/Config/Parent/] (level 4)
> 05:44:55.776172 Format <string> = "VDI" (cb=4)
> 05:44:55.776173 MergeSource <integer> = 0x0000000000000001 (1)
> 05:44:55.776174 Path <string> = "D:\ProgramData\BOINC\slots\9\boinc_fb76a72fc6655131\Snapshots\{ed98fbf7-ab54-4f4c-97d9-ec954d59d419}.vdi" (cb=105)
> 05:44:55.776176
> 05:44:55.776176 [/Devices/piix3ide/0/LUN#0/AttachedDriver/Config/Parent/Parent/] (level 5)
> 05:44:55.776178 Format <string> = "VDI" (cb=4)
> 05:44:55.776179 MergeTarget <integer> = 0x0000000000000001 (1)
> 05:44:55.776180 Path <string> = "D:\ProgramData\BOINC\slots\9\vm_image.vdi" (cb=42)
> 05:44:55.776182
> 05:44:55.776182 [/Devices/piix3ide/0/LUN#0/Config/] (level 2) (restricted root)
> 05:44:55.776184 Mountable <integer> = 0x0000000000000000 (0)
> 05:44:55.776185 Type <string> = "HardDisk" (cb=9)
> 05:44:55.776186
> 05:44:55.776187 [/Devices/piix3ide/0/LUN#999/] (level 1)
> 05:44:55.776188 Driver <string> = "MainStatus" (cb=11)
> 05:44:55.776189
> 05:44:55.776190 [/Devices/piix3ide/0/LUN#999/Config/] (level 2) (restricted root)
> 05:44:55.776192 DeviceInstance <string> = "piix3ide/0" (cb=11)
> 05:44:55.776193 First <integer> = 0x0000000000000000 (0)
> 05:44:55.776194 Last <integer> = 0x0000000000000003 (3)
> 05:44:55.776196 pConsole <integer> = 0x0000000001cc8280 (30 179 968)
> 05:44:55.776198 papLeds <integer> = 0x0000000001cc8598 (30 180 760)
> 05:44:55.776200 pmapMediumAttachments <integer> = 0x0000000001cc88a0 (30 181 536)
> 05:44:55.776201
> 05:44:55.776202 ********************* End of CFGM dump **********************
> 05:44:55.776213 Changing the VM state from 'SUSPENDED' to 'RESUMING'.
>
> Another user reported that after a host restart the growth of stderr.txt was normal again.
>
> Regards
> Christian
>
> On 06.11.2013 16:59, Rom Walton wrote:
>
> Lammert, what did you discover?
>
> Christian, do you happen to know what kind of messages stderr.txt was filled with?
>
> Vboxwrapper uses wall clock time internally. I'll see what I can find about the trickle messages.
>
> ----- Rom
>
> *From:* Christian Beer [mailto:[email protected]]
> *Sent:* Wednesday, November 06, 2013 10:30 AM
> *To:* BOINCDev Mailing List
> *Cc:* Rom Walton; David Anderson (BOINC); Lammert van der Veen
> *Subject:* ongoing problems with vboxwrapper
>
> Hello,
>
> we have been running the 26028 version of the vboxwrapper for some time now and I want to update you on some ongoing problems.
>
> Some users reported that the stderr.txt is filled with lots of error messages and that the file size increases to several GB. The file was truncated by the user and I didn't see any unusual disk_size_limit_reached errors. So either this was an isolated incident or the file size doesn't matter.
>
> Many users reported that the VM is still running after the BOINC Client was shut down. Lammert van der Veen did some research into the cause, and I hope this can be fixed by limiting each host to one concurrent VM and by the 26031 wrapper, as soon as I upgrade our application.
>
> Trickle messages were working fine when running with short tasks. Now that we have some longer tasks, the trickle-up messages have stopped. We haven't received any in over a month. I think I have an explanation for this:
> In the vboxwrapper the trickles are generated every X seconds of cpu_time and not of wall_clock_time, and as the vboxwrapper is not doing much, the cpu_time increases very slowly. What I want is a trickle message every X hours of VM runtime! Please look into this asap, because without this feature I have to monitor the deadlines and extend them by hand.
> In case it is helpful:
> A recent long-running task reported back with cpu_time=793517.4 and elapsed_time=839064.946827, but I also have a task with cpu_time=4280.246 and elapsed_time=378984.324896. I can't see any trickle messages for either of them.
> first:
> http://www.rnaworld.de/rnaworld/result.php?resultid=14920843
> (Job Duration is always 0, Elapsed time is increasing)
> second:
> http://www.rnaworld.de/rnaworld/result.php?resultid=14921349
> (can't see anything in stderr)
>
> Speaking of deadlines, it would also be great to update the deadline on the client. I know of one user who updates his client_state.xml by hand to prevent the Client from going into high priority mode for RNAWorld when there is no need.
>
> Regards
> Christian
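
For reference, the "trickle every X hours of VM runtime" behaviour requested in the thread could be sketched as below. This is an illustration under assumptions, not the wrapper's code: `TrickleTimer`, `period_secs` and `due()` are hypothetical names, and the caller is assumed to pass in the task's elapsed (wall-clock) run time in seconds. With a hypothetical one-hour period, the second task quoted above (cpu_time=4280.246, elapsed_time=378984.324896) crosses the threshold once when measured in CPU time but about 105 times when measured in elapsed time.

```cpp
#include <cassert>

// Hypothetical helper: decides whether a trickle is due, based on
// elapsed (wall-clock) run time rather than CPU time.
struct TrickleTimer {
    double period_secs;         // report once per this many elapsed seconds
    double last_report = 0.0;   // elapsed time at the last report

    explicit TrickleTimer(double period) : period_secs(period) {
        assert(period > 0);     // a non-positive period makes no sense
    }

    // Returns true (and remembers the time) when a report is due.
    bool due(double elapsed_secs) {
        if (elapsed_secs - last_report < period_secs) return false;
        last_report = elapsed_secs;
        return true;
    }
};
```

A caller would check `due()` from its main polling loop and only send the trickle when it returns true.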
