Rom:

I'm testing the 26032 on RNAWorld and I just got this:

> 2013-11-09 08:23:12 (4106): vboxwrapper: starting
> 2013-11-09 08:23:12 (4106): Feature: Enabling trickle-ups (Interval: 14400.000000)
> 2013-11-09 08:23:12 (4106): Detected: VirtualBox 4.2.16_Debianr86992
> 2013-11-09 08:23:16 (4106): Restore from previously saved snapshot.
> 2013-11-09 08:23:18 (4106): Restore completed.
> 2013-11-09 08:23:18 (4106): Starting VM.
> 2013-11-09 08:23:31 (4106): Successfully started VM.
> 2013-11-09 08:23:31 (4106): Setting cpu throttle for VM. (100%)
> 2013-11-09 08:23:31 (4106): Setting network throttle for VM.
> 2013-11-09 08:38:26 (4106): Status Report: Job Duration: '0.000000', Elapsed Time: '10207.734515'
> 2013-11-09 09:47:41 (4106): Status Report: Trickle-Up Event.
> 2013-11-09 09:47:41 (4106): Sending Trickle-Up Event failed (-191).

-191 means ERR_NO_OPTION, which leads us back to a problem from 5 months ago (http://boinc.berkeley.edu/trac/changeset/655fd5e429442f574124b12ff0396d9df4d42d2f/boinc-v2/samples/vboxwrapper) which was fixed then.
You broke this with commit http://boinc.berkeley.edu/trac/changeset/4820cb5436cc98c15c39fac58e2220a1db8a8cc4/boinc-v2/samples/vboxwrapper, which again puts the boinc_init_options() call before boinc_options.handle_trickle_ups = true; is set. So we are never able to send trickle messages.

I would propose to change the init block at line 407 like this:

> memset(&boinc_options, 0, sizeof(boinc_options));
> boinc_options.main_program = true;
> boinc_options.check_heartbeat = true;
> boinc_options.handle_process_control = true;
> if (trickle_period > 0.0) {
>     boinc_options.handle_trickle_ups = true;
> }
> boinc_init_options(&boinc_options);

So we still get a printf message after starting but set the correct option before calling boinc_init_options().

Regards
Christian

On 08.11.2013 05:14, Rom Walton wrote:
>
> Christian,
>
> 26032 is up on http://boinc.berkeley.edu/dl/.
>
> It includes:
>
> *VBOX: Add logging in case of a trickle-up failure.*
>
> *VBOX: Adjust the VM process priority right before a suspend command to speed up how quickly the VM is suspended.*
>
> *----- Rom*
>
> *From:* Christian Beer [mailto:[email protected]]
> *Sent:* Thursday, November 07, 2013 6:45 AM
> *To:* Rom Walton
> *Cc:* BOINCDev Mailing List; David Anderson (BOINC); Lammert van der Veen
> *Subject:* Re: ongoing problems with vboxwrapper
>
> Hi Rom,
>
> can you please move the new trickle status printf down after boinc_send_trickle_up() and check the return value of this? This is more helpful in case the trickle didn't get sent and the reason is in the log. I will deploy this new version on RNAWorld and create some long running tasks then.
>
> @David: Is there an easy way for me to assign these results to specific users? I want to focus on users that I can contact in our forum to check on the stderr.txt during runtime.
>
> Regards
> Christian
>
> On 07.11.2013 00:53, Rom Walton wrote:
>
> I've posted 26031 to http://boinc.berkeley.edu/dl/.
>
> It contains the following changes:
>
> VBOX: Use the same technique for calculating when to report a trickle as we use for performing checkpoints.
>
> VBOX: Add a trickle-up status report entry to stderr.txt every time we send a trickle event.
>
> VBOX: Add VirtualBox 4.3.0 to bad builds list.
>
> VBOX: We only need to filter the vboxmanage output in one place.
>
> VBOX: Add additional check to determine if the get VM log command really failed.
>
> ----- Rom
>
> *From:* Christian Beer [mailto:[email protected]]
> *Sent:* Wednesday, November 06, 2013 11:29 AM
> *To:* Rom Walton
> *Cc:* BOINCDev Mailing List; David Anderson (BOINC); Lammert van der Veen
> *Subject:* Re: ongoing problems with vboxwrapper
>
> Possible explanation: the current control script is buggy and does not redirect the scientific app's stdout and stderr to files, so it ends up in the VM log. But this is happening on all tasks, not just on some.
>
> Regards
> Christian
>
> On 06.11.2013 17:24, Rom Walton wrote:
>
> Ah, so we are tripping up on the new code to check for EXIT_OUT_OF_MEMORY. Fun fun fun.
>
> Okay, I'll commit a change for this.
>
> I'm not sure why vboxmanage would be returning a non-zero exit status in this situation.
>
> ----- Rom
>
> *From:* Christian Beer [mailto:[email protected]]
> *Sent:* Wednesday, November 06, 2013 11:11 AM
> *To:* Rom Walton
> *Cc:* BOINCDev Mailing List; David Anderson (BOINC); Lammert van der Veen
> *Subject:* Re: ongoing problems with vboxwrapper
>
> The general pattern seems to be this (see the user's post: https://www.rechenkraft.net/forum/viewtopic.php?f=76&t=13059&start=180#p143176):
>
> 2013-10-17 11:21:40 (816): Creating new snapshot for VM.
> 2013-10-17 11:21:48 (816): Deleting stale snapshot.
> 2013-10-17 11:21:49 (816): Checkpoint completed.
> 2013-10-17 11:26:42 (816): Error in get vm log for VM: 3
> Arguments:
> showvminfo "boinc_fb76a72fc6655131" --log 0
> Output:
> VirtualBox VM 4.2.16 r86992 win.amd64 (Jul 4 2013 15:51:44) release log
> 00:00:00.042649 Log opened 2013-10-16T18:39:34.043413700Z
>
> followed by the actual VM Log (rather long) and this over and over:
>
> 05:44:49.794013 ********************* End of CFGM dump **********************
> 05:44:50.074438 Changing the VM state from 'SUSPENDED' to 'RESUMING'.
> 05:44:50.074495 Changing the VM state from 'RESUMING' to 'RUNNING'.
> 05:44:55.318224 Changing the VM state from 'RUNNING' to 'SUSPENDING'.
> 05:44:55.774983 PDMR3Suspend: 456 736 124 ns run time
> 05:44:55.775003 Changing the VM state from 'SUSPENDING' to 'SUSPENDED'.
> 05:44:55.775847 DrvBlock: Flushes will be ignored
> 05:44:55.775855 DrvBlock: Async flushes will be passed to the disk
> 05:44:55.776116 VD: Opening the disk took 236410 ns
> 05:44:55.776131 PIIX3 ATA: LUN#0: disk, PCHS=4161/16/63, total number of sectors 4194304
> 05:44:55.776139 ************************* CFGM dump *************************
> 05:44:55.776140 [/Devices/piix3ide/0/] (level 0)
> 05:44:55.776142 PCIBusNo <integer> = 0x0000000000000000 (0)
> 05:44:55.776145 PCIDeviceNo <integer> = 0x0000000000000001 (1)
> 05:44:55.776146 PCIFunctionNo <integer> = 0x0000000000000001 (1)
> 05:44:55.776147 Trusted <integer> = 0x0000000000000001 (1)
> 05:44:55.776148
> 05:44:55.776149 [/Devices/piix3ide/0/Config/] (level 1) (restricted root)
> 05:44:55.776151 Type <string> = "PIIX4" (cb=6)
> 05:44:55.776152
> 05:44:55.776153 [/Devices/piix3ide/0/Config/PrimaryMaster/] (level 2)
> 05:44:55.776155 NonRotationalMedium <integer> = 0x0000000000000000 (0)
> 05:44:55.776156
> 05:44:55.776156 [/Devices/piix3ide/0/LUN#0/] (level 1)
> 05:44:55.776158 Driver <string> = "Block" (cb=6)
> 05:44:55.776159
> 05:44:55.776159 [/Devices/piix3ide/0/LUN#0/AttachedDriver/] (level 2)
> 05:44:55.776161 Driver <string> = "VD" (cb=3)
> 05:44:55.776162
> 05:44:55.776163 [/Devices/piix3ide/0/LUN#0/AttachedDriver/Config/] (level 3) (restricted root)
> 05:44:55.776165 Format <string> = "VDI" (cb=4)
> 05:44:55.776166 Path <string> = "D:\ProgramData\BOINC\slots\9\boinc_fb76a72fc6655131\Snapshots\{650bac36-f84e-43c7-b30c-c8a078244a51}.vdi" (cb=105)
> 05:44:55.776167 SetupMerge <integer> = 0x0000000000000001 (1)
> 05:44:55.776168 Type <string> = "HardDisk" (cb=9)
> 05:44:55.776169
> 05:44:55.776170 [/Devices/piix3ide/0/LUN#0/AttachedDriver/Config/Parent/] (level 4)
> 05:44:55.776172 Format <string> = "VDI" (cb=4)
> 05:44:55.776173 MergeSource <integer> = 0x0000000000000001 (1)
> 05:44:55.776174 Path <string> = "D:\ProgramData\BOINC\slots\9\boinc_fb76a72fc6655131\Snapshots\{ed98fbf7-ab54-4f4c-97d9-ec954d59d419}.vdi" (cb=105)
> 05:44:55.776176
> 05:44:55.776176 [/Devices/piix3ide/0/LUN#0/AttachedDriver/Config/Parent/Parent/] (level 5)
> 05:44:55.776178 Format <string> = "VDI" (cb=4)
> 05:44:55.776179 MergeTarget <integer> = 0x0000000000000001 (1)
> 05:44:55.776180 Path <string> = "D:\ProgramData\BOINC\slots\9\vm_image.vdi" (cb=42)
> 05:44:55.776182
> 05:44:55.776182 [/Devices/piix3ide/0/LUN#0/Config/] (level 2) (restricted root)
> 05:44:55.776184 Mountable <integer> = 0x0000000000000000 (0)
> 05:44:55.776185 Type <string> = "HardDisk" (cb=9)
> 05:44:55.776186
> 05:44:55.776187 [/Devices/piix3ide/0/LUN#999/] (level 1)
> 05:44:55.776188 Driver <string> = "MainStatus" (cb=11)
> 05:44:55.776189
> 05:44:55.776190 [/Devices/piix3ide/0/LUN#999/Config/] (level 2) (restricted root)
> 05:44:55.776192 DeviceInstance <string> = "piix3ide/0" (cb=11)
> 05:44:55.776193 First <integer> = 0x0000000000000000 (0)
> 05:44:55.776194 Last <integer> = 0x0000000000000003 (3)
> 05:44:55.776196 pConsole <integer> = 0x0000000001cc8280 (30 179 968)
> 05:44:55.776198 papLeds <integer> = 0x0000000001cc8598 (30 180 760)
> 05:44:55.776200 pmapMediumAttachments <integer> = 0x0000000001cc88a0 (30 181 536)
> 05:44:55.776201
> 05:44:55.776202 ********************* End of CFGM dump **********************
> 05:44:55.776213 Changing the VM state from 'SUSPENDED' to 'RESUMING'.
>
> Another user reported that after a host restart the growth of stderr.txt was normal again.
>
> Regards
> Christian
>
> On 06.11.2013 16:59, Rom Walton wrote:
>
> Lammert, what did you discover?
>
> Christian, do you happen to know what kind of messages stderr.txt was filled with?
>
> Vboxwrapper uses wall clock time internally. I'll see what I can find about the trickle messages.
>
> ----- Rom
>
> *From:* Christian Beer [mailto:[email protected]]
> *Sent:* Wednesday, November 06, 2013 10:30 AM
> *To:* BOINCDev Mailing List
> *Cc:* Rom Walton; David Anderson (BOINC); Lammert van der Veen
> *Subject:* ongoing problems with vboxwrapper
>
> Hello,
>
> we have been running the 26028 version of the vboxwrapper for some time now and I want to update you on some ongoing problems.
>
> Some users reported that the stderr.txt is filled with lots of error messages and the file size increases to several GB. The file was truncated by the user and I didn't see any unusual disk_size_limit_reached errors. So either this was an isolated incident or the file size doesn't matter.
>
> Many users reported that the VM is still running after the BOINC Client was shut down. Lammert van der Veen did some research into the cause and I hope this can be fixed by limiting each host to one concurrent VM and by the 26031 wrapper as soon as I upgrade our application.
>
> Trickle messages were working fine when running with short tasks. Now that we have some longer tasks the trickle-up messages stopped. We didn't receive any in over a month. I think I have an explanation for this:
> In the vboxwrapper the trickles are generated every X seconds of cpu_time and not wall_clock_time, and as the vboxwrapper is not doing much the cpu_time increases very slowly.
> What I want is a trickle message every X hours of VM runtime! Please look into this asap because without this feature I have to monitor the deadlines and extend them by hand.
> In case it is helpful:
> A recent long running task reported back with cpu_time=793517.4 and elapsed_time=839064.946827, but I also have a task with cpu_time=4280.246 and elapsed_time=378984.324896. I can't see any trickle messages for either of them.
> first: http://www.rnaworld.de/rnaworld/result.php?resultid=14920843 (Job Duration is always 0, Elapsed time is increasing)
> second: http://www.rnaworld.de/rnaworld/result.php?resultid=14921349 (can't see anything in stderr)
>
> Speaking of deadlines, it would also be great to update the deadline on the client. I know of one user who updates his client_state.xml by hand to prevent the Client from going into high priority mode for RNAWorld when there is no need.
>
> Regards
> Christian
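
The CPU-time versus wall-clock hypothesis from the original report can be illustrated with a small sketch. This is not vboxwrapper's code: the helper names `trickles_due` and `trickle_due` are hypothetical, the 14400-second period is the interval shown in the stderr log quoted in this thread, and it is only an assumption that the reported cpu_time is the clock that would drive a CPU-based timer.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of vboxwrapper): how many whole trickle
// periods fit into `seconds` of whichever clock drives the timer.
long trickles_due(double seconds, double period) {
    assert(period > 0.0);
    return static_cast<long>(std::floor(seconds / period));
}

// Hypothetical helper: true once at least one full period has passed on the
// chosen clock since the last trickle was sent.
bool trickle_due(double now, double last_trickle, double period) {
    assert(period > 0.0);
    return (now - last_trickle) >= period;
}
```

With the second task's figures, `trickles_due(4280.246, 14400.0)` is 0 while `trickles_due(378984.324896, 14400.0)` is 26, so a timer driven by the slow clock would not have fired once where a wall-clock timer would have fired 26 times. This only shows why the choice of clock matters; it does not by itself explain the missing trickles, since the thread later traces the failure to error -191.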
