I've posted 26035 which has the fix you suggested.

 

Sorry about that.

 

----- Rom

 

From: Christian Beer [mailto:[email protected]] 
Sent: Saturday, November 09, 2013 4:13 AM
To: Rom Walton
Cc: BOINCDev Mailing List; David Anderson (BOINC); Lammert van der Veen
Subject: Re: ongoing problems with vboxwrapper (NO_OPTION bug)

 

Rom:

I'm testing the 26032 on RNAWorld and I just got this:



2013-11-09 08:23:12 (4106): vboxwrapper: starting
2013-11-09 08:23:12 (4106): Feature: Enabling trickle-ups (Interval: 14400.000000)
2013-11-09 08:23:12 (4106): Detected: VirtualBox 4.2.16_Debianr86992
2013-11-09 08:23:16 (4106): Restore from previously saved snapshot.
2013-11-09 08:23:18 (4106): Restore completed.
2013-11-09 08:23:18 (4106): Starting VM.
2013-11-09 08:23:31 (4106): Successfully started VM.
2013-11-09 08:23:31 (4106): Setting cpu throttle for VM. (100%)
2013-11-09 08:23:31 (4106): Setting network throttle for VM.
2013-11-09 08:38:26 (4106): Status Report: Job Duration: '0.000000', Elapsed Time: '10207.734515'
2013-11-09 09:47:41 (4106): Status Report: Trickle-Up Event.
2013-11-09 09:47:41 (4106): Sending Trickle-Up Event failed (-191).

-191 means ERR_NO_OPTION, which leads us back to a problem from 5 months ago
(http://boinc.berkeley.edu/trac/changeset/655fd5e429442f574124b12ff0396d9df4d42d2f/boinc-v2/samples/vboxwrapper)
that was fixed back then.

You broke this with commit
http://boinc.berkeley.edu/trac/changeset/4820cb5436cc98c15c39fac58e2220a1db8a8cc4/boinc-v2/samples/vboxwrapper
which again calls boinc_init_options() before setting
boinc_options.handle_trickle_ups = true. So we are never able to
send trickle messages.

I propose changing the init block at line 407 like this:



    memset(&boinc_options, 0, sizeof(boinc_options));
    boinc_options.main_program = true;
    boinc_options.check_heartbeat = true;
    boinc_options.handle_process_control = true;
    // Enable trickle-ups before boinc_init_options() so the option takes effect.
    if (trickle_period > 0.0) {
        boinc_options.handle_trickle_ups = true;
    }
    boinc_init_options(&boinc_options);


That way we still get the printf message after starting, but the correct
option is set before boinc_init_options() is called.
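
To illustrate why the order matters, here is a minimal sketch with mock
functions. It assumes that the init call takes a snapshot of the options, so
later changes to the struct go unnoticed, which is what the -191 result
suggests. MockOptions, mock_init_options() and mock_send_trickle_up() are
hypothetical stand-ins, not the real BOINC API.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-ins for the BOINC API, used only to illustrate ordering.
// They are NOT the real BOINC_OPTIONS / boinc_init_options() definitions.
constexpr int kErrNoOption = -191;  // value reported in the log in this thread

struct MockOptions {
    bool main_program;
    bool check_heartbeat;
    bool handle_process_control;
    bool handle_trickle_ups;
};

// Snapshot of the options taken at init time (assumed behaviour).
static MockOptions g_active_options;

int mock_init_options(const MockOptions* opts) {
    assert(opts != nullptr);
    g_active_options = *opts;  // later changes to *opts are not seen
    return 0;
}

int mock_send_trickle_up() {
    // Refuse to send when trickle handling was not enabled at init time.
    return g_active_options.handle_trickle_ups ? 0 : kErrNoOption;
}

// Broken order: the flag is set after init, so the snapshot misses it.
int send_with_flag_set_after_init(double trickle_period) {
    MockOptions opts;
    std::memset(&opts, 0, sizeof(opts));
    opts.main_program = true;
    mock_init_options(&opts);
    if (trickle_period > 0.0) {
        opts.handle_trickle_ups = true;  // too late, init already copied opts
    }
    return mock_send_trickle_up();
}

// Fixed order: the flag is set first, then init copies it.
int send_with_flag_set_before_init(double trickle_period) {
    MockOptions opts;
    std::memset(&opts, 0, sizeof(opts));
    opts.main_program = true;
    if (trickle_period > 0.0) {
        opts.handle_trickle_ups = true;
    }
    mock_init_options(&opts);
    return mock_send_trickle_up();
}
```

With a trickle period of 14400 seconds, the first variant returns -191 and the
second returns 0.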

Regards
Christian


On 08.11.2013 05:14, Rom Walton wrote:

        Christian,

         

        26032 is up on http://boinc.berkeley.edu/dl/.

         

        It includes:

        VBOX: Add logging in case of a trickle-up failure.

        VBOX: Adjust the VM process priority right before a suspend
        command to speed up how quickly the VM is suspended.

         

        ----- Rom

         

        From: Christian Beer [mailto:[email protected]] 
        Sent: Thursday, November 07, 2013 6:45 AM
        To: Rom Walton
        Cc: BOINCDev Mailing List; David Anderson (BOINC); Lammert van der Veen
        Subject: Re: ongoing problems with vboxwrapper

         

        Hi Rom,
        
        can you please move the new trickle status printf down after
        boinc_send_trickle_up() and check its return value? That is more
        helpful when a trickle doesn't get sent, because the reason then
        shows up in the log. I will deploy this new version on RNAWorld and
        then create some long-running tasks.
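
A minimal sketch of that ordering, assuming a send function that returns 0 on
success and a non-zero error code otherwise. TrickleSender, report_trickle()
and the two fake senders are hypothetical names for illustration; the real
call in vboxwrapper is boinc_send_trickle_up(), whose arguments are not shown
in this thread. The message texts are modelled on the 26032 log output quoted
in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical sender type standing in for the real trickle-up call.
using TrickleSender = int (*)();

// Builds the status line only after the send attempt, so the log records
// the outcome of the call instead of just the intent to send.
std::string report_trickle(TrickleSender send) {
    assert(send != nullptr);
    const int retval = send();
    if (retval == 0) {
        return "Status Report: Trickle-Up Event.";
    }
    return "Sending Trickle-Up Event failed (" + std::to_string(retval) + ").";
}

// Two fake senders for demonstration purposes only.
int fake_send_ok() { return 0; }
int fake_send_no_option() { return -191; }
```

Passing fake_send_no_option yields a line that carries the error code, so the
reason for a missing trickle is visible in stderr.txt.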
        
        @David: Is there an easy way for me to assign these results to
        specific users? I want to focus on users I can contact in our forum
        so they can check stderr.txt during runtime.
        
        Regards
        Christian
        
        On 07.11.2013 00:53, Rom Walton wrote:

                I've posted 26031 to http://boinc.berkeley.edu/dl/.

                 

                It contains the following changes:

                VBOX: Use the same technique for calculating when to
                report a trickle as we use for performing checkpoints.

                VBOX: Add a trickle-up status report entry to stderr.txt
                every time we send a trickle event.

                VBOX: Add VirtualBox 4.3.0 to bad builds list.

                VBOX: We only need to filter the vboxmanage output in
                one place.

                VBOX: Add additional check to determine if the get VM
                log command really failed.

                 

                ----- Rom

                 

                From: Christian Beer [mailto:[email protected]] 
                Sent: Wednesday, November 06, 2013 11:29 AM
                To: Rom Walton
                Cc: BOINCDev Mailing List; David Anderson (BOINC); Lammert van der Veen
                Subject: Re: ongoing problems with vboxwrapper

                 

                Possible explanation: the current control script is buggy
                and does not redirect the scientific app's stdout and stderr
                to files, so the output ends up in the VM log. But this
                happens on all tasks, not just on some.
                
                Regards
                Christian
                
                On 06.11.2013 17:24, Rom Walton wrote:

                        Ah, so we are tripping up on the new code to
                        check for EXIT_OUT_OF_MEMORY. Fun fun fun.

                         

                        Okay, I'll commit a change for this.

                         

                        I'm not sure why vboxmanage would be returning a
                        non-zero exit status in this situation.

                         

                        ----- Rom

                         

                        From: Christian Beer [mailto:[email protected]]
                        Sent: Wednesday, November 06, 2013 11:11 AM
                        To: Rom Walton
                        Cc: BOINCDev Mailing List; David Anderson (BOINC); Lammert van der Veen
                        Subject: Re: ongoing problems with vboxwrapper

                         

                        The general pattern seems to be this (see the user's post:
                        https://www.rechenkraft.net/forum/viewtopic.php?f=76&t=13059&start=180#p143176):

                        2013-10-17 11:21:40 (816): Creating new snapshot for VM.
                        2013-10-17 11:21:48 (816): Deleting stale snapshot.
                        2013-10-17 11:21:49 (816): Checkpoint completed.
                        2013-10-17 11:26:42 (816): Error in get vm log for VM: 3
                        Arguments:
                        showvminfo "boinc_fb76a72fc6655131" --log 0
                        Output:
                        VirtualBox VM 4.2.16 r86992 win.amd64 (Jul  4 2013 15:51:44) release log
                        00:00:00.042649 Log opened 2013-10-16T18:39:34.043413700Z

                        followed by the actual VM Log (rather long) and this over and over:

                        05:44:49.794013 ********************* End of CFGM dump **********************
                        05:44:50.074438 Changing the VM state from 'SUSPENDED' to 'RESUMING'.
                        05:44:50.074495 Changing the VM state from 'RESUMING' to 'RUNNING'.
                        05:44:55.318224 Changing the VM state from 'RUNNING' to 'SUSPENDING'.
                        05:44:55.774983 PDMR3Suspend: 456 736 124 ns run time
                        05:44:55.775003 Changing the VM state from 'SUSPENDING' to 'SUSPENDED'.
                        05:44:55.775847 DrvBlock: Flushes will be ignored
                        05:44:55.775855 DrvBlock: Async flushes will be passed to the disk
                        05:44:55.776116 VD: Opening the disk took 236410 ns
                        05:44:55.776131 PIIX3 ATA: LUN#0: disk, PCHS=4161/16/63, total number of sectors 4194304
                        05:44:55.776139 ************************* CFGM dump *************************
                        05:44:55.776140 [/Devices/piix3ide/0/] (level 0)
                        05:44:55.776142   PCIBusNo      <integer> = 0x0000000000000000 (0)
                        05:44:55.776145   PCIDeviceNo   <integer> = 0x0000000000000001 (1)
                        05:44:55.776146   PCIFunctionNo <integer> = 0x0000000000000001 (1)
                        05:44:55.776147   Trusted       <integer> = 0x0000000000000001 (1)
                        05:44:55.776148
                        05:44:55.776149 [/Devices/piix3ide/0/Config/] (level 1) (restricted root)
                        05:44:55.776151   Type <string>  = "PIIX4" (cb=6)
                        05:44:55.776152
                        05:44:55.776153 [/Devices/piix3ide/0/Config/PrimaryMaster/] (level 2)
                        05:44:55.776155   NonRotationalMedium <integer> = 0x0000000000000000 (0)
                        05:44:55.776156
                        05:44:55.776156 [/Devices/piix3ide/0/LUN#0/] (level 1)
                        05:44:55.776158   Driver <string>  = "Block" (cb=6)
                        05:44:55.776159
                        05:44:55.776159 [/Devices/piix3ide/0/LUN#0/AttachedDriver/] (level 2)
                        05:44:55.776161   Driver <string>  = "VD" (cb=3)
                        05:44:55.776162
                        05:44:55.776163 [/Devices/piix3ide/0/LUN#0/AttachedDriver/Config/] (level 3) (restricted root)
                        05:44:55.776165   Format     <string>  = "VDI" (cb=4)
                        05:44:55.776166   Path       <string>  = "D:\ProgramData\BOINC\slots\9\boinc_fb76a72fc6655131\Snapshots\{650bac36-f84e-43c7-b30c-c8a078244a51}.vdi" (cb=105)
                        05:44:55.776167   SetupMerge <integer> = 0x0000000000000001 (1)
                        05:44:55.776168   Type       <string>  = "HardDisk" (cb=9)
                        05:44:55.776169
                        05:44:55.776170 [/Devices/piix3ide/0/LUN#0/AttachedDriver/Config/Parent/] (level 4)
                        05:44:55.776172   Format      <string>  = "VDI" (cb=4)
                        05:44:55.776173   MergeSource <integer> = 0x0000000000000001 (1)
                        05:44:55.776174   Path        <string>  = "D:\ProgramData\BOINC\slots\9\boinc_fb76a72fc6655131\Snapshots\{ed98fbf7-ab54-4f4c-97d9-ec954d59d419}.vdi" (cb=105)
                        05:44:55.776176
                        05:44:55.776176 [/Devices/piix3ide/0/LUN#0/AttachedDriver/Config/Parent/Parent/] (level 5)
                        05:44:55.776178   Format      <string>  = "VDI" (cb=4)
                        05:44:55.776179   MergeTarget <integer> = 0x0000000000000001 (1)
                        05:44:55.776180   Path        <string>  = "D:\ProgramData\BOINC\slots\9\vm_image.vdi" (cb=42)
                        05:44:55.776182
                        05:44:55.776182 [/Devices/piix3ide/0/LUN#0/Config/] (level 2) (restricted root)
                        05:44:55.776184   Mountable <integer> = 0x0000000000000000 (0)
                        05:44:55.776185   Type      <string>  = "HardDisk" (cb=9)
                        05:44:55.776186
                        05:44:55.776187 [/Devices/piix3ide/0/LUN#999/] (level 1)
                        05:44:55.776188   Driver <string>  = "MainStatus" (cb=11)
                        05:44:55.776189
                        05:44:55.776190 [/Devices/piix3ide/0/LUN#999/Config/] (level 2) (restricted root)
                        05:44:55.776192   DeviceInstance        <string>  = "piix3ide/0" (cb=11)
                        05:44:55.776193   First                 <integer> = 0x0000000000000000 (0)
                        05:44:55.776194   Last                  <integer> = 0x0000000000000003 (3)
                        05:44:55.776196   pConsole              <integer> = 0x0000000001cc8280 (30 179 968)
                        05:44:55.776198   papLeds               <integer> = 0x0000000001cc8598 (30 180 760)
                        05:44:55.776200   pmapMediumAttachments <integer> = 0x0000000001cc88a0 (30 181 536)
                        05:44:55.776201
                        05:44:55.776202 ********************* End of CFGM dump **********************
                        05:44:55.776213 Changing the VM state from 'SUSPENDED' to 'RESUMING'.

                        Another user reported that stderr.txt grew at a
                        normal rate again after a host restart.
                        
                        Regards
                        Christian
                        
                        On 06.11.2013 16:59, Rom Walton wrote:

                                Lammert, what did you discover?

                                 

                                Christian, do you happen to know what
                                kind of messages stderr.txt was filled with?

                                 

                                Vboxwrapper uses wall clock time
                                internally.  I'll see what I can find about the trickle messages.

                                 

                                ----- Rom

                                 

                                From: Christian Beer [mailto:[email protected]]
                                Sent: Wednesday, November 06, 2013 10:30 AM
                                To: BOINCDev Mailing List
                                Cc: Rom Walton; David Anderson (BOINC); Lammert van der Veen
                                Subject: ongoing problems with vboxwrapper

                                 

                                Hello,
                                
                                we have been running the 26028 version of
                                vboxwrapper for some time now, and I want to
                                update you on some ongoing problems.
                                
                                Some users reported that stderr.txt fills up
                                with lots of error messages and grows to
                                several GB. The file was truncated by the
                                user, and I didn't see any unusual
                                disk_size_limit_reached errors. So either
                                this was an isolated incident or the file
                                size doesn't matter.
                                
                                Many users reported that the VM keeps
                                running after the BOINC Client has been shut
                                down. Lammert van der Veen did some research
                                into the cause, and I hope this can be fixed
                                by limiting each host to one concurrent VM
                                and by the 26031 wrapper, as soon as I
                                upgrade our application.
                                
                                Trickle messages worked fine with short
                                tasks. Now that we have some longer tasks,
                                the trickle-up messages have stopped; we
                                haven't received any in over a month. I think
                                I have an explanation for this:
                                In vboxwrapper the trickles are generated
                                every X seconds of cpu_time, not
                                wall_clock_time, and since vboxwrapper itself
                                is not doing much, the cpu_time increases
                                very slowly. What I want is a trickle message
                                every X hours of VM runtime! Please look into
                                this asap, because without this feature I
                                have to monitor the deadlines and extend them
                                by hand.
                                In case it helps:
                                A recent long-running task reported back with
                                cpu_time=793517.4 and
                                elapsed_time=839064.946827, but I also have a
                                task with cpu_time=4280.246 and
                                elapsed_time=378984.324896. I can't see any
                                trickle messages for either of them.
                                first: http://www.rnaworld.de/rnaworld/result.php?resultid=14920843
                                (Job Duration is always 0, Elapsed Time is increasing)
                                second: http://www.rnaworld.de/rnaworld/result.php?resultid=14921349
                                (can't see anything in stderr)
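
The requested behaviour, a trickle every X seconds of elapsed time rather than
CPU time, could be sketched as follows. This is an illustration only, not
vboxwrapper's actual code: TrickleTimer and count_trickles() are hypothetical
names, and the 60-second polling step is an assumption. The 14400-second
period is the interval shown in the 26032 log in this thread.

```cpp
#include <cassert>

// Hypothetical helper, not taken from vboxwrapper: decides when a trickle-up
// is due based on whichever time value it is fed (elapsed or CPU time).
class TrickleTimer {
public:
    explicit TrickleTimer(double period_seconds)
        : period_(period_seconds), last_sent_(0.0) {
        assert(period_seconds > 0.0);
    }

    // now_seconds: total run time so far on the chosen clock.
    // Returns true at most once per period and remembers when it fired.
    bool due(double now_seconds) {
        if (now_seconds - last_sent_ < period_) {
            return false;
        }
        last_sent_ = now_seconds;
        return true;
    }

private:
    double period_;
    double last_sent_;
};

// Counts how many trickles a run of total_seconds would produce when the
// timer is polled once every poll_seconds.
int count_trickles(double total_seconds, double period_seconds,
                   double poll_seconds) {
    assert(poll_seconds > 0.0);
    TrickleTimer timer(period_seconds);
    int sent = 0;
    for (double t = poll_seconds; t <= total_seconds; t += poll_seconds) {
        if (timer.due(t)) {
            ++sent;
        }
    }
    return sent;
}
```

For the second task above, feeding the timer elapsed_time=378984.324896 gives
26 trickles, while feeding it cpu_time=4280.246 gives none.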
                                
                                Speaking of deadlines, it would also be
                                great to update the deadline on the client.
                                I know of one user who edits his
                                client_state.xml by hand to keep the Client
                                from going into high priority mode for
                                RNAWorld when there is no need.
                                
                                Regards
                                Christian

                         

                 

         

 

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
