Re: [bug #14182] System tests failures depend on the actual machine

Sébastien Morin Tue, 16 Mar 2010 07:45:15 -0700

Hi Edward,

I just tested the Gentoo machines again using relax-1.3.4 and minfx-1.0.2.


Three 32-bit machines I tested completed the test suite without any
errors. Two other machines I previously had access to are no longer
available...

However, one 64-bit machine failed one test, always with the same
values:

====================
FAIL: Constrained Newton opt, GMW Hessian mod, More and Thuente line 
search {S2=0.970, te=2048, Rex=0.149}

...

relax> minimise(*args=('newton',), func_tol=1e-25, 
max_iterations=10000000, constraints=True, scaling=True, verbosity=1)
Simulation 1
Simulation 2
Simulation 3

relax> monte_carlo.error_analysis(prune=0.0)
Traceback (most recent call last):
   File "/home/semor/relax-1.3.4/test_suite/system_tests/model_free.py", 
line 610, in test_opt_constr_newton_gmw_mt_S2_0_970_te_2048_Rex_0_149
     self.value_test(spin, select, s2, te, rex, chi2, iter, f_count, 
g_count, h_count, warning)
   File "/home/semor/relax-1.3.4/test_suite/system_tests/model_free.py", 
line 1110, in value_test
     self.assertEqual(spin.f_count, f_count, msg=mesg)
AssertionError: Optimisation failure.

System: Linux
Release: 2.6.20-gentoo-r7
Version: #1 SMP Sat Apr 28 23:31:52 Local time zone must be set--see zic
Win32 version:
Distribution: gentoo 1.12.13
Architecture: 64bit ELF
Machine: x86_64
Processor: Intel(R) Xeon(R) CPU 5160 @ 3.00GHz
Python version: 2.6.4
numpy version: 1.3.0


s2:       0.9699999999999994
te:       2048.0000000000446
rex:      0.14900000000001615
chi2:     8.3312601381368332e-28
iter:     22
f_count:  91
g_count:  91
h_count:  22
warning:  None
====================
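
One possible way to relax this check, as a minimal sketch only: compare the
float parameters within a tolerance and compare the counts against a set of
values already seen on machines that converge correctly. The name
relaxed_value_test, the tolerances and the ACCEPTED_COUNTS sets are
hypothetical and not part of relax. Only the OBSERVED numbers come from the
report above, and the reference counts already in model_free.py are not
reproduced here.

```python
import math

# Target values taken from the test name {S2=0.970, te=2048, Rex=0.149}.
EXPECTED = {"s2": 0.970, "te": 2048.0, "rex": 0.149}

# Values reported by the failing 64-bit machine.
OBSERVED = {
    "s2": 0.9699999999999994,
    "te": 2048.0000000000446,
    "rex": 0.14900000000001615,
    "chi2": 8.3312601381368332e-28,
    "iter": 22, "f_count": 91, "g_count": 91, "h_count": 22,
}

# Hypothetical: one set of accepted counts per statistic, to be extended each
# time a new machine reports different (but still converged) numbers.
ACCEPTED_COUNTS = {"iter": {22}, "f_count": {91}, "g_count": {91}, "h_count": {22}}


def relaxed_value_test(observed, expected, accepted_counts,
                       rel_tol=1e-8, chi2_max=1e-20):
    """Return a list of mismatch messages; an empty list means the run is accepted."""
    problems = []
    # Float parameters: accept anything within a relative tolerance of the target.
    for name, target in expected.items():
        if not math.isclose(observed[name], target, rel_tol=rel_tol):
            problems.append("%s: %r differs from %r" % (name, observed[name], target))
    # chi2 should be essentially zero at the minimum, so only bound it from above.
    if observed["chi2"] > chi2_max:
        problems.append("chi2: %r exceeds %g" % (observed["chi2"], chi2_max))
    # Counts: accept any value already known from a converged run.
    for name, accepted in accepted_counts.items():
        if observed[name] not in accepted:
            problems.append("%s: %r not in %s" % (name, observed[name], sorted(accepted)))
    return problems


print(relaxed_value_test(OBSERVED, EXPECTED, ACCEPTED_COUNTS))
```

Returning the list of mismatches rather than raising on the first one makes
it easier to see at a glance which statistics differ on a given machine.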

Regards,


Séb  :)



On 10-02-21 9:00 AM, Edward d'Auvergne wrote:
> Is it different for the different machines, or is it different each
> time on the same machine?  If you give a range of numbers for the
> optimisation results, these tests could be relaxed a little.
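
To separate the two cases asked about here (different between machines, or
different between runs on one machine), the same optimisation can be repeated
several times in one session and the distinct results collected. This is a
rough sketch only: a toy Newton iteration stands in for the relax model-free
optimisation, and the names newton_sqrt and distinct_results are hypothetical.

```python
def newton_sqrt(target, x0=1.0, func_tol=1e-12, max_iterations=1000):
    """Newton iteration on f(x) = x**2 - target, counting function calls."""
    x = x0
    f_count = 0
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        f = x * x - target
        f_count += 1
        if abs(f) <= func_tol:
            break
        # Newton step using the analytic gradient f'(x) = 2x.
        x = x - f / (2.0 * x)
    return x, iteration, f_count


def distinct_results(runs=5):
    """Repeat the same fixed optimisation and collect the distinct outcomes."""
    results = set()
    for _ in range(runs):
        results.add(newton_sqrt(2.0))
    return results


outcomes = distinct_results()
print(len(outcomes), outcomes)
```

A single distinct outcome means the run is repeatable on that machine, so any
remaining differences are between machines. More than one outcome from a
fixed code path with no random numbers points at the machine itself.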
>
> Cheers,
>
> Edward
>
>
> On 21 February 2010 14:41, Sébastien Morin <[email protected]> wrote:
>    
>> Hi Ed,
>>
>> I agree with you that this is not an important issue given the small
>> variations observed...
>>
>> I was just still a bit annoyed by this happening on our Gentoo systems...
>>
>> But maybe this is just because of Gentoo itself, as in Gentoo almost
>> everything is compiled locally, so every system is different because of all
>> the variables that can be changed that affect compilation...
>>
>> Ok, let's forget all this !
>>
>> Regards,
>>
>>
>> Séb
>>
>>
>> On 10-02-21 8:32 AM, Edward d'Auvergne wrote:
>>      
>>> Hi,
>>>
>>> The code is not parallelised as most optimisation algorithms are not
>>> amenable to parallelisation.  There's a lot of research in that field,
>>> but the code here is not along these lines.  Do you still see this
>>> problem?  Maybe it is a bug in this specific version of the GCC
>>> compiler which created the python executable?  Does it occur on
>>> machines with different Gentoo versions installed?  Can you
>>> reproduce the error in a virtual machine?  This is a fixed code path
>>> and cannot in any way be different upon different runs of the test
>>> suite.  It doesn't change on all the Mandriva installs I have, all the
>>> Macs it has been tested on, or even on the Windows virtual image I use
>>> to build and test relax on Windows.  I've even tested it on Solaris
>>> without problems!  In any case, this bug is definitely machine
>>> specific and not related to relax itself.  Sorry, I don't know what
>>> else I can do to try to track this down.  Maybe your CPUs are doing
>>> some strange frequency scaling depending on load, and that is causing
>>> this bizarre behaviour?  In any case, this is not an issue for relax
>>> execution and only affects the precision of optimisation in a small
>>> way.
>>>
>>> Regards,
>>>
>>> Edward
>>>
>>>
>>>
>>> On 21 February 2010 05:34, Sébastien Morin <[email protected]> wrote:
>>>
>>>        
>>>> Hi Ed,
>>>>
>>>> It has been a long time since we discussed this...
>>>>
>>>> However, talking with Olivier last week, we came up with one possible
>>>> explanation for this issue. Is the code in question parallelized in some
>>>> way, i.e. are there multiple processes running at the same time with
>>>> their results being combined subsequently? If so, there could be
>>>> conditions under which the problem arises because variations in
>>>> allocated memory or CPU change the timing between the different
>>>> processes, hence affecting the final result...
>>>>
>>>> Does that make sense?
>>>>
>>>> Olivier, is this what you explained to me last week?
>>>>
>>>>
>>>> Sébastien
>>>>
>>>>
>>>> On 09-09-14 3:30 AM, Edward d'Auvergne wrote:
>>>>
>>>>          
>>>>> Hi,
>>>>>
>>>>> I've been trying to work out what is happening, but it is a complete
>>>>> mystery to me.  The algorithms are fixed in stone - I coded them
>>>>> myself and you can see it in the minfx code.  They are standard
>>>>> optimisation algorithms that obey fixed rules.  On the same machine it
>>>>> must, without question, give the same result every time!  If it
>>>>> doesn't, something is wrong with the machine, either hardware or
>>>>> software.  Would it be possible to install an earlier python and numpy
>>>>> version (maybe 2.5 and 1.2.1 respectively) to see if that makes a
>>>>> difference?  Or maybe it is the Linux kernel doing some strange things
>>>>> with the CPU - maybe switching between power profiles causing the CPU
>>>>> floating point math precision to change?  Are you 100% sure that all
>>>>> computers give variable results (between each run), and not that they
>>>>> just give a different fixed result each time?  Maybe there is a
>>>>> non-fatal kernel bug not triggered by Olivier's hardware?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Edward
>>>>>
>>>>>
>>>>> P.S.  A note to others reading this - this problem is not serious for
>>>>> relax's optimisation!
>>>>>
>>>>>
>>>>> 2009/9/4 Sébastien Morin <[email protected]>:
>>>>>
>>>>>
>>>>>            
>>>>>> Hi Ed,
>>>>>>
>>>>>> (I added Olivier Fisette in CC as he is quite computer knowledgeable and
>>>>>> could help us rationalize this issue...)
>>>>>>
>>>>>> This strange behavior was observed on my laptop and on the two other
>>>>>> computers in the lab with the failures in the system tests (i.e. on the
>>>>>> three computers of the bug report).
>>>>>>
>>>>>> I performed some of the different tests proposed on the following page:
>>>>>>     -> http://www.gentoo.org/doc/en/articles/hardware-stability-p1.xml
>>>>>>           (tested CPU with infinite rebuild of kernel using gcc for 4 hours)
>>>>>>           (tested CPU with cpuburn-1.4 for XXXX hours)
>>>>>>           (tested RAM with memtester-4.0.7 for > 6 hours)
>>>>>> to check the CPU and RAM, but did not find anything... Of course, these
>>>>>> tests may not have uncovered potential problems in my CPU and RAM, but
>>>>>> most likely they are fine. Moreover, since the problem is observed on
>>>>>> three different computers, it would be surprising for hardware failures
>>>>>> to occur in all three machines...
>>>>>>
>>>>>> The three systems run Gentoo Linux with kernel-2.6.30, numpy-1.3.0 and
>>>>>> python-2.6.2. However, the fourth computer to which I have access (for
>>>>>> Olivier: this computer is 'hibou'), and which passes the system tests
>>>>>> properly, also runs Gentoo Linux with kernel 2.6.30, numpy-1.3.0 and
>>>>>> python-2.6.2...
>>>>>>
>>>>>> A potential option could be that some kernel configuration is causing
>>>>>> these problems...
>>>>>>
>>>>>> Another option would be that, although the algorithms are supposedly
>>>>>> fixed, they are not...
>>>>>>
>>>>>> I could check if the calculations diverge always at the same step and,
>>>>>> if so, try to see what function is problematic...
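
A rough sketch of how such a check could look, assuming each run can be made
to log one value per iteration (for example chi2) into a list. The function
name first_divergence and the sample traces are hypothetical, and nothing here
is part of relax or minfx.

```python
from itertools import zip_longest


def first_divergence(trace_a, trace_b):
    """Return the index of the first step where two traces differ, or None.

    The comparison is exact on purpose: a fixed code path on one machine
    should reproduce every value bit for bit. A shorter trace counts as
    diverging at the step where it ends.
    """
    for step, (a, b) in enumerate(zip_longest(trace_a, trace_b)):
        if a != b:
            return step
    return None


# Made-up per-iteration values from two runs of the same optimisation.
run_1 = [10.0, 2.5, 0.625, 0.15625]
run_2 = [10.0, 2.5, 0.625000001, 0.2]
print(first_divergence(run_1, run_2))
```

If the reported step is the same for every pair of runs, the function called
at that step is the place to look first.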
>>>>>>
>>>>>> Other ideas ?
>>>>>>
>>>>>> Do you know any other minimisation library with which I could test to
>>>>>> see if these computers indeed give rise to changing results or if this
>>>>>> is limited to relax (and minfx) ?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>>
>>>>>> Séb  :)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Edward d'Auvergne wrote:
>>>>>>
>>>>>>
>>>>>>              
>>>>>>> Hi,
>>>>>>>
>>>>>>> This is very strange, very strange indeed!  I've never seen anything
>>>>>>> quite like this.  Is it only your laptop that is giving this variable
>>>>>>> result?  I'm pretty sure that it's not related to a random seed
>>>>>>> because the optimisation at no point uses random numbers - it is 100%
>>>>>>> fixed, pre-determined, etc. and should never, ever vary (well on
>>>>>>> different machines it will change, but never on the same machine).
>>>>>>> What is the operating system on the laptop?  Can you run a ram
>>>>>>> checking program or anything else to diagnose hardware failures?
>>>>>>> Maybe the CPU is overheating?  Apart from hardware problems, since you
>>>>>>> never recompile Python or numpy between these tests I cannot think of
>>>>>>> anything else that could possibly cause this.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Edward
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2009/9/3 Sébastien Morin <[email protected]>:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>                
>>>>>>>> Hi Ed,
>>>>>>>>
>>>>>>>> I've just tried what you proposed and observed something quite
>>>>>>>> strange...
>>>>>>>>
>>>>>>>> Here are the results:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                  
>>>>>>>>> ./relax scripts/optimisation_testing.py > /dev/null
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                    
>>>>>>>>   (stats from my laptop, different trials, see below)
>>>>>>>>     iter      161   147   151
>>>>>>>>     f_count   765   620   591
>>>>>>>>     g_count   168   152   158
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                  
>>>>>>>>> ./relax -s
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                    
>>>>>>>>   (stats from my laptop, different trials, see below)
>>>>>>>>     iter      146   159   160   159
>>>>>>>>     f_count   708   721   649   673
>>>>>>>>     g_count   152   166   167   166
>>>>>>>>
>>>>>>>>
>>>>>>>> Problem 1:
>>>>>>>> The results should be the same in both situations, right ?
>>>>>>>>
>>>>>>>> Problem 2:
>>>>>>>> The results should not vary when the test is done multiple times, right ?
>>>>>>>>
>>>>>>>>
>>>>>>>> I have tested different things to find out why the tests give rise to
>>>>>>>> different results as a function of time...
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                  
>>>>>>>>> ./relax scripts/optimisation_testing.py > /dev/null
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                    
>>>>>>>>     If you modify the file "test_suite/system_tests/__init__.py", then
>>>>>>>> the result will be different. By modifying, I mean just commenting out a
>>>>>>>> few lines in the run() function. (I usually do that when I want to speed
>>>>>>>> up the process of testing a specific issue.) Maybe this behavior is
>>>>>>>> related to a random seed based on the code files...
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                  
>>>>>>>>> ./relax -s
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                    
>>>>>>>>     This one varies as a function of time without any change. Just doing
>>>>>>>> the test several times in a row will have it varying... Maybe this
>>>>>>>> behavior is related to a random seed based on the date and time...
>>>>>>>>
>>>>>>>>
>>>>>>>> Any idea ?
>>>>>>>>
>>>>>>>> If you want, Ed, I could create an account for you on one of these
>>>>>>>> strange-behaving computers...
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>>
>>>>>>>> Séb
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Edward d'Auvergne wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                  
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I've now written a script so that you can fix this.  Try running:
>>>>>>>>>
>>>>>>>>> ./relax scripts/optimisation_testing.py > /dev/null
>>>>>>>>>
>>>>>>>>> This will give you all the info you need, formatted ready for copying
>>>>>>>>> and pasting into the correct file.  This is currently only
>>>>>>>>> 'test_suite/system_tests/model_free.py'.  Just paste the pre-formatted
>>>>>>>>> python comment into the correct test, and add the different values to
>>>>>>>>> the list of values checked.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Edward
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2009/9/3 Sébastien Morin <[email protected]>:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                    
>>>>>>>>>> Hi Ed,
>>>>>>>>>>
>>>>>>>>>> I just checked my original mail
>>>>>>>>>> (https://mail.gna.org/public/relax-devel/2009-05/msg00003.html).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> For the failure "FAIL: Constrained BFGS opt, backtracking line search
>>>>>>>>>> {S2=0.970, te=2048, Rex=0.149}", the counts were initially as follows:
>>>>>>>>>>     f_count   386
>>>>>>>>>>     g_count   386
>>>>>>>>>> and are now:
>>>>>>>>>>     f_count   743   694   761
>>>>>>>>>>     g_count   168   172   164
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> For the failure "FAIL: Constrained BFGS opt, More and Thuente line
>>>>>>>>>> search {S2=0.970, te=2048, Rex=0.149}", the counts were initially as
>>>>>>>>>> follows:
>>>>>>>>>>     f_count   722
>>>>>>>>>>     g_count   164
>>>>>>>>>> and are now:
>>>>>>>>>>     f_count   375   322   385
>>>>>>>>>>     g_count   375   322   385
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The different values given for the "just-measured" parameters
>>>>>>>>>> correspond to the 3 different computers I have access to that give
>>>>>>>>>> rise to these two annoying failures...
>>>>>>>>>>
>>>>>>>>>> I wonder if the names of the tests in the original mail were not
>>>>>>>>>> mixed up, as the numbers just measured in the second test seem closer
>>>>>>>>>> to those originally posted for the first test, and vice versa...
>>>>>>>>>>
>>>>>>>>>> Anyway, the problem is that there are variations between the different
>>>>>>>>>> machines. Variations are also present for the other parameters (s2,
>>>>>>>>>> te, rex, chi2, iter).
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Séb  :)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Edward d'Auvergne wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                      
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Could you check and see if the numbers are exactly the same as in
>>>>>>>>>>> your original email
>>>>>>>>>>> (https://mail.gna.org/public/relax-devel/2009-05/msg00003.html)?
>>>>>>>>>>> Specifically look at f_count and g_count.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> Edward
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2009/9/2 Sébastien Morin <[email protected]>:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>                        
>>>>>>>>>>>> Hi Ed,
>>>>>>>>>>>>
>>>>>>>>>>>> I updated my svn copies to r9432 and checked if the problem was still
>>>>>>>>>>>> present.
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately, it is still present...
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Séb
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Edward d'Auvergne wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>                          
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ah, yes, there is a reason.  I went through and fixed a series of
>>>>>>>>>>>>> these optimisation difference issues - in my local svn copy.  I
>>>>>>>>>>>>> collected these all together and committed them as one after I had
>>>>>>>>>>>>> closed the bugs.  This was a few minutes ago at r9426.  If you update
>>>>>>>>>>>>> and test now, it should work.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Edward
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2009/9/2 Sébastien Morin <[email protected]>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>                            
>>>>>>>>>>>>>> Hi Ed,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I just tested for the presence of this bug (1.3 repository, r9425)
>>>>>>>>>>>>>> and it seems it is still there...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there a reason why it was closed ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                              
>>>>>>>>>>>>>>>    From the data I have, I guess this bug report should be
>>>>>>>>>>>>>>> re-opened.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                                
>>>>>>>>>>>>>> Maybe I could try to give more details to help debugging...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Séb  :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Edward d'Auvergne wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                              
>>>>>>>>>>>>>>> Update of bug #14182 (project relax):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                    Status:               Confirmed => Fixed
>>>>>>>>>>>>>>>               Assigned to:                    None => bugman
>>>>>>>>>>>>>>>               Open/Closed:                    Open => Closed
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>      _______________________________________________________
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Reply to this item at:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    <http://gna.org/bugs/?14182>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>    Message sent via/by Gna!
>>>>>>>>>>>>>>>    http://gna.org/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                                
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Sébastien Morin
>>>>>>>>>>>>>> PhD Student
>>>>>>>>>>>>>> S. Gagné NMR Laboratory
>>>>>>>>>>>>>> Université Laval & PROTEO
>>>>>>>>>>>>>> Québec, Canada
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                              
>>>>>>>>>>>> --
>>>>>>>>>>>> Sébastien Morin
>>>>>>>>>>>> PhD Student
>>>>>>>>>>>> S. Gagné NMR Laboratory
>>>>>>>>>>>> Université Laval & PROTEO
>>>>>>>>>>>> Québec, Canada
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>                          
>>>>>>>>>> --
>>>>>>>>>> Sébastien Morin
>>>>>>>>>> PhD Student
>>>>>>>>>> S. Gagné NMR Laboratory
>>>>>>>>>> Université Laval & PROTEO
>>>>>>>>>> Québec, Canada
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                      
>>>>>>
>>>>>>              
>>>>>
>>>>>            
>>>> --
>>>> Sébastien Morin
>>>> PhD Student
>>>> S. Gagné NMR Laboratory
>>>> Université Laval & PROTEO
>>>> Québec, Canada
>>>>
>>>>
>>>>
>>>>          
>> --
>> Sébastien Morin
>> PhD Student
>> S. Gagné NMR Laboratory
>> Université Laval & PROTEO
>> Québec, Canada
>>
>>
>>      

-- 
Sébastien Morin
PhD Student
S. Gagné NMR Laboratory
Université Laval & PROTEO
Québec, Canada


_______________________________________________
relax (http://nmr-relax.com)

This is the relax-devel mailing list
[email protected]

To unsubscribe from this list, get a password
reminder, or change your subscription options,
visit the list information page at
https://mail.gna.org/listinfo/relax-devel
