Simon,

In this analysis, are you concerned that 80-bit floating point registers are 
sometimes used [1] [2], while if SSE instructions are used the computation 
may only be done at 32-bit?
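
For illustration, here is a minimal standalone sketch of that effect (nothing 
ITK-specific; whether long double maps to the 80-bit x87 format depends on the 
compiler and target, e.g. MSVC maps long double to plain double):

#include <cstdio>

int main()
{
  const int n = 10000000;
  float       sumF = 0.0f;
  double      sumD = 0.0;
  long double sumL = 0.0L; // 80-bit extended on typical x86 gcc targets

  for( int i = 1; i <= n; ++i )
    {
    sumF += 1.0f / i;   // 32-bit accumulator
    sumD += 1.0  / i;   // 64-bit accumulator
    sumL += 1.0L / i;   // extended-precision accumulator (where available)
    }

  std::printf( "float:       %.17g\n",  (double)sumF );
  std::printf( "double:      %.17g\n",  sumD );
  std::printf( "long double: %.17Lg\n", sumL );
  return 0;
}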

Brad


[1] http://en.wikipedia.org/wiki/Extended_precision#IBM_extended_precision_formats
[2] http://stackoverflow.com/questions/5738448/is-float-slower-than-double-does-64-bit-program-run-faster-than-32-bit-program


On Mar 31, 2014, at 10:12 AM, Simon Alexander <[email protected]> wrote:

> Hi Matt,
> 
> As noted in my previous message, the magnitude of the divergence doesn't seem
> to be very large on typical usage patterns, so from a practical point of view
> it might be better to bound it, but I haven't done a very careful analysis
> yet. 
> 
> The most straightforward way to avoid this sort of thing is to always do the 
> sum the same way regardless of the number of threads. The simplest approach to 
> that is accumulating all the constituents into a (stably) ordered data 
> structure and performing the sum at the end.  Of course, this costs you both 
> the memory and the transaction costs, which may not be acceptable. On the 
> other hand, with enough elements you'll want some partitioning anyway for 
> accuracy, but this is best determined by the data range, not the number of 
> threads.
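> 
> To make that concrete, here is a rough sketch of what I mean (illustrative
> only, not the metric code; the parallel phase is shown serially):
> 
> #include <algorithm>
> #include <cstddef>
> #include <utility>
> #include <vector>
> 
> // Each worker records its contributions keyed by a stable element index;
> // the final sum is then performed single-threaded in index order, so the
> // result is independent of the number of threads.
> double StableThreadedSum( const std::vector<double> & values,
>                           std::size_t numThreads )
> {
>   std::vector<std::vector<std::pair<std::size_t, double> > >
>     perThread( numThreads );
> 
>   // "Parallel" phase: thread t handles a strided slice, but remembers the
>   // original index of each contribution.
>   for( std::size_t t = 0; t < numThreads; ++t )
>     {
>     for( std::size_t i = t; i < values.size(); i += numThreads )
>       {
>       perThread[t].push_back( std::make_pair( i, values[i] ) );
>       }
>     }
> 
>   // Merge into one stably ordered sequence (the memory/transaction cost).
>   std::vector<std::pair<std::size_t, double> > all;
>   for( std::size_t t = 0; t < numThreads; ++t )
>     {
>     all.insert( all.end(), perThread[t].begin(), perThread[t].end() );
>     }
>   std::sort( all.begin(), all.end() );
> 
>   // Fixed-order final sum: bit-identical for any thread count.
>   double sum = 0.0;
>   for( std::size_t i = 0; i < all.size(); ++i )
>     {
>     sum += all[i].second;
>     }
>   return sum;
> }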
> 
> Most of the other methods I've used to avoid or mitigate this in the past 
> require detailed analysis of local computation and bounding the error at each 
> step, so you can safely truncate the final result to known digits, and 
> regardless of order the final estimate will be identical.  You can sometimes 
> avoid any speed or space costs this way, but it's fiddly work and has to be 
> reevaluated with any change.
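> 
> For illustration, the truncation step might look like this hypothetical
> helper (part of the fiddliness: values that land exactly on a rounding
> boundary can still flip, so the error bound has to account for that):
> 
> #include <cmath>
> 
> // Round x to the given number of significant digits; once the accumulated
> // error is bounded well below half a unit in the last kept digit, any
> // evaluation order yields the same truncated result.
> double TruncateToSignificantDigits( double x, int digits )
> {
>   if( x == 0.0 || !std::isfinite( x ) )
>     {
>     return x;
>     }
>   const int    exponent = (int)std::floor( std::log10( std::fabs( x ) ) );
>   const double scale    = std::pow( 10.0, digits - 1 - exponent );
>   return std::floor( x * scale + 0.5 ) / scale;
> }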
> 
> 
> 
> On Sun, Mar 30, 2014 at 10:07 AM, Matt McCormick <[email protected]> 
> wrote:
> Hi Simon,
> 
> Thanks for taking a look.
> 
> Yes, your assessment is correct.  What is a strategy that would avoid
> this, though?
> 
> More eyes on the optimizers are very welcome!
> 
> Thanks,
> Matt
> 
> On Fri, Mar 28, 2014 at 4:57 PM, Simon Alexander <[email protected]> 
> wrote:
> > There is a lot going on here, and I'm not certain that I've got all the
> > moving pieces straight in my mind yet, but I've had a quick look at the
> > implementation now. I believe the Mattes v4 implementation is similar to
> > other metrics in its approach.
> >
> > As I suggested earlier in the thread: I believe accumulations like this:
> >
> >>  for( ThreadIdType threadID = 1;
> >>       threadID < this->GetNumberOfThreadsUsed(); threadID++ )
> >>    {
> >>    this->m_ThreaderJointPDFSum[0] += this->m_ThreaderJointPDFSum[threadID];
> >>    }
> >
> >
> > will guarantee that we don't have absolutely consistent results between
> > different thread counts, due to lack of associativity.
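> >
> > A quick standalone illustration of that non-associativity (not ITK code):
> >
> > #include <cstdio>
> >
> > int main()
> > {
> >   const double a = 1e16;
> >   const double b = -1e16;
> >   const double c = 1.0;
> >
> >   // Same three values, two grouping orders, two different answers.
> >   std::printf( "(a + b) + c = %g\n", ( a + b ) + c );  // prints 1
> >   std::printf( "a + (b + c) = %g\n", a + ( b + c ) );  // prints 0
> >   return 0;
> > }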
> >
> > When I perform only transform initialization and a single evaluation of the
> > metric (i.e. outside of the registration routines), I get results consistent
> > with this; for example, a center-of-mass initialization between two MR image
> > volumes gives me (double precision):
> >
> > 1 thread :  -0.396771472451519
> > 2 threads: -0.396771472450998
> > 8 threads: -0.396771472451149
> >
> > for the metric evaluation (i.e. via GetValue() of the metric)
> >
> > AFAICS, this is a consistent magnitude of delta with the above.  It means no
> > chance of binary equivalence between different thread counts/partitionings,
> > but you can do this accumulation quite a few times before the accumulated
> > divergence gets into digits worth worrying about.  This sort of thing is
> > avoidable, but at some space/speed cost.
> >
> > However, in the registration for this case it takes only about twenty steps
> > for divergence in the third significant digit between metric estimates! (via
> > registration->GetOptimizer()->GetCurrentMetricValue() )
> >
> > Clearly the optimizer is not following the same path, so I think something
> > else must be going on.
> >
> > So at this point I don't think the data partitioning of the metric is the
> > root cause, but I will have a more careful look later.
> >
> > Any holes in this analysis you can see so far?
> >
> > When I have time to get back into this, I plan to have a look at the
> > optimizer next, unless you have better suggestions of where to look.
> >
> > cheers,
> > Simon
> >
> >
> >
> > On Wed, Mar 19, 2014 at 12:56 PM, Simon Alexander <[email protected]>
> > wrote:
> >>
> >> Brian, my apologies for the typo.
> >>
> >> I assume you all are at least as busy as I am; just didn't want to leave
> >> the impression that I would definitely be able to pursue this, but I will
> >> try.
> >>
> >>
> >> On Wed, Mar 19, 2014 at 12:45 PM, brian avants <[email protected]> wrote:
> >>>
> >>> it's brian - and, yes, we all have "copious free time" of course.
> >>>
> >>>
> >>> brian
> >>>
> >>>
> >>>
> >>>
> >>> On Wed, Mar 19, 2014 at 12:43 PM, Simon Alexander <[email protected]>
> >>> wrote:
> >>>>
> >>>> Thanks for the summary Brain.
> >>>>
> >>>> A lot of partitioning issues fundamentally come down to the lack of
> >>>> associativity & distributivity of fp operations.  Not sure I can do
> >>>> anything practical to improve it, but I will have a look if I can find
> >>>> a bit of my "copious free time".
> >>>>
> >>>>
> >>>> On Wed, Mar 19, 2014 at 12:29 PM, brian avants <[email protected]> wrote:
> >>>>>
> >>>>> yes - i understand.
> >>>>>
> >>>>> * matt mccormick implemented compensated summation to address this -
> >>>>> it helps but is not a full fix (see the sketch after this list)
> >>>>>
> >>>>> * truncating floating point precision greatly reduces the effect you
> >>>>> are talking about but is unsatisfactory to most people ... not sure if
> >>>>> the functionality for that truncation was taken out of the v4 metrics
> >>>>> but it was in there at one point.
> >>>>>
> >>>>> * there may be a small and undiscovered bug that contributes to this
> >>>>> in mattes specifically but i don't think that's the issue.  we saw
> >>>>> this effect even in mean squares.  if there is a bug it may be beyond
> >>>>> just mattes.  we cannot disprove that there is a bug.  if anyone knows
> >>>>> of a way to do that, let me know.
> >>>>>
> >>>>> * any help is appreciated
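> >>>>>
> >>>>> for reference, a minimal sketch of the compensated (kahan) summation
> >>>>> idea - not matt's actual patch, and it must be compiled without
> >>>>> -ffast-math, which would optimize the correction away:
> >>>>>
> >>>>> #include <cstddef>
> >>>>> #include <vector>
> >>>>>
> >>>>> // kahan summation: carry a running correction so the low-order bits
> >>>>> // lost in each addition are fed back into the next one.
> >>>>> double CompensatedSum( const std::vector<double> & v )
> >>>>> {
> >>>>>   double sum = 0.0;
> >>>>>   double c   = 0.0;  // compensation for lost low-order bits
> >>>>>   for( std::size_t i = 0; i < v.size(); ++i )
> >>>>>     {
> >>>>>     const double y = v[i] - c;
> >>>>>     const double t = sum + y;
> >>>>>     c   = ( t - sum ) - y;  // algebraically zero; captures the
> >>>>>                             // rounding error of sum + y
> >>>>>     sum = t;
> >>>>>     }
> >>>>>   return sum;
> >>>>> }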
> >>>>>
> >>>>>
> >>>>> brian
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Mar 19, 2014 at 12:24 PM, Simon Alexander
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>> Brain,
> >>>>>>
> >>>>>> I could have sworn I had initially added a follow up email clarifying
> >>>>>> this but since I can't find it in the current quoted exchange, let me
> >>>>>> reiterate:
> >>>>>>
> >>>>>> This is not a case of different results on different systems.
> >>>>>> This is a case of different results on the same system if you use a
> >>>>>> different number of threads.
> >>>>>>
> >>>>>> So while that could possibly be some odd intrinsics issue, for
> >>>>>> example, the far more likely explanation is that data partitioning is
> >>>>>> not being handled in a way that ensures consistency.
> >>>>>>
> >>>>>> Originally I was also seeing intra-system differences due to internal
> >>>>>> precision, but that was a separate issue and has been solved.
> >>>>>>
> >>>>>> Hope that is more clear!
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Mar 19, 2014 at 12:13 PM, Simon Alexander
> >>>>>> <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Brian,
> >>>>>>>
> >>>>>>> Do you mean the generality of my AVX  internal precision problem?
> >>>>>>>
> >>>>>>> I agree that is a very common issue; the surprising thing there was
> >>>>>>> that we were already constraining the code generation in a way that
> >>>>>>> worked over the different processor generations and types we used, up
> >>>>>>> until we hit the first Haswell CPUs with AVX2 support (even though no
> >>>>>>> AVX2 instructions were generated).  Perhaps it shouldn't have
> >>>>>>> surprised me, but it took me a few tests to work that out because the
> >>>>>>> problem was confounded with the problem I discuss in this thread
> >>>>>>> (which is unrelated).  Once I separated them it was easy to spot.
> >>>>>>>
> >>>>>>> So that is a solved issue for now, but I am still interested in the
> >>>>>>> partitioning issue in the image metric, as I only have a workaround
> >>>>>>> for now.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Mar 19, 2014 at 11:24 AM, brian avants <[email protected]>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> http://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler
> >>>>>>>>
> >>>>>>>> just as an example of the generality of this problem
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> brian
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Mar 19, 2014 at 11:22 AM, Simon Alexander
> >>>>>>>> <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> Brian, Luis,
> >>>>>>>>>
> >>>>>>>>> Thanks.  I have been using Mattes as you suspect.
> >>>>>>>>>
> >>>>>>>>> I don't quite understand how precision is specifically the issue
> >>>>>>>>> with # of cores.  There are all kinds of issues with precision and
> >>>>>>>>> order of operations in numerical analysis, but often data
> >>>>>>>>> partitioning (i.e. for concurrency) schemes can be set up so that
> >>>>>>>>> the actual sums are done the same way regardless of the number of
> >>>>>>>>> workers, which keeps your final results identical.  Is there some
> >>>>>>>>> reason this can't be done for the Mattes metric?  I really should
> >>>>>>>>> look at the implementation to answer that, of course.
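> >>>>>>>>>
> >>>>>>>>> To sketch the kind of scheme I mean (illustrative only, not a
> >>>>>>>>> proposal for the actual metric code): make the partition depend
> >>>>>>>>> only on the data, never on the worker count.
> >>>>>>>>>
> >>>>>>>>> #include <algorithm>
> >>>>>>>>> #include <cstddef>
> >>>>>>>>> #include <vector>
> >>>>>>>>>
> >>>>>>>>> double ChunkedDeterministicSum( const std::vector<double> & v )
> >>>>>>>>> {
> >>>>>>>>>   // Chunk boundaries depend only on the data size.
> >>>>>>>>>   const std::size_t chunkSize = 4096;
> >>>>>>>>>   const std::size_t numChunks =
> >>>>>>>>>     ( v.size() + chunkSize - 1 ) / chunkSize;
> >>>>>>>>>
> >>>>>>>>>   std::vector<double> partial( numChunks, 0.0 );
> >>>>>>>>>
> >>>>>>>>>   // Any number of workers may process the chunks (shown serially
> >>>>>>>>>   // here); within a chunk the order is fixed by index.
> >>>>>>>>>   for( std::size_t c = 0; c < numChunks; ++c )
> >>>>>>>>>     {
> >>>>>>>>>     const std::size_t begin = c * chunkSize;
> >>>>>>>>>     const std::size_t end   = std::min( begin + chunkSize,
> >>>>>>>>>                                         v.size() );
> >>>>>>>>>     for( std::size_t i = begin; i < end; ++i )
> >>>>>>>>>       {
> >>>>>>>>>       partial[c] += v[i];
> >>>>>>>>>       }
> >>>>>>>>>     }
> >>>>>>>>>
> >>>>>>>>>   // Combine partial sums in chunk order - the same reduction tree
> >>>>>>>>>   // for any thread count, so the result is bit-identical.
> >>>>>>>>>   double sum = 0.0;
> >>>>>>>>>   for( std::size_t c = 0; c < numChunks; ++c )
> >>>>>>>>>     {
> >>>>>>>>>     sum += partial[c];
> >>>>>>>>>     }
> >>>>>>>>>   return sum;
> >>>>>>>>> }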
> >>>>>>>>>
> >>>>>>>>> Do you have a pointer to earlier discussions?  If I can find the
> >>>>>>>>> time I'd like to dig into this a bit, but I'm not sure when I'll
> >>>>>>>>> have the bandwidth.  I've "solved" this currently by constraining
> >>>>>>>>> the core count.
> >>>>>>>>>
> >>>>>>>>> Perhaps interestingly, my earlier experiments were confounded a bit
> >>>>>>>>> by a precision issue, but that had to do with intrinsics generation
> >>>>>>>>> on my compiler behaving differently on systems with AVX2 (even
> >>>>>>>>> though only AVX intrinsics were being generated).  So that made
> >>>>>>>>> things confusing at first until I separated the issues.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Mar 19, 2014 at 9:49 AM, brian avants <[email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> yes - we had several discussions about this during v4 development.
> >>>>>>>>>>
> >>>>>>>>>> experiments showed that differences are due to precision.
> >>>>>>>>>>
> >>>>>>>>>> one solution was to truncate precision to the point that is
> >>>>>>>>>> reliable.
> >>>>>>>>>>
> >>>>>>>>>> but there are problems with that too.  last i checked, this was
> >>>>>>>>>> an open problem, in general, in computer science.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> brian
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Mar 19, 2014 at 9:16 AM, Luis Ibanez
> >>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Simon,
> >>>>>>>>>>>
> >>>>>>>>>>> We are aware of some multi-threading related issues in
> >>>>>>>>>>> the registration process that result in metric values changing
> >>>>>>>>>>> depending on the number of cores used.
> >>>>>>>>>>>
> >>>>>>>>>>> Are you using the MattesMutualInformationMetric ?
> >>>>>>>>>>>
> >>>>>>>>>>> At some point it was suspected that the problem was the
> >>>>>>>>>>> result of accumulative rounding, in the contributions that
> >>>>>>>>>>> each pixel makes to the metric value.... this may or may
> >>>>>>>>>>> not be related to what you are observing.
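> >>>>>>>>>>>
> >>>>>>>>>>> A toy illustration of that suspicion (hypothetical values, not
> >>>>>>>>>>> metric code): the same per-pixel contributions pushed through
> >>>>>>>>>>> different numbers of per-thread accumulators give slightly
> >>>>>>>>>>> different totals.
> >>>>>>>>>>>
> >>>>>>>>>>> #include <cstddef>
> >>>>>>>>>>> #include <cstdio>
> >>>>>>>>>>> #include <vector>
> >>>>>>>>>>>
> >>>>>>>>>>> // Split the values across nParts accumulators, then combine,
> >>>>>>>>>>> // mimicking a per-thread reduction.
> >>>>>>>>>>> double PartitionedSum( const std::vector<double> & v, int nParts )
> >>>>>>>>>>> {
> >>>>>>>>>>>   std::vector<double> partial( nParts, 0.0 );
> >>>>>>>>>>>   for( std::size_t i = 0; i < v.size(); ++i )
> >>>>>>>>>>>     {
> >>>>>>>>>>>     partial[i % nParts] += v[i];
> >>>>>>>>>>>     }
> >>>>>>>>>>>   double sum = 0.0;
> >>>>>>>>>>>   for( int p = 0; p < nParts; ++p )
> >>>>>>>>>>>     {
> >>>>>>>>>>>     sum += partial[p];
> >>>>>>>>>>>     }
> >>>>>>>>>>>   return sum;
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> int main()
> >>>>>>>>>>> {
> >>>>>>>>>>>   std::vector<double> v;
> >>>>>>>>>>>   double x = 0.1;
> >>>>>>>>>>>   for( int i = 0; i < 100000; ++i )
> >>>>>>>>>>>     {
> >>>>>>>>>>>     x = x * 1.0000001 + 1e-7;  // arbitrary varying magnitudes
> >>>>>>>>>>>     v.push_back( x );
> >>>>>>>>>>>     }
> >>>>>>>>>>>   std::printf( "1 part : %.17g\n", PartitionedSum( v, 1 ) );
> >>>>>>>>>>>   std::printf( "2 parts: %.17g\n", PartitionedSum( v, 2 ) );
> >>>>>>>>>>>   std::printf( "8 parts: %.17g\n", PartitionedSum( v, 8 ) );
> >>>>>>>>>>>   return 0;
> >>>>>>>>>>> }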
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>    Thanks
> >>>>>>>>>>>
> >>>>>>>>>>>        Luis
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Feb 20, 2014 at 3:27 PM, Simon Alexander
> >>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> I've been finding some regressions in registration results when
> >>>>>>>>>>>> using systems with different numbers of cores (so the thread
> >>>>>>>>>>>> count is different).  This is resolved by fixing the global
> >>>>>>>>>>>> maximum number of threads.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It's difficult for me to run the identical code against 4.4.2,
> >>>>>>>>>>>> but similar experiments were run in that timeframe without
> >>>>>>>>>>>> these regressions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I recall that there were changes affecting multithreading in
> >>>>>>>>>>>> the v4 registration in the 4.5.0 release, so I thought this
> >>>>>>>>>>>> might be a side effect.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So a few questions:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Is this behaviour expected?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Am I correct that this was not the behaviour in 4.4.x ?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Does anyone who has a feel for the recent changes 4.4.2 ->
> >>>>>>>>>>>> 4.5.[0,1] have a good idea where to start looking?  I haven't
> >>>>>>>>>>>> yet dug into the multithreading architecture, but this "smells"
> >>>>>>>>>>>> like a data partitioning issue to me.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Any other thoughts?
> >>>>>>>>>>>>
> >>>>>>>>>>>> cheers,
> >>>>>>>>>>>> Simon
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> >
> >
> 
