Simon,

In this analysis, are you concerned that sometimes 80-bit floating-point registers are used [1] [2], while if SSE instructions are used the computation may only be done at 32 bits?
Brad

[1] http://en.wikipedia.org/wiki/Extended_precision#IBM_extended_precision_formats
[2] http://stackoverflow.com/questions/5738448/is-float-slower-than-double-does-64-bit-program-run-faster-than-32-bit-program

On Mar 31, 2014, at 10:12 AM, Simon Alexander <[email protected]> wrote:

> Hi Matt,
>
> As noted in a previous message, the magnitude of the divergence doesn't seem to be very large under typical usage patterns, so from a practical point of view it might be better to bound it, but I haven't done a very careful analysis yet.
>
> The most straightforward way to avoid this sort of thing is to always do the sum the same way regardless of the number of threads. The simplest approach to that is to accumulate all the constituents into a (stably) ordered data structure and perform the sum at the end. Of course, this costs you both the memory and the transaction costs, which may not be acceptable. On the other hand, with enough elements you'll want some partitioning anyway for accuracy, but that is best determined by the data range, not the number of threads.
>
> Most of the other methods I've used to avoid or mitigate this in the past require detailed analysis of the local computation and bounding the error at each step, so you can safely truncate the final result to known digits, and regardless of order the final estimate will be identical. You can sometimes avoid any speed or space costs this way, but it's fiddly work and has to be reevaluated with any change.
>
> On Sun, Mar 30, 2014 at 10:07 AM, Matt McCormick <[email protected]> wrote:
> Hi Simon,
>
> Thanks for taking a look.
>
> Yes, your assessment is correct. What is a strategy that would avoid this, though?
>
> More eyes on the optimizers are greatly welcome!
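Simon's fixed-order accumulation idea can be sketched as follows (a minimal illustration, not ITK code; the function names are hypothetical). Partition the data by index range rather than by thread, have each worker write its chunk's partial sum into a slot determined by the data order, and reduce the slots in index order. Both the chunk boundaries and the reduction order are then fixed by the data, not by the scheduler, so the result is bitwise identical regardless of how many threads computed the chunks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: partial sums stored in data order, not thread order.
// Each chunk could be computed by any thread; the slot it lands in is
// deterministic, so the final reduction always sees the same ordering.
// Assumes chunkSize > 0.
std::vector<double> chunkPartialSums(const std::vector<double> & data,
                                     std::size_t chunkSize)
{
  std::vector<double> partials;
  for (std::size_t begin = 0; begin < data.size(); begin += chunkSize)
  {
    const std::size_t end = std::min(begin + chunkSize, data.size());
    double s = 0.0;
    for (std::size_t i = begin; i < end; ++i)
    {
      s += data[i];
    }
    partials.push_back(s); // slot order = data order, not thread arrival order
  }
  return partials;
}

// Fixed left-to-right reduction over the ordered slots.
double reduceInOrder(const std::vector<double> & partials)
{
  double sum = 0.0;
  for (double p : partials)
  {
    sum += p;
  }
  return sum;
}
```

This pays the memory cost of the slot array that Simon mentions, but makes the summation order independent of the thread count.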
> Thanks,
> Matt
>
> On Fri, Mar 28, 2014 at 4:57 PM, Simon Alexander <[email protected]> wrote:
> > There is a lot going on here, and I'm not certain that I've got all the moving pieces straight in my mind yet, but I've had a quick look at the implementation now. I believe the Mattes v4 implementation is similar to the other metrics in its approach.
> >
> > As I suggested earlier in the thread, I believe accumulations like this:
> >
> >> for( ThreadIdType threadID = 1; threadID < this->GetNumberOfThreadsUsed(); threadID++ )
> >>   {
> >>   this->m_ThreaderJointPDFSum[0] += this->m_ThreaderJointPDFSum[threadID];
> >>   }
> >
> > will guarantee that we don't have absolutely consistent results between different thread counts, due to the lack of associativity.
> >
> > When I perform only transform initialization and a single evaluation of the metric (i.e. outside of the registration routines), I get results consistent with this. For example, results for a center-of-mass initialization between two MR image volumes give me (double precision):
> >
> > 1 thread : -0.396771472451519
> > 2 threads: -0.396771472450998
> > 8 threads: -0.396771472451149
> >
> > for the metric evaluation (i.e. via GetValue() of the metric).
> >
> > AFAICS, this is a consistent magnitude of delta from the above. It will mean no chance of binary equivalence between different thread counts/partitionings, but you can do this accumulation quite a few times before the accumulated divergence gets into digits to worry about. This sort of thing is avoidable, but at some space/speed cost.
> >
> > However, in the registration for this case it takes only about twenty steps for divergence in the third significant digit between metric estimates! (via registration->GetOptimizer()->GetCurrentMetricValue() )
> >
> > Clearly the optimizer is not following the same path, so I think something else must be going on.
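The inconsistency Simon describes follows directly from floating-point addition being non-associative: the per-thread reduction loop quoted above visits the partial sums in a grouping that depends on the thread count. A minimal standalone illustration (not ITK code):

```cpp
#include <cassert>

// Floating-point addition is not associative: summing the same three
// values in two different groupings can produce different doubles.
double sumLeftToRight(double a, double b, double c)
{
  return (a + b) + c;
}

double sumRightToLeft(double a, double b, double c)
{
  return a + (b + c);
}
```

With IEEE-754 doubles (assuming default rounding and no x87 excess precision), `sumLeftToRight(1e16, -1e16, 1.0)` yields 1.0 while `sumRightToLeft(1e16, -1e16, 1.0)` yields 0.0, because 1.0 is below the rounding granularity at magnitude 1e16. A thread-count-dependent reduction order perturbs the metric value in exactly this way, just at a much smaller relative scale.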
> >
> > So at this point I don't think the data partitioning of the metric is the root cause, but I will have a more careful look later.
> >
> > Any holes in this analysis you can see so far?
> >
> > When I have time to get back into this, I plan to have a look at the optimizer next, unless you have better suggestions of where to look next.
> >
> > cheers,
> > Simon
> >
> > On Wed, Mar 19, 2014 at 12:56 PM, Simon Alexander <[email protected]> wrote:
> >> Brian, my apologies for the typo.
> >>
> >> I assume you all are at least as busy as I am; I just didn't want to leave the impression that I would definitely be able to pursue this, but I will try.
> >>
> >> On Wed, Mar 19, 2014 at 12:45 PM, brian avants <[email protected]> wrote:
> >>> it's brian - and, yes, we all have "copious free time" of course.
> >>>
> >>> brian
> >>>
> >>> On Wed, Mar 19, 2014 at 12:43 PM, Simon Alexander <[email protected]> wrote:
> >>>> Thanks for the summary Brain.
> >>>>
> >>>> A lot of partitioning issues fundamentally come down to the lack of associativity & distributivity of fp operations. Not sure I can do anything practical to improve it, but I will have a look if I can find a bit of my "copious free time".
> >>>>
> >>>> On Wed, Mar 19, 2014 at 12:29 PM, brian avants <[email protected]> wrote:
> >>>>> yes - i understand.
> >>>>>
> >>>>> * matt mccormick implemented compensated summation to address this - it helps but is not a full fix
> >>>>>
> >>>>> * truncating floating point precision greatly reduces the effect you are talking about but is unsatisfactory to most people ... not sure if the functionality for that truncation was taken out of the v4 metrics but it was in there at one point.
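The compensated summation brian mentions is typically Kahan's algorithm; a minimal sketch of the technique (not the actual ITK implementation):

```cpp
#include <cassert>
#include <vector>

// Kahan compensated summation: a running compensation term recovers the
// low-order bits lost in each addition. It reduces order-dependent rounding
// error substantially but does not eliminate it, consistent with brian's
// "helps but is not a full fix".
double kahanSum(const std::vector<double> & values)
{
  double sum = 0.0;
  double c = 0.0; // compensation for lost low-order bits
  for (double v : values)
  {
    const double y = v - c;
    const double t = sum + y;
    c = (t - sum) - y; // algebraically zero; captures the rounding error
    sum = t;
  }
  return sum;
}
```

For example, the naive left-to-right sum of {1e16, 1.0, 1.0, -1e16} is 0.0 (both 1.0 contributions are absorbed at magnitude 1e16), while `kahanSum` recovers the exact answer, 2.0. Note that aggressive compiler options such as `-ffast-math` can optimize the compensation term away, so strict IEEE semantics are assumed.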
> >>>>>
> >>>>> * there may be a small and undiscovered bug that contributes to this in mattes specifically but i dont think that's the issue. we saw this effect even in mean squares. if there is a bug it may be beyond just mattes. we cannot disprove that there is a bug. if anyone knows of a way to do that, let me know.
> >>>>>
> >>>>> * any help is appreciated
> >>>>>
> >>>>> brian
> >>>>>
> >>>>> On Wed, Mar 19, 2014 at 12:24 PM, Simon Alexander <[email protected]> wrote:
> >>>>>> Brain,
> >>>>>>
> >>>>>> I could have sworn I had initially added a follow-up email clarifying this, but since I can't find it in the current quoted exchange, let me reiterate:
> >>>>>>
> >>>>>> This is not a case of different results on different systems. This is a case of different results on the same system if you use a different number of threads.
> >>>>>>
> >>>>>> So while that could possibly be some odd intrinsics issue, for example, the far more likely thing is that data partitioning is not being handled in a way that ensures consistency.
> >>>>>>
> >>>>>> Originally I was also seeing inter-system differences due to internal precision, but that was a separate issue and has been solved.
> >>>>>>
> >>>>>> Hope that is more clear!
> >>>>>>
> >>>>>> On Wed, Mar 19, 2014 at 12:13 PM, Simon Alexander <[email protected]> wrote:
> >>>>>>> Brian,
> >>>>>>>
> >>>>>>> Do you mean the generality of my AVX internal precision problem?
> >>>>>>>
> >>>>>>> I agree that is a very common issue; the surprising thing there was that we were already constraining the code generation in a way that worked over the different processor generations and types we used, up until we hit the first Haswell cpus with AVX2 support (even though no AVX2 instructions were generated). Perhaps it shouldn't have surprised me, but it took me a few tests to work that out because the problem was confounded with the problem I discuss in this thread (which is unrelated). Once I separated them it was easy to spot.
> >>>>>>>
> >>>>>>> So that is a solved issue for now, but I am still interested in the partitioning issue in the image metric, as I only have a workaround for now.
> >>>>>>>
> >>>>>>> On Wed, Mar 19, 2014 at 11:24 AM, brian avants <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> http://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler
> >>>>>>>>
> >>>>>>>> just as an example of the generality of this problem
> >>>>>>>>
> >>>>>>>> brian
> >>>>>>>>
> >>>>>>>> On Wed, Mar 19, 2014 at 11:22 AM, Simon Alexander <[email protected]> wrote:
> >>>>>>>>> Brian, Luis,
> >>>>>>>>>
> >>>>>>>>> Thanks. I have been using Mattes, as you suspect.
> >>>>>>>>>
> >>>>>>>>> I don't quite understand how precision is specifically the issue with the # of cores. There are all kinds of issues with precision and order of operations in numerical analysis, but often data partitioning (i.e. for concurrency) schemes can be set up so that the actual sums are done the same way regardless of the number of workers, which keeps your final results identical.
> >>>>>>>>> Is there some reason this can't be done for the Mattes metric? I really should look at the implementation to answer that, of course.
> >>>>>>>>>
> >>>>>>>>> Do you have a pointer to the earlier discussions? If I can find the time I'd like to dig into this a bit, but I'm not sure when I'll have the bandwidth. I've "solved" this currently by constraining the core count.
> >>>>>>>>>
> >>>>>>>>> Perhaps interestingly, my earlier experiments were confounded a bit by a precision issue, but that had to do with intrinsics generation on my compiler behaving differently on systems with AVX2 (even though only AVX intrinsics were being generated). So that made things confusing at first until I separated the issues.
> >>>>>>>>>
> >>>>>>>>> On Wed, Mar 19, 2014 at 9:49 AM, brian avants <[email protected]> wrote:
> >>>>>>>>>> yes - we had several discussions about this during v4 development.
> >>>>>>>>>>
> >>>>>>>>>> experiments showed that differences are due to precision.
> >>>>>>>>>>
> >>>>>>>>>> one solution was to truncate precision to the point that is reliable.
> >>>>>>>>>>
> >>>>>>>>>> but there are problems with that too. last i checked, this was an open problem, in general, in computer science.
> >>>>>>>>>>
> >>>>>>>>>> brian
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Mar 19, 2014 at 9:16 AM, Luis Ibanez <[email protected]> wrote:
> >>>>>>>>>>> Hi Simon,
> >>>>>>>>>>>
> >>>>>>>>>>> We are aware of some multi-threading related issues in the registration process that result in metric values changing depending on the number of cores used.
> >>>>>>>>>>>
> >>>>>>>>>>> Are you using the MattesMutualInformationMetric ?
> >>>>>>>>>>>
> >>>>>>>>>>> At some point it was suspected that the problem was the result of cumulative rounding in the contributions that each pixel makes to the metric value... this may or may not be related to what you are observing.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks
> >>>>>>>>>>>
> >>>>>>>>>>> Luis
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Feb 20, 2014 at 3:27 PM, Simon Alexander <[email protected]> wrote:
> >>>>>>>>>>>> I've been finding some regressions in registration results when using systems with different numbers of cores (so the thread count is different). This is resolved by fixing the global max.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It's difficult for me to run the identical code against 4.4.2, but similar experiments were run in that timeframe without these regressions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I recall that there were changes affecting multithreading in the v4 registration in the 4.5.0 release, so I thought this might be a side effect.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So a few questions:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Is this behaviour expected?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Am I correct that this was not the behaviour in 4.4.x ?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Does anyone who has a feel for the recent changes 4.4.2 -> 4.5.[0,1] have a good idea where to start looking? I haven't yet dug into the multithreading architecture, but this "smells" like a data partitioning issue to me.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Any other thoughts?
> >>>>>>>>>>>>
> >>>>>>>>>>>> cheers,
> >>>>>>>>>>>> Simon
> >>>>>>>>>>>>
> >>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>> Powered by www.kitware.com
> >>>>>>>>>>>>
> >>>>>>>>>>>> Visit other Kitware open-source projects at
> >>>>>>>>>>>> http://www.kitware.com/opensource/opensource.html
> >>>>>>>>>>>>
> >>>>>>>>>>>> Kitware offers ITK Training Courses, for more information visit:
> >>>>>>>>>>>> http://kitware.com/products/protraining.php
> >>>>>>>>>>>>
> >>>>>>>>>>>> Please keep messages on-topic and check the ITK FAQ at:
> >>>>>>>>>>>> http://www.itk.org/Wiki/ITK_FAQ
> >>>>>>>>>>>>
> >>>>>>>>>>>> Follow this link to subscribe/unsubscribe:
> >>>>>>>>>>>> http://www.itk.org/mailman/listinfo/insight-developers
> >>>>>>>>>>>>
> >>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>> Community mailing list
> >>>>>>>>>>>> [email protected]
> >>>>>>>>>>>> http://public.kitware.com/cgi-bin/mailman/listinfo/community
