I encountered similar funny bugs in the past. It was usually one or more of - read/write out of bounds - race condition or other multithread effects, e.g. interaction between TBB/OpenMP/Threads/Blas/...) - dangling reference (mainly with clang) - compiler bug (somewhat in decreasing order of likelihood).
I would try to - disable parallelization as much as possible - run in release mode instead of debug: may give some hint if it is a race condition / timing issue - use asan/tsan: I've found the sanitizers more reliable than valgrind, in particular for multithreaded code, although I haven't used those with deal.II (requires recompilation) - debug both variants in parallel and check for differences Some more desparate attempts: - try different compiler - set variables to some fixed values here and there - output to stringstream instead of file - insert sleep/wait commands (although probably none of these really helps to identify the problem) You can also define _GLIBCXX_ASSERTIONS to enable bound checks for operator[] in libstdc++. Good luck! [email protected] schrieb am Donnerstag, 30. März 2023 um 20:18:31 UTC+2: > Hello, > > Usually when I have this kind of bug, there are two possibilities: > 1. I am using an un-initialized value > 2. I am writing out of bound > > What I do is using valgrind with my code in Debug mode and without TBB > enabled otherwise you get difficult to understand backtrace like here. > However, what I've also encountered is that in Debug mode, the compiler > would initialize the variable but it wouldn't do it in Release... So make > sure that all the variables are initialized and when using a std::vector, > use at() instead of operator[]. Only use at() in Debug mode because it is > much slower than operator[]. Other than that rerun valgrind without TBB > enabled, you should get something more meaningful. > > Best, > > Bruno > > > On Wednesday, March 29, 2023 at 12:41:24 PM UTC-4 Simon wrote: > >> Too early pleased -- the issue still persists. >> >> As described above, adding the reference symbol & to JxW and global_index >> makes a difference in my program, but is not the solution. >> >> I ran my program in debug mode using >> valgrind --tool=memcheck --leak-check=full ./my_program >> >> The valgrind output associated with >> void fun(const Tensor<2,3 & F, >> const double JxW, >> const unsigned int global_index, >> std::vector<Vector<double>> & sensitivity_matrix) >> { >> // use F and JxW to write some values into sensitivity_matrix, which >> is passed by reference since it is a large matrix >> if(print_or_not==true) std::cout<<"print anything here..."<<std::endl >> } >> is >> [image: memcheck_NotOutC.png] >> >> The valgrind output associated with >> void fun(const Tensor<2,3 & F, >> const double JxW, >> const unsigned int global_index, >> std::vector<Vector<double>> & sensitivity_matrix) >> { >> // use F and JxW to write some values into sensitivity_matrix, which >> is passed by reference since it is a large matrix >> // if(print_or_not==true) std::cout<<"print anything >> here..."<<std::endl >> } >> [image: memcheck_outC.png] >> >> The differences are "only" possibly lost messages, but the backtraces are >> not really clear to me. >> Do you see any problems with my code by inspecting the memory check? >> >> Based on your experience, I would appreciate your feedback regarding how >> to debug my problem further. >> Adding print statements clearly is not really helpful if my program shows >> undefined behavior and >> if the print statement itself is what causes differences. >> >> Thank you! >> >> On Wednesday, March 29, 2023 at 3:02:52 PM UTC+2 Simon wrote: >> >> Of course, Approach B is the right way to go if I want to produce >> reliable results. >> >> After plenty of hours of debugging, I found a possible source for the >> differences: >> The member function called by my assembly routine has the following >> signature >> void fun(const Tensor<2,3 & F, >> const double JxW, >> const unsigned int global_index, >> std::vector<Vector<double>> & sensitivity_matrix) >> { >> //use F and JxW to write some values into sensitivity_matrix at >> global_index >> } >> >> To debug my problem, I wrapped the << statement into an if-condition that >> I controlled via my parameter file: >> { >> // use F and JxW to write some values into sensitivity_matrix, which >> is passed by reference since it is a large matrix >> if(print_or_not==true) std::cout<<"print anything here..."<<std::endl >> } >> Interestingly, setting print_or_not==false or just out-commenting the >> line changes some results of my program. >> In other words, the output obtained by >> { >> // use F and JxW to write some values into sensitivity_matrix, which >> is passed by reference since it is a large matrix >> if(false) std::cout<<"print anything here..."<<std::endl >> } >> is different from the output obtained by >> { >> // use F and JxW to write some values into sensitivity_matrix, which >> is passed by reference since it is a large matrix >> // if(false) std::cout<<"print anything here..."<<std::endl >> } >> >> Aside, adding more << statements inside fun changed the output again. >> By output, I refer to sensitivity_matrix after assembly has finished. >> Differences can be seen only for some entries of sensitivity_matrix and, >> as for them, only the 13th or higher digits after the decimal point. >> >> Then, I changed the signature of fun to >> void fun(const Tensor<2,3 & F, >> const double *& *JxW, >> const unsigned int *&* global_index >> std::vector<Vector<double>> & sensitivity_matrix) >> Adding the reference symbol & to JxW and global_index, >> the program output is identical, regardless of the number of print >> statements -- as it should be the case. >> Within the assembly function (which calls fun()), JxW and global_index >> are defined as always in a deal.ii program: >> const double JxW=fe_values.JxW(k); >> std::vector<types::global_dof_index> local_dof_indices(...); -> >> cell->get_dof_indices(local_dof_indices); >> gobal_index = local_dof_indices[i]; >> >> Could the missing reference symbol & be an explanation for why I observed >> the undefined behavior of my program? >> It clearly changes something, but I do not really understand why that >> might be the case. >> >> Best >> Simon >> >> >> >> >> >> >> >> Am Fr., 24. März 2023 um 22:38 Uhr schrieb Wolfgang Bangerth < >> [email protected]>: >> >> On 3/24/23 11:04, Simon wrote: >> > >> > Having all that said, is there any reasonable argument as to what could >> > cause the program output to be different by just adding ...<<... >> statement? >> >> There is no reasonable argument why that should be the case. The >> function you are calling looks like this: >> >> template <int rank_, int dim, typename Number> >> inline std::ostream & >> operator<<(std::ostream &out, const Tensor<rank_, dim, Number> &p) >> { >> for (unsigned int i = 0; i < dim; ++i) >> { >> out << p[i]; >> if (i != dim - 1) >> out << ' '; >> } >> >> return out; >> } >> >> It *should* have no side effects. If it does, there is something funny >> going on in your program. >> >> That leaves a philosophical question: >> >> > Admittedly, I do not know how to debug this further, because there is >> in >> > theory >> > nothing to debug if I can trace the differences back to the ...<<... >> > statement. >> >> Approach A is to say "Well, let me just remove the output statement >> then; I know that my program works in that case". Approach B is to say >> "This situation illustrates that I don't understand something >> fundamental about my program. Let me use the occasion to investigate". >> Everyone who has been following this mailing list for a while will >> suspect that I lean quite heavily towards Approach B. If the state of >> the program changes in ways I do not understand, and that make no sense >> to me, it is not unreasonable to suspect that there is a conceptual bug >> somewhere, however unpleasant it may be to search for it, and that not >> investigating the cause for the issue is really just closing one's eyes >> towards a problem that is clearly there. >> >> In your case, I would start by outputting vastly more intermediate >> results (up to machine precision) to see where the first time is that >> something is different between the program with and the one without the >> << statement. Maybe you can identify which variable it is that first >> changes, and if you know that, it may be easier to also see *why* it >> changes. You could wrap the operator<< statement in some kind of >> if-condition that you can control from outside the program, and then run >> your program twice every time, once with and without the output >> statement. Pipe the output of each run into two files that you can then >> run 'diff' on. >> >> From experience, this is not likely going to be a bug that is fun to >> chase down. But it might make for a good learning opportunity. It would >> be quite interesting to hear back here if you find out what caused the >> issue! >> >> Best >> W. >> >> -- >> ------------------------------------------------------------------------ >> Wolfgang Bangerth email: [email protected] >> www: http://www.math.colostate.edu/~bangerth/ >> >> -- >> The deal.II project is located at http://www.dealii.org/ >> For mailing list/forum options, see >> https://groups.google.com/d/forum/dealii?hl=en >> --- >> You received this message because you are subscribed to the Google Groups >> "deal.II User Group" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to [email protected]. >> To view this discussion on the web visit >> https://groups.google.com/d/msgid/dealii/da62c7c6-6328-e385-7029-afe5edf43ecd%40colostate.edu >> . >> >> >> -- The deal.II project is located at http://www.dealii.org/ For mailing list/forum options, see https://groups.google.com/d/forum/dealii?hl=en --- You received this message because you are subscribed to the Google Groups "deal.II User Group" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/dealii/7dacc37b-6e04-4bb1-ac30-27e68ee910afn%40googlegroups.com.
