Dear Roy,

On Tue, 15 Apr 2008, Roy Stogner wrote:

>> Question #2: Does one of you guys have an idea why this could happen
>> (other than my application code doing stupid things)?
>
> Not a clue.  Even if your application code tried to renumber ids
> itself, this sort of crash could happen in opt mode but it should at
> least hit a useful assert in devel mode.

Ah, there we are. I always forget that asserts do not fire in optimized mode. (Running in debug mode requires recompiling both libMesh and our institute software, since they share the same macros, so I avoid it whenever possible.)

>> Question #3: Would you mind adding an
>>
>> assert(request_to_fill[i]==requested->id());
>>
>> at this place, commit it to the repository, and check what that does
>> to your applications?
>
> I'd rather not commit it to the repository yet (since in theory it
> should be redundant) but I'll try it out on the examples and on my own
> apps just in case the theory is incorrect.  Have you tried this assert
> in your own code yet?  Does it trigger?

No, see above. (Instead of using an assert, I printed a message to stdout whenever these numbers did not coincide, and that did trigger.)

>> Question #4: Will I want to try to create a simple example that
>> reproduces the problem, although this might take me quite a long time?
>> (This question is rhetorical.)
>
> Unfortunately so.  Parallel debugging is hard enough when you've got
> the breaking app right in front of you; it's practically impossible by
> proxy.

You're right. And I must admit that I found out that it was my fault. You might want to know what I did, so I'll tell you:

At some place in my code, I forgot to ALLREDUCE some important quantities. This led to inconsistent refinement flagging of the grid across the different processors. Hence, after the refinement, the processors were working on different grids.
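
[Editorial sketch] The failure mode described above can be illustrated without MPI. The following is only a serial simulation under stated assumptions: the thread does not say which quantities were left unreduced, so the "maximum error indicator used as a refinement threshold" is a hypothetical choice, and the names RankData, local_max, global_max, thresholds and all_equal are invented for this illustration. global_max plays the role that an MPI_Allreduce with MPI_MAX would play in a real parallel code.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for one MPI rank: the error indicators of the
// elements that rank owns (assumed to be non-negative).
using RankData = std::vector<double>;

// Largest indicator a single rank can see on its own.
double local_max(const RankData& indicators) {
    double m = 0.0;
    for (double v : indicators) m = std::max(m, v);
    return m;
}

// Serial stand-in for an all-reduce with a max operation: every rank
// would receive this same value.
double global_max(const std::vector<RankData>& ranks) {
    double m = 0.0;
    for (const RankData& r : ranks) m = std::max(m, local_max(r));
    return m;
}

// Refinement threshold each rank would use. With reduce == false every
// rank only uses its own maximum (the assumed bug); with reduce == true
// every rank uses the shared global maximum.
std::vector<double> thresholds(const std::vector<RankData>& ranks,
                               double fraction, bool reduce) {
    assert(fraction > 0.0 && fraction <= 1.0);
    const double gmax = global_max(ranks);
    std::vector<double> result;
    result.reserve(ranks.size());
    for (const RankData& r : ranks)
        result.push_back(fraction * (reduce ? gmax : local_max(r)));
    return result;
}

// True if all ranks ended up with the same value.
bool all_equal(const std::vector<double>& values) {
    return std::adjacent_find(values.begin(), values.end(),
                              std::not_equal_to<double>()) == values.end();
}
```

Calling all_equal(thresholds(ranks, fraction, false)) on data where the ranks own different maxima reports a mismatch, while the reduced variant does not. A check of this kind, written as an ordinary if-statement rather than an assert, would also stay active in optimized builds.
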
Then, processor #0 asked processor #1 to send some information about grid node #49210, but on processor #1 the grid contained only 49200 nodes, so looking up node #49210 gave some random result.

I wonder why my application code ran at all in earlier days. That was indeed on a different cluster, but it was nevertheless a cluster.

Anyway, thank you for listening. (-:

Best Regards,

Tim

-- 
Dr. Tim Kroeger
Phone +49-421-218-7710
Fax +49-421-218-4236

MeVis Research GmbH, Universitaetsallee 29, 28359 Bremen, Germany

Amtsgericht Bremen HRB 16222
Geschaeftsfuehrer: Prof. Dr. H.-O. Peitgen

_______________________________________________
Libmesh-users mailing list
https://lists.sourceforge.net/lists/listinfo/libmesh-users
