Dear Roy,

On Tue, 15 Apr 2008, Roy Stogner wrote:

>> Question #2: Does one of you guys have an idea why this could happen
>> (other than my application code doing stupid things)?
>
> Not a clue.  Even if your application code tried to renumber ids
> itself, this sort of crash could happen in opt mode but it should at
> least hit a useful assert in devel mode.

Ah, there we are.  I always forget that asserts do not fire in 
optimized mode.  (Running in debug mode requires recompiling both 
libMesh and our institute software, since they share the same macros, 
so I avoid it whenever possible.)

>> Question #3: Would you mind adding an
>>
>>      assert(request_to_fill[i]==requested->id());
>> 
>> at this place, commit it to the repository, and check what that does
>> to your applications?
>
> I'd rather not commit it to the repository yet (since in theory it
> should be redundant) but I'll try it out on the examples and on my own
> apps just in case the theory is incorrect.  Have you tried this assert
> in your own code yet?  Does it trigger?

No, see above.  (Instead of using an assert, I printed a message to 
stdout whenever these numbers did not coincide, and that did trigger.)
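A check of that kind, which stays active in optimized builds because 
it does not depend on assert(), might look roughly like the sketch 
below.  The helper name count_id_mismatches and its two id vectors are 
hypothetical and not part of libMesh; the sketch only assumes that the 
requested ids and the ids of the objects actually found are available 
side by side.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical helper (not a libMesh function): compares the ids that
// were requested with the ids of the objects that were actually found.
// Unlike assert(), this check is not compiled out when NDEBUG is set.
std::size_t count_id_mismatches(const std::vector<unsigned int>& requested_ids,
                                const std::vector<unsigned int>& found_ids)
{
  const std::size_t common = std::min(requested_ids.size(), found_ids.size());
  std::size_t mismatches = 0;

  for (std::size_t i = 0; i < common; ++i)
    {
      if (requested_ids[i] != found_ids[i])
        {
          std::cout << "id mismatch at position " << i
                    << ": requested " << requested_ids[i]
                    << ", found " << found_ids[i] << std::endl;
          ++mismatches;
        }
    }

  // Entries without a counterpart in the other vector also count.
  mismatches += std::max(requested_ids.size(), found_ids.size()) - common;
  return mismatches;
}
```

A nonzero return value means that at least one lookup delivered an 
object other than the one that was asked for.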

>> Question #4: Will I want to try to create a simple example that
>> reproduces the problem, although this might take me quite a long time?
>> (This question is rhetorical.)
>
> Unfortunately so.  Parallel debugging is hard enough when you've got
> the breaking app right in front of you; it's practically impossible by
> proxy.

You're right.  And I must admit that it turned out to be my own fault. 
You might want to know what I did, so I'll tell you:

At one point in my code, I forgot to ALLREDUCE some important 
quantities.  This led to inconsistent refinement flagging of the 
grid across the different processors.  Hence, after the refinement, the 
processors were working on different grids.  Then, processor #0 asked 
processor #1 to send some information about grid node #49210, but 
processor #1 found that its grid contained only 49200 nodes, so that 
looking up node #49210 gave some random result.
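The effect of such a missing reduction can be sketched without MPI.  
The example below is a simplified simulation, not the actual 
application code: it assumes that every processor flags elements whose 
error exceeds a fraction of a maximum error, and that the maximum is 
the quantity that should have been reduced.  All function names are 
hypothetical.  In a real MPI code, the role of simulated_allreduce_max 
would be played by a call to MPI_Allreduce with MPI_MAX.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Flags every element whose error exceeds fraction * max_error.
std::vector<bool> flag_elements(const std::vector<double>& element_errors,
                                double max_error, double fraction)
{
  std::vector<bool> flags(element_errors.size());
  for (std::size_t i = 0; i < element_errors.size(); ++i)
    flags[i] = element_errors[i] > fraction * max_error;
  return flags;
}

// Stand-in for an all-reduce with a "max" operation: afterwards every
// rank holds the same value.  Assumes per_rank_values is not empty.
double simulated_allreduce_max(const std::vector<double>& per_rank_values)
{
  return *std::max_element(per_rank_values.begin(), per_rank_values.end());
}

// Returns true if all ranks compute identical refinement flags.
// With reduce_first == false, each rank uses only its local maximum,
// which mimics the forgotten reduction.
bool flags_consistent(const std::vector<double>& element_errors,
                      const std::vector<double>& per_rank_max,
                      double fraction, bool reduce_first)
{
  const double global_max = simulated_allreduce_max(per_rank_max);
  std::vector<bool> reference;

  for (std::size_t rank = 0; rank < per_rank_max.size(); ++rank)
    {
      const double max_error = reduce_first ? global_max : per_rank_max[rank];
      const std::vector<bool> flags =
        flag_elements(element_errors, max_error, fraction);

      if (rank == 0)
        reference = flags;
      else if (flags != reference)
        return false;  // this rank would refine a different set of elements
    }
  return true;
}
```

With element errors {0.1, 0.4, 0.9, 0.5}, local maxima {0.5, 0.9} and a 
fraction of 0.5, the two ranks disagree about the second element unless 
the maximum is reduced first, so they would end up refining different 
grids.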

I wonder why my application code ever ran correctly before.  That 
was on a different cluster, but it was a cluster nevertheless.

Anyway, thank you for listening. (-:

Best Regards,

Tim

-- 
Dr. Tim Kroeger                                        Phone +49-421-218-7710
[EMAIL PROTECTED], [EMAIL PROTECTED]  Fax   +49-421-218-4236

MeVis Research GmbH, Universitaetsallee 29, 28359 Bremen, Germany

Amtsgericht Bremen HRB 16222
Geschaeftsfuehrer: Prof. Dr. H.-O. Peitgen

_______________________________________________
Libmesh-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/libmesh-users
