On Wed, 16 Apr 2008, Tim Kroeger wrote:

> Ah, there we are.  I always forget that asserts do not hit in optimized mode. 
> (Running in debug mode requires recompiling both libMesh and our institute 
> software, since they share the same macros, so I avoid it whenever possible.)

Understandable; libMesh takes quite a while to compile, I'm told.  I
do library development in a lab with distcc on all the computers and
"make -j 30" in my shell scripts, so I'm afraid I usually don't
notice and I've got little incentive to improve things.  ;-)

>> Unfortunately so.  Parallel debugging is hard enough when you've got
>> the breaking app right in front of you; it's practically impossible by
>> proxy.
>
> You're right.  And I must admit that I found out that it was my fault.

That's reassuring; thanks for letting us know.  The twin terror of
"this apparently straightforward code is behaving in impossible ways"
and "I might have inadvertently broken SerialMesh code" wasn't good.

> You might want to know what I did, so I'll tell you:
>
> At some place in my code, I forgot to ALLREDUCE some important quantities. 
> This led to an inconsistent refinement flagging of the grid over the 
> different processors.  Hence, after the refinement, the processors were 
> working on different grids.

That would do it.
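
To illustrate the failure mode described above, here is a minimal sketch with no MPI and no libMesh calls. It is a hypothetical scenario, not Tim's actual code: it assumes each processor holds error indicators only for its own elements (zero elsewhere) and flags every element whose error exceeds a fraction of the maximum. The names simulated_allreduce_sum and flag_for_refinement are invented for this example; in a real code the summation loop would be a single MPI_Allreduce with MPI_SUM.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the collective call: sums the per-processor partial vectors.
// With MPI this would be one MPI_Allreduce(..., MPI_DOUBLE, MPI_SUM, comm).
std::vector<double>
simulated_allreduce_sum(const std::vector<std::vector<double>> & per_rank)
{
  assert(!per_rank.empty());
  std::vector<double> sum(per_rank.front().size(), 0.0);
  for (const std::vector<double> & local : per_rank)
    {
      assert(local.size() == sum.size());
      for (std::size_t i = 0; i < sum.size(); ++i)
        sum[i] += local[i];
    }
  return sum;
}

// Hypothetical flagging rule: refine every element whose error exceeds
// `fraction` times the largest error in the vector it is given.
std::vector<bool>
flag_for_refinement(const std::vector<double> & error, double fraction)
{
  assert(!error.empty());
  const double threshold =
    fraction * *std::max_element(error.begin(), error.end());
  std::vector<bool> flags(error.size());
  for (std::size_t i = 0; i < error.size(); ++i)
    flags[i] = error[i] > threshold;
  return flags;
}
```

With partial vectors {0.9, 0.1, 0, 0} on one processor and {0, 0, 0.4, 0.1} on the other, and fraction 0.5, flagging from the unreduced data marks element 0 on the first processor and element 2 on the second. Flagging from the summed vector marks only element 0 on both, so the meshes stay identical after refinement.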

I wonder if we should do something to help catch such problems even in
opt mode.  Testing a whole SerialMesh for consistency might be
overkill for an "optimized" code, but we might take one message's
latency hit to at least make sure that n_nodes() and n_elem() are the
same after a completed refinement.  Thoughts, anyone?
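
A minimal sketch of what such a one-message check could look like follows. It is not existing libMesh code: MeshCounts, pack, simulated_allreduce_max and counts_consistent are hypothetical names, and the reduction is simulated with a loop so the example runs without MPI. The idea is to send each count together with its negative, so that one element-wise MAX reduction yields both the largest and the smallest value seen on any processor.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-processor summary; in libMesh these numbers would come
// from mesh.n_nodes() and mesh.n_elem().
struct MeshCounts
{
  long n_nodes;
  long n_elem;
};

// Pack each count with its negative: after an element-wise MAX, entries 0
// and 2 hold the maxima, and entries 1 and 3 hold the negated minima.
std::array<long, 4> pack(const MeshCounts & c)
{
  return {c.n_nodes, -c.n_nodes, c.n_elem, -c.n_elem};
}

// Stand-in for the single collective call.  With MPI this loop would be one
// MPI_Allreduce(..., 4, MPI_LONG, MPI_MAX, comm) on the packed buffer.
std::array<long, 4>
simulated_allreduce_max(const std::vector<MeshCounts> & per_rank)
{
  assert(!per_rank.empty());
  std::array<long, 4> result = pack(per_rank.front());
  for (const MeshCounts & c : per_rank)
    {
      const std::array<long, 4> local = pack(c);
      for (std::size_t i = 0; i < result.size(); ++i)
        result[i] = std::max(result[i], local[i]);
    }
  return result;
}

// The counts agree on every processor exactly when max == min for both.
bool counts_consistent(const std::array<long, 4> & reduced)
{
  return reduced[0] == -reduced[1] && reduced[2] == -reduced[3];
}
```

If every processor reports the same pair of counts, counts_consistent returns true; if any processor differs in either count, it returns false. Equal counts do not prove the meshes are identical, so this only catches the grosser inconsistencies, which matches the "one message's latency" budget.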

> I wonder why my application code used to run without problems.  That was 
> indeed on a different cluster, but it was nevertheless a cluster.

Parallelism bugs can depend on how many processors you're using or
simply on how the partitioning unfolds.  It's not fun to debug when
you catch a failure on 32 cores on a fine mesh that doesn't
immediately whittle down to 2 cores on a coarse mesh.
---
Roy

_______________________________________________
Libmesh-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/libmesh-users