On 2011-02-25, at 6:28, "Brian J. Murrell" <[email protected]> wrote:
> On 11-02-25 06:18 AM, Francois  wrote:
>> 
>> I continue to parse debug logs and keep them posted.
> 
> I don't understand why you don't just fix your application to handle a
> perfectly valid and expected condition (that it's currently not
> handling) instead of wasting time trying to find the cause of the
> expected condition.  Even if you find it, it's likely not a bug and not
> something that can/will be fixed.  It's your application that needs to
> be fixed.

In all fairness Brian, it isn't always possible to fix an application as you 
suggest. It might be commercial (binary only), it might be complex code using 
3rd party libraries to do the IO that would lose support if modified, etc. 

I think the first action to debug this is to run on the client with "lctl 
set_param debug=+trace" or "=~0" which will enable function entry/exit tracing 
in Lustre. Then, when the problem is hit, run "lctl dk /tmp/debug" to dump the 
Lustre debug log, and search for -4 (which is -EINTR) to see where this error 
is first appearing. 
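
For the search step, something like the following might help. It is only a 
rough sketch: it assumes the dump is a plain text file at /tmp/debug and that 
the return code shows up as a standalone "-4" token somewhere in the line. The 
names first_eintr_lines and scan_debug_log are made up for this example and 
are not part of any Lustre tool, and the pattern makes no assumption about the 
exact layout of the trace lines.

```python
import re

# Matches "-4" as a standalone token, so "-40", "x-4" or "--4" do not match.
EINTR_TOKEN = re.compile(r"(?<![\w-])-4(?!\d)")


def first_eintr_lines(lines, limit=5):
    """Return the first `limit` (line_number, text) pairs that mention -4."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if EINTR_TOKEN.search(line):
            hits.append((number, line.rstrip("\n")))
            if len(hits) >= limit:
                break
    return hits


def scan_debug_log(path="/tmp/debug", limit=5):
    """Scan a dumped debug log (path is an assumption) for early -4 hits."""
    # errors="replace" keeps the scan going if the dump has odd bytes in it.
    with open(path, errors="replace") as handle:
        return first_eintr_lines(handle, limit)
```

Calling scan_debug_log() after the dump and printing the pairs it returns 
should show the earliest candidate lines. A bare "-4" can also match unrelated 
values, so the hits still need to be read in context.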

At that point we can determine where the source of the error is, and whether 
it is Lustre's fault. I know at one time there was a related problem in the 
l_wait_event() macro that was improperly masking signals, but I thought it was 
fixed by 1.8.5. 

Cheers, Andreas
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
