Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the 
following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11710



(In reply to comment #60)
> Created an attachment (id=10547)
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the 
following link:
 --> (https://bugzilla.lustre.org/attachment.cgi?id=10547&action=view)
> handling recoverable error. v2

Thanks Alexey, I had a chance to review the latest proposed patch and DLD.
In general I like the refined approach, in particular breaking the 
recoverable errors in two levels at the rpc and osc layer is nice.
Plus the optimization of tracking the readonly state in the import is
nice, but the side effect of having a client evicted is it a higher
price than I think we're willing to pay.  After auditing the patch 
I still have other concerns.

1) Minor: The proposed patch is against 1.6 not 1.4, so I have not yet tried
the patch I've only read through it.  It will need to be backported to
1.4 because this is serious flaw for 1.4 based release as well.

2) I'm not thrilled with how fatal errors are now handled by simply
flagging the page in error in the client page cache.  This approach
is simply wasteful of the client cache and to my paranoid mind a little
bit dangerous for client cache cohearency.  There's no reason to leave 
this page around, it really needs to be discarded as soon as possible. 

Perhaps the best solution which now occurs to me is to keep a list of
these pages as was done in LLNL patch.  But instead of adding another
ll_ap_truncate() method (which was clunky) we could integrate this cache
reclamation in to llap_shrink_cache().  This function would have access 
to this fatal page list and all pages in error could be disposed of first 
before touching any other pages.  We could then simply make sure
llap_shrunk_cache() is called from brw_interpret_oap() after 
osc_ap_completion() has run on all the pages.

We also need to add a multinode test to ensure clients mounted on
different physcal nodes see the exact same behavior in the fatal
and recoverable case.  Consistency is very import to us.

3) With this patch ll_ap_completion cannot fail, we should take a look
at the other users such as obdecho/liblustre and see if it makes sense 
to make this a void return.  If they can't fail either then the
conditional check in osc_ap_completion() is just pointless.


I'm sure we'll give this a spin at LLNL soon.  But just as a heads up there
is a good chance we'll stick with our local patch for our 1.4.8 based release.
We have many weeks of successful testing at scale with this patch and our
deadline for a getting a new production release ready is almost here.

_______________________________________________
Lustre-devel mailing list
Lustre-devel@clusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-devel

Reply via email to