Hello!
On Oct 30, 2016, at 8:33 AM, Thomas Roth wrote:
> Hi all,
>
> we have a large number of files that show ??? on 'ls' and give the error
> "Cannot allocate memory".
> The corresponding error on the OSS is
> "lvbo_init failed for resource ... rc = -2"
>
> This seems similar to LU-5457 (although the OSTs do not go into disconn
> state).
> Our filesystem is on Lustre 2.5.3, zfs 0.6.3, from the start. So per Oleg's
> explanation,
> "this could be fallout from earlier sync failures where OST announced it
> created some objects, failed to sync that to disk and then after dying and
> restarting the objects that were handed out by MDTs out of this pool are no
> longer there"
>
> The affected OSTs are evenly distributed, however.
> Finding the creation time of those files is difficult at best, but I am not
> aware of any series of crashes of so many OSSes in the recent months.
> And how can this happen with ZFS-OSTs? Should this be possible so easily?
First of all, 2.5.3 is kind of old.
The error itself (rc = -2, i.e. -ENOENT) means that you have a file on the
MDS, but no corresponding objects on the OSTs.
The explanation in LU-5457 is just one possible scenario; there might be
others that cause the objects to be deleted.
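
As a small aside, the "rc = -2" in the OSS message is a negated errno value,
and it can be decoded with the standard errno tables. The snippet below is
only a sketch; the helper name describe_rc is made up for illustration and
is not part of any Lustre tool.

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Translate a negative kernel-style return code into its errno name.

    describe_rc is a hypothetical helper; it only consults the errno
    tables of the machine it runs on.
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# rc = -2 from the lvbo_init message maps to ENOENT (no such object).
print(describe_rc(-2))
```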
Is there a pattern to the files? I.e., were all such files created at
around the same time? (If you cannot tell just by the filename/location,
you might use debugfs, or whatever the zfs equivalent is, to look at the
inode modification time.)
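
Once you have timestamps for the affected files, a quick way to see whether
they cluster is to count them per day. The sketch below assumes you have
already collected (path, epoch seconds) pairs by whatever means works on
your MDT; the function name files_per_day and the sample paths are
hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def files_per_day(records):
    """Count affected files per UTC calendar day.

    records: iterable of (path, epoch_seconds) pairs, assumed to have been
    gathered beforehand (this function does not touch the filesystem).
    """
    counts = Counter()
    for _path, ts in records:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        counts[day] += 1
    return counts


# Made-up sample data: two files from one day, one from another.
sample = [
    ("/lustre/a", 1470009600),
    ("/lustre/b", 1470013200),
    ("/lustre/c", 1475280000),
]
for day, n in sorted(files_per_day(sample).items()):
    print(day, n)
```

A few days with large counts would point at specific incidents; a flat
spread over months would suggest something ongoing.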
If they are distributed in time across different OSTs, but localised in
time for each individual OST, it might be a good idea to check the OST logs
from that period.
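
To check the per-OST case, the same kind of data can be reduced to one time
window per OST, which tells you which stretch of each OSS log to read. This
is again only a sketch: it assumes you already have (OST index, epoch
seconds) pairs for the affected files, and window_per_ost is a made-up name.

```python
def window_per_ost(records):
    """Return {ost_index: (earliest, latest, count)} for the affected files.

    records: iterable of (ost_index, epoch_seconds) pairs, assumed to have
    been collected beforehand.
    """
    windows = {}
    for ost, ts in records:
        if ost in windows:
            lo, hi, n = windows[ost]
            windows[ost] = (min(lo, ts), max(hi, ts), n + 1)
        else:
            windows[ost] = (ts, ts, 1)
    return windows


# Made-up sample: OST 3 has two files close together, OST 7 has one.
for ost, (lo, hi, n) in sorted(window_per_ost([(3, 100), (3, 160), (7, 5000)]).items()):
    print(f"OST {ost}: {n} file(s) between {lo} and {hi}")
```

A narrow window on an OST with many affected files is the period worth
looking at in that OSS's logs.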
Bye,
Oleg