On Apr 24, 2007 08:42 -0600, Daniel Leaberry wrote:
We're running 1.6b7 and have noticed the following two problems. I'm
wondering if they're correlated.
1. We get files that are 0 bytes. They have nothing in them.
This may or may not be related to the recent bug 12181 problem.
That bug will be fixed in 1.6.0+ and 1.4.10.1 and 1.4.11+.
It can also happen if the clients are evicted while they are
writing to the file.
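
To get a list of the affected files, something like the following could
be run against the mounted filesystem. This is only a minimal sketch:
the function name is made up for illustration, and it simply reports
regular files whose size is 0 bytes, whatever the cause.

```python
import os
import stat

def find_empty_files(root):
    """Yield paths of regular files under root whose size is 0 bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
            if stat.S_ISREG(st.st_mode) and st.st_size == 0:
                yield path
```

Comparing the modification times of the files it finds against the
eviction messages on the MDS may help show whether the two line up.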
I figured out why this happened, but I'm not sure my explanation is
valid. We run Lustre as more of a general-purpose filesystem, though
usually with larger files. We use autofs to mount and unmount
filesystems, with the timeout set to 120 seconds (after that much
inactivity the filesystem is unmounted).
On a particular machine that was being accessed infrequently and with
small files, what I think happened is this: a batch of XML files would
be written and the metadata would be created on the MDS (hence the
zero-byte files), but because Lustre tries to optimize its RPCs for 1MB
I/Os and the client caches writes, the data wouldn't be written to the
OSTs. Then autofs would unmount the filesystem without flushing the
write buffers (which doesn't make sense), and a few minutes later I
would get a client evicted message on the MDS. Since the client was
evicted, all caches were flushed and the data was lost.
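
If the data really is sitting in the client cache when the unmount
happens, one possible workaround is for the writer to force it out
itself. The sketch below is only an illustration, not a confirmed fix:
the function name is hypothetical, and it assumes that fsync() on a
Lustre client pushes the dirty pages for that file out to the OSTs
before returning.

```python
import os

def write_durably(path, data):
    """Write bytes to path and fsync before closing.

    Assumption: on a Lustre client, fsync() sends the file's dirty
    pages to the OSTs, so nothing is left only in the client cache.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to write dirty pages out
```

This costs a synchronous round trip per file, so it trades some small-file
write throughput for not depending on when the cache gets written back.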
I'm not sure why autofs unmounting the filesystem wouldn't flush the
buffers, and I'm also not sure why unmounting doesn't seem to inform the
MDS that the client is leaving. I know Lustre probably isn't expecting
to be mounted and unmounted every 5 minutes, but is this expected
behavior?
2. We get these errors across our 30 nodes:
LustreError: 7030:0:(dir.c:330:ll_readdir()) error reading dir
167108765/2378987153 page 13: rc -5
LustreError: 7029:0:(dir.c:330:ll_readdir()) error reading dir
171699532/2388399554 page 9: rc -5
LustreError: 7027:0:(dir.c:330:ll_readdir()) error reading dir
171403580/2387428410 page 2: rc -5
LustreError: 6990:0:(dir.c:330:ll_readdir()) error reading dir
171011300/2386583645 page 8: rc -5
LustreError: 7027:0:(dir.c:330:ll_readdir()) error reading dir
172286916/2390172901 page 13: rc -5
LustreError: 6990:0:(dir.c:330:ll_readdir()) error reading dir
172030180/2388919021 page 13: rc -5
LustreError: 7027:0:(dir.c:330:ll_readdir()) error reading dir
172321971/2390308492 page 3: rc -5
LustreError: 7027:0:(dir.c:330:ll_readdir()) error reading dir
163603484/1208913504 page 8: rc -5
LustreError: 6990:0:(dir.c:330:ll_readdir()) error reading dir
172748079/2390802528 page 13: rc -5
LustreError: 9133:0:(dir.c:330:ll_readdir()) error reading dir
172818070/2390892206 page 2: rc -5
LustreError: 9171:0:(dir.c:330:ll_readdir()) error reading dir
168359805/2380837293 page 8: rc -5
LustreError: 9187:0:(dir.c:330:ll_readdir()) error reading dir
163706128/1209056171 page 7: rc -5
LustreError: 9199:0:(dir.c:330:ll_readdir()) error reading dir
165116087/1211142674 page 0: rc -5
LustreError: 9217:0:(dir.c:330:ll_readdir()) error reading dir
162005170/1206582728 page 12: rc -5
LustreError: 9216:0:(dir.c:330:ll_readdir()) error reading dir
162686166/1207618778 page 12: rc -5
LustreError: 6990:0:(dir.c:330:ll_readdir()) error reading dir
163079284/1208141145 page 3: rc -5
These are reporting I/O errors while reading directories from the MDS.
This isn't a problem I've seen before, so it's hard to say what the
root cause is.
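
To see whether the errors cluster on particular nodes, processes, or
directories, the messages above can be tallied with a short script. This
is a rough sketch under some assumptions: the regular expression is
written only against the message format shown in this thread, and
reading the two numbers after "dir" as inode and generation is a guess,
not something confirmed here. The helper names are hypothetical.

```python
import errno
import re
from collections import Counter

# Matches the ll_readdir messages quoted above, including the case
# where the mailer wrapped the line after "error reading dir".
PATTERN = re.compile(
    r"LustreError: (\d+):\d+:\(dir\.c:\d+:ll_readdir\(\)\)\s+"
    r"error reading dir\s+(\d+)/(\d+) page (\d+): rc (-?\d+)"
)

def parse_readdir_errors(text):
    """Return (pid, dir_ino, dir_gen, page, rc) tuples found in text.

    Assumption: the two numbers after "dir" are inode/generation.
    """
    return [tuple(int(g) for g in m.groups())
            for m in PATTERN.finditer(text)]

def errno_name(rc):
    """Map a negative kernel return code to its errno name."""
    return errno.errorcode.get(-rc, "unknown")

def count_by_pid(rows):
    """Count how many errors each reporting process logged."""
    return Counter(row[0] for row in rows)
```

Here errno_name(-5) gives "EIO", which matches the reading of these as
I/O errors. Running the parser over the logs from each of the 30 nodes
would show whether a few clients account for most of the messages.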
Is it possible the clients are just messed up, especially since I get
no errors on the MDS? I suppose this might be due to autofs mounting
and unmounting the filesystem so many times.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.