Hello

We've been seeing strange performance behavior with random I/O
on large files on a Lustre FS. Trying to understand it, I took a deeper
look at the 'i_blksize' value used by Lustre, particularly in the
ll_getattr_it() call.

The behavior is: some of our tools issue a very large number of
fseek()/fread(a few bytes) calls on large files. Those tools perform very
poorly compared with what we would expect from Lustre's usual
performance on the same configuration.


f = fopen()
pos = somewhere
do
  fseek(f, pos)
  fread(10 bytes, f)
  pos = somewhere_else
loop


Analyzing the behavior, it appears that for each small fread(), glibc
actually issued a 2MB read. This leads to poor performance as seen from
the program: even though the global Lustre throughput was good, only a
few bytes of each 2MB read were useful.
So why does glibc read so much data when the binary requested only a few
bytes? It seems that the library takes the 'st_blksize' value returned by
stat() as the optimal I/O block size and uses it as its buffer size. Yet
this value is computed by Lustre using:

[lustre/llite/llite_lib.c:1567]
inode->i_blkbits = min(PTLRPC_MAX_BRW_BITS+1, LL_MAX_BLKSIZE_BITS);

which is quite a large value (2MB). As our tool does many fseek() calls,
this buffer is useless and is discarded between each read.
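
To check what a client actually sees, a minimal sketch along these lines
can help. It assumes the stdio buffer size follows the st_blksize field
reported by stat(), which is where glibc takes its default buffer size
on Linux. The function names preferred_blksize() and
read_amplification() are made up for this illustration.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Preferred I/O block size the kernel reports for `path`, or -1 on error.
 * Assumption: the C library sizes a FILE's default buffer from this field. */
long preferred_blksize(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_blksize;
}

/* Upper bound on bytes transferred per useful byte when every fread() of
 * `wanted` bytes forces a refill of a `bufsize`-byte stdio buffer. */
double read_amplification(long bufsize, long wanted)
{
    return (double)bufsize / (double)wanted;
}
```

With a 2MB buffer and 10-byte reads, read_amplification(2L * 1024 * 1024, 10)
is about 209715, i.e. roughly 200,000 bytes moved for each byte the tool
actually uses.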

We made a small patch that turns the value returned by ll_getattr_it()
(which fills the stat struct) into a module parameter. Performance is
really good with it when the value is reduced to 4KB.
Indeed, in those cases Lustre manages and buffers the I/O much better
itself than glibc does. So we are wondering whether it would be a good
idea to reduce the default value, or to use something like a module
parameter to tune it depending on the kind of I/O on the client.
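
For comparison, the same hypothesis can be tested from the application
side, without patching the client, by giving the stream a small buffer
with setvbuf(). This is only a sketch and not the patch described above;
fopen_with_buffer() and read_at() are hypothetical helper names. An
explicit buffer is passed because, as far as I know, glibc ignores the
requested size when setvbuf() is called with a NULL buffer.

```c
#include <stdio.h>

/* Open `path` for reading and make stdio use the caller's `buf` of
 * `bufsize` bytes instead of the default st_blksize-sized buffer.
 * `buf` must stay valid until fclose(). Returns NULL on error. */
FILE *fopen_with_buffer(const char *path, char *buf, size_t bufsize)
{
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return NULL;
    /* setvbuf() has to come before any other operation on the stream. */
    if (setvbuf(f, buf, _IOFBF, bufsize) != 0) {
        fclose(f);
        return NULL;
    }
    return f;
}

/* One iteration of the fseek()/fread() loop: read `len` bytes at `offset`.
 * Returns the number of bytes read, or -1 if the seek failed. */
long read_at(FILE *f, long offset, void *out, size_t len)
{
    if (fseek(f, offset, SEEK_SET) != 0)
        return -1;
    return (long)fread(out, 1, len, f);
}
```

A tool would call fopen_with_buffer() once with, say, a 4KB buffer, then
call read_at() for each position, so that each refill costs about 4KB
rather than 2MB.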

What do you think about this?

--
Aurelien Degremont

_______________________________________________
Lustre-devel mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-devel