I am afraid I am not experienced enough to be much more useful.

My guess is that, since the client writes synchronously to all nodes (to
keep the data coherent), it only goes as fast as the slowest brick.
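
The ifstat samples quoted further down seem consistent with that. As a rough sketch (the variable names and the `mean` helper are mine, and the numbers are simply copied from the quoted output), averaging them shows how much more the client sends than a single server receives:

```python
# ifstat samples copied from the quoted message further down (KB/s).
client_out = [1.09e6, 1.05e6, 1.06e6, 1.09e6]          # node056, ib0 "KB/s out"
server_in = [561818.1, 560020.3, 526337.1, 513972.7]   # lucifer, ib0 "KB/s in"

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# How many times more data the client sends than one server receives.
ratio = mean(client_out) / mean(server_in)

print(f"client out: {mean(client_out):,.0f} KB/s")
print(f"server in:  {mean(server_in):,.0f} KB/s")
print(f"client/server ratio: {ratio:.2f}")
```

The ratio comes out close to 2, which is what I would expect if the client pushes every byte to two bricks at once. That is an inference from these four samples, not something I checked against the volume layout.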

Small files are also often slow because the TCP window doesn't have time to
grow.
That's why I gave you some kernel tuning, to help the TCP window grow
faster.
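
As a rough cross-check of the small-file effect, the split timings quoted below can be turned into throughput and time per file. This is only a sketch: it assumes every chunk was written to the Gluster mount, it uses the byte counts printed by dd and the "real" times from the quoted runs, and the helper name `summarize` is mine.

```python
import math

# (master file size in bytes, chunk size given to "split -b", "real" seconds),
# all copied from the quoted runs below.
runs = [
    (1_048_576_000, 1_000_000, 42.841),
    (1_048_576_000, 5_000_000, 17.801),
    (1_048_576_000, 10_000_000, 9.686),
    (1_048_576_000, 20_000_000, 9.717),
    (10_485_760_000, 10_000_000, 103.216),
]

def summarize(total_bytes, chunk_bytes, seconds):
    """Return (number of files, MB/s, milliseconds per file) for one run."""
    files = math.ceil(total_bytes / chunk_bytes)  # the last chunk may be short
    mb_per_s = total_bytes / seconds / 1e6        # decimal MB, as dd reports
    ms_per_file = seconds / files * 1000
    return files, mb_per_s, ms_per_file

for total, chunk, secs in runs:
    files, rate, per_file = summarize(total, chunk, secs)
    print(f"{total / 1e9:5.1f} GB in {chunk // 1_000_000:>2} MB chunks: "
          f"{files:>5} files, {rate:6.1f} MB/s, {per_file:6.1f} ms/file")
```

On these numbers the throughput goes from roughly 24 MB/s with 1 MB chunks to roughly 108 MB/s with 10 MB and 20 MB chunks, so a cost paid per file, rather than per byte, looks like a plausible explanation.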

Are you using the latest version (3.7.1)?


Regards,
Mathieu CHATEAU
http://www.lotp.fr

2015-06-20 11:01 GMT+02:00 Geoffrey Letessier <[email protected]>:

> Hello Mathieu,
>
> Thanks for replying.
>
> I had never noticed throughput like this before (around 1 GB/s for one
> big file). With a "big" set of small files, performance was never great,
> but it was not as bad as it is today.
>
> The problem seems to depend exclusively on the size of each file.
> Proof:
> [root@node056 tmp]# dd if=/dev/zero of=masterfile bs=1M count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 2.09139 s, 501 MB/s
> [root@node056 tmp]# time split -b 1000000 -a 12 masterfile  # 1MB per file
>
> real 0m42.841s
> user 0m0.004s
> sys 0m1.416s
> [root@node056 tmp]# rm -f xaaaaaaaaa* && sync
> [root@node056 tmp]# time split -b 5000000 -a 12 masterfile  # 5MB per file
>
> real 0m17.801s
> user 0m0.008s
> sys 0m1.396s
> [root@node056 tmp]# rm -f xaaaaaaaaa* && sync
> [root@node056 tmp]# time split -b 10000000 -a 12 masterfile  # 10MB per file
>
> real 0m9.686s
> user 0m0.008s
> sys 0m1.451s
> [root@node056 tmp]# rm -f xaaaaaaaaa* && sync
> [root@node056 tmp]# time split -b 20000000 -a 12 masterfile  # 20MB per file
>
> real 0m9.717s
> user 0m0.003s
> sys 0m1.399s
> [root@node056 tmp]# rm -f xaaaaaaaaa* && sync
> [root@node056 tmp]# time split -b 1000000 -a 12 masterfile  # 1MB per file
>
> real 0m40.283s
> user 0m0.007s
> sys 0m1.390s
> [root@node056 tmp]# rm -f xaaaaaaaaa* && sync
>
> The larger the generated files, the better the performance (I/O throughput
> and running time). The ifstat output is consistent on both the client/node
> and server sides.
>
> A new test:
> [root@node056 tmp]# dd if=/dev/zero of=masterfile bs=1M count=10000
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB) copied, 23.0044 s, 456 MB/s
> [root@node056 tmp]# rm -f xaaaaaaaaa* && sync
> [root@node056 tmp]# time split -b 10000000 -a 12 masterfile  # 10MB per file
>
> real 1m43.216s
> user 0m0.038s
> sys 0m13.407s
>
>
> So the performance per file is the same (despite 10x more files).
>
> So I don't understand why I need to create files of 10MB or more to get
> the best performance.
>
> Here are my reconfigured volume options:
> performance.cache-max-file-size: 64MB
> performance.read-ahead: on
> performance.write-behind: on
> features.quota-deem-statfs: on
> performance.stat-prefetch: on
> performance.flush-behind: on
> features.default-soft-limit: 90%
> features.quota: on
> diagnostics.brick-log-level: CRITICAL
> auth.allow: localhost,127.0.0.1,10.*
> nfs.disable: on
> performance.cache-size: 1GB
> performance.write-behind-window-size: 4MB
> performance.quick-read: on
> performance.io-cache: on
> performance.io-thread-count: 64
> nfs.enable-ino32: off
>
> It’s not a local cache problem because:
> 1- it’s disabled in my mount command:
>    mount -t glusterfs -o transport=rdma,direct-io-mode=disable,enable-ino32 ib-storage1:vol_home /home
> 2- I also ran my tests while playing with /proc/sys/vm/drop_caches
> 3- I see the same ifstat output on both the client and server sides, which
> is consistent with the computed bandwidth (file sizes / time, taking the
> replication into account).
>
> I don’t think it is an InfiniBand network problem, but here are my
> [non-default] settings:
> connected mode with MTU set to 65520
>
> Do you agree with my impression? If so, do you have any other ideas?
>
> Thanks again, and thanks in advance,
> Geoffrey
> -----------------------------------------------
> Geoffrey Letessier
>
> IT manager & systems engineer
> CNRS - UPR 9080 - Laboratoire de Biochimie Théorique
> Institut de Biologie Physico-Chimique
> 13, rue Pierre et Marie Curie - 75005 Paris
> Tel: 01 58 41 50 93 - eMail: [email protected]
>
> On 20 June 2015 at 09:12, Mathieu Chateau <[email protected]>
> wrote:
>
> Hello,
>
> For the replicated one, is it a new issue, or did you just not notice it
> before? Is it the same baseline as before?
>
> I also see slowness with small files/many files.
>
> So far I have only been able to tune things with:
>
> At the Gluster level:
> gluster volume set myvolume performance.io-thread-count 16
> gluster volume set myvolume performance.cache-size 1GB
> gluster volume set myvolume nfs.disable on
> gluster volume set myvolume readdir-ahead enable
> gluster volume set myvolume read-ahead disable
>
> At the network level (client and server; I don't have InfiniBand):
> sysctl -w vm.swappiness=0
> sysctl -w net.core.rmem_max=67108864
> sysctl -w net.core.wmem_max=67108864
> # increase Linux autotuning TCP buffer limit to 32MB
> sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"
> sysctl -w net.ipv4.tcp_wmem="4096 65536 33554432"
> # increase the length of the processor input queue
> sysctl -w net.core.netdev_max_backlog=30000
> # recommended default congestion control is htcp
> sysctl -w net.ipv4.tcp_congestion_control=htcp
>
> But it's still really slow, even if it's better.
>
> Regards,
> Mathieu CHATEAU
> http://www.lotp.fr
>
> 2015-06-20 2:34 GMT+02:00 Geoffrey Letessier <[email protected]>:
>
>> Re,
>>
>> For comparison, here is the output of the same script run on a
>> distributed-only volume (2 of the 4 servers previously described, 2 bricks
>> each):
>> #######################################################
>> ################  UNTAR time consumed  ################
>> #######################################################
>>
>>
>> real 1m44.698s
>> user 0m8.891s
>> sys 0m8.353s
>>
>> #######################################################
>> #################  DU time consumed  ##################
>> #######################################################
>>
>> 554M linux-4.1-rc6
>>
>> real 0m21.062s
>> user 0m0.100s
>> sys 0m1.040s
>>
>> #######################################################
>> #################  FIND time consumed  ################
>> #######################################################
>>
>> 52663
>>
>> real 0m21.325s
>> user 0m0.104s
>> sys 0m1.054s
>>
>> #######################################################
>> #################  GREP time consumed  ################
>> #######################################################
>>
>> 7952
>>
>> real 0m43.618s
>> user 0m0.922s
>> sys 0m3.626s
>>
>> #######################################################
>> #################  TAR time consumed  #################
>> #######################################################
>>
>>
>> real 0m50.577s
>> user 0m29.745s
>> sys 0m4.086s
>>
>> #######################################################
>> #################  RM time consumed  ##################
>> #######################################################
>>
>>
>> real 0m41.133s
>> user 0m0.171s
>> sys 0m2.522s
>>
>> The performance is amazingly different!
>>
>> Geoffrey
>>
>> On 20 June 2015 at 02:12, Geoffrey Letessier <[email protected]>
>> wrote:
>>
>> Dear all,
>>
>> I just noticed that I/O operations on the main volume of my HPC cluster
>> have become extremely slow.
>>
>> When running some file operations on a compressed Linux kernel source
>> archive (roughly 80MB, with about 52,000 files inside), the untar
>> operation alone can take more than half an hour, as you can see below:
>> #######################################################
>> ################  UNTAR time consumed  ################
>> #######################################################
>>
>>
>> real 32m42.967s
>> user 0m11.783s
>> sys 0m15.050s
>>
>> #######################################################
>> #################  DU time consumed  ##################
>> #######################################################
>>
>> 557M linux-4.1-rc6
>>
>> real 0m25.060s
>> user 0m0.068s
>> sys 0m0.344s
>>
>> #######################################################
>> #################  FIND time consumed  ################
>> #######################################################
>>
>> 52663
>>
>> real 0m25.687s
>> user 0m0.084s
>> sys 0m0.387s
>>
>> #######################################################
>> #################  GREP time consumed  ################
>> #######################################################
>>
>> 7952
>>
>> real 2m15.890s
>> user 0m0.887s
>> sys 0m2.777s
>>
>> #######################################################
>> #################  TAR time consumed  #################
>> #######################################################
>>
>>
>> real 1m5.551s
>> user 0m26.536s
>> sys 0m2.609s
>>
>> #######################################################
>> #################  RM time consumed  ##################
>> #######################################################
>>
>>
>> real 2m51.485s
>> user 0m0.167s
>> sys 0m1.663s
>>
>> For information, this volume is a distributed-replicated one made up of
>> 4 servers with 2 bricks each. Each brick is a 12-drive RAID6 vdisk with
>> good native performance (around 1.2 GB/s).
>>
>> In comparison, when I use dd to generate a 100GB file on the same volume,
>> my write throughput is around 1 GB/s (client side) and 500 MB/s (server
>> side) because of replication:
>> Client side:
>> [root@node056 ~]# ifstat -i ib0
>>        ib0
>>  KB/s in  KB/s out
>>  3251.45  1.09e+06
>>  3139.80  1.05e+06
>>  3185.29  1.06e+06
>>  3293.84  1.09e+06
>> ...
>>
>> Server side:
>> [root@lucifer ~]# ifstat -i ib0
>>        ib0
>>  KB/s in  KB/s out
>> 561818.1   1746.42
>> 560020.3   1737.92
>> 526337.1   1648.20
>> 513972.7   1613.69
>> ...
>>
>> DD command:
>> [root@node056 ~]# dd if=/dev/zero of=/home/root/test.dd bs=1M count=100000
>> 100000+0 records in
>> 100000+0 records out
>> 104857600000 bytes (105 GB) copied, 202.99 s, 517 MB/s
>>
>> So this issue doesn’t seem to come from the network (which is InfiniBand
>> in this case).
>>
>> Attached you will find a set of files:
>> - mybench.sh: the bench script
>> - benches.txt: output of my "bench"
>> - profile.txt: gluster volume profile during the "bench"
>> - vol_status.txt: gluster volume status
>> - vol_info.txt: gluster volume info
>>
>> Can someone help me fix it? It is very critical because this volume is
>> on a production HPC cluster.
>>
>> Thanks in advance,
>> Geoffrey
>>
>>
>>
>> _______________________________________________
>> Gluster-users mailing list
>> [email protected]
>> http://www.gluster.org/mailman/listinfo/gluster-users
>>
>
>
>