This mail did not show up in the same thread as before because the subject line had an extra "?==?utf-8?q? ", so I thought it had not been answered and replied again. Sorry about that.
On Sat, Jun 3, 2017 at 1:45 AM, Xavier Hernandez <[email protected]> wrote:
> Hi Serkan,
>
> On Thursday, June 01, 2017 21:31 CEST, Serkan Çoban <[email protected]> wrote:
>
> > Is it possible that this matches your observations ?
> Yes, that matches what I see. So 19 files are being healed in parallel by 19 SHD processes. I thought only one file was being healed at a time. Then what is the meaning of the disperse.shd-max-threads parameter? If I set it to 2, will each SHD thread heal two files at the same time?
>
> Each SHD normally heals a single file at a time. However, there is an SHD on each node, so all of them are trying to process dirty files. If one picks a file to heal, the other SHDs will skip that one and try another.
>
> disperse.shd-max-threads indicates how many heals each SHD can do simultaneously. Setting it to 2 would mean that each SHD could heal 2 files, up to 40 using 20 nodes.
>
> > How many IOPS can your bricks handle ?
> Bricks are 7200RPM NL-SAS disks, 70-80 random IOPS max. But the write pattern seems sequential: 30-40MB bulk writes every 4-5 seconds. This is what iostat shows.
>
> This is probably caused by some write-back policy on the file system that accumulates multiple writes, optimizing disk access. This way the apparent 1000 IOPS can be handled by a single disk with 70-80 real IOPS by making each IO operation bigger.
>
> > Do you have a test environment where we could check all this ?
> Not currently, but I will in 4-5 weeks. New servers are arriving; I will add this test to my notes.
>
> > There's a feature that allows configuring the self-heal block size to optimize these cases. The feature is available on 3.11.
> I did not see this in the 3.11 release notes; what parameter name should I look for?
>
> The new option is named 'self-heal-window-size'. It represents the size of each heal operation as the number of 128KB blocks to use. The default value is 1. To use blocks of 1MB, this parameter should be set to 8.
>
> Xavi
>
> On Thu, Jun 1, 2017 at 10:30 AM, Xavier Hernandez <[email protected]> wrote:
> > Hi Serkan,
> >
> > On 30/05/17 10:22, Serkan Çoban wrote:
> >> Ok, I understand that the heal operation takes place on the server side. In this case I should see X KB of outbound network traffic from 16 servers and 16X KB of inbound traffic to the failed brick's server, right? That process will get 16 chunks, recalculate the missing chunk and write it to disk.
> >
> > That should be the normal operation for a single heal.
> >
> >> The problem is I am not seeing that kind of traffic on the servers. In my configuration (16+4 EC) all 20 servers have 7-8MB outbound traffic and none of them has more than 10MB incoming traffic. Only the heal operation is happening on the cluster right now, no client or other traffic. I see a constant 7-8MB write to the healing brick's disk. So where is the missing traffic?
> >
> > Not sure about your configuration, but probably you are seeing the result of having the SHD of each server doing heals. That would explain the network traffic you have.
> >
> > Suppose that all SHDs but the one on the damaged brick are working. In this case 19 servers will pick 16 fragments each. This gives 19 * 16 = 304 fragments to be requested. EC balances the reads among all available servers, and there's a chance (1/19) that a fragment is local to the server asking for it. So we'll need a total of 304 - 304 / 19 = 288 network requests, 288 / 19 = 15.2 sent by each server.
> >
> > If we have a total of 288 requests, it means that each server will answer 288 / 19 = 15.2 requests. The net effect of all this is that each healthy server is sending 15.2*X bytes of data and each server is receiving 15.2*X bytes of data.
> >
> > Now we need to account for the writes to the damaged brick. We have 19 simultaneous heals. This means that the damaged brick will receive 19*X bytes of data, and each healthy server will send X additional bytes of data.
> >
> > So:
> >
> > A healthy server receives 15.2*X bytes of data
> > A healthy server sends 16.2*X bytes of data
> > A damaged server receives 19*X bytes of data
> > A damaged server sends few bytes of data (basically communication and synchronization overhead)
> >
> > As you can see, in this configuration each server has almost the same amount of inbound and outbound traffic. The only big difference is the damaged brick, which should receive a little more traffic but send much less.
> >
> > Is it possible that this matches your observations ?
> >
> > There's one more thing to consider here, and it's the apparently low throughput of self-heal. One possible thing to check is the small size and random behavior of the requests.
> >
> > Assuming that each request has a size of ~128 / 16 = 8KB, at a rate of ~8 MB/s the servers are processing ~1000 IOPS. Since requests are going to 19 different files, even if each file is accessed sequentially, the real effect will be like random access (some read-ahead on the filesystem can improve reads a bit, but writes won't benefit so much).
> >
> > How many IOPS can your bricks handle ?
> >
> > Do you have a test environment where we could check all this ? If possible, it would be interesting to have only a single SHD (kill all SHDs on all servers but one). In that situation, without client accesses, we should see the 16/1 ratio of reads vs writes on the network. We should also see a similar or even a little better speed because all reads and writes will be sequential, optimizing the available IOPS.
> >
> > There's a feature that allows configuring the self-heal block size to optimize these cases. The feature is available on 3.11.
> >
> > Best regards,
> >
> > Xavi
> >
> >> On Tue, May 30, 2017 at 10:25 AM, Ashish Pandey <[email protected]> wrote:
> >>>
> >>> When we say client side heal or server side heal, we are basically talking about the side which "triggers" the heal of a file.
> >>>
> >>> 1 - server side heal - shd scans indices and triggers heal
> >>>
> >>> 2 - client side heal - a fop finds that a file needs heal and triggers heal for that file.
> >>>
> >>> Now, what happens when heal gets triggered. In both cases the following functions take part -
> >>>
> >>> ec_heal => ec_heal_throttle => ec_launch_heal
> >>>
> >>> ec_launch_heal just creates heal tasks (with ec_synctask_heal_wrap, which calls ec_heal_do) and puts them into a queue. This happens on the server, and the "syncenv" infrastructure, which is nothing but a set of workers, picks these tasks and executes them. That is when the actual read/write for heal happens.
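To make the heal traffic arithmetic above easy to replay for other cluster sizes, here is a small back-of-the-envelope sketch in plain Python (not Gluster code; the 20-node 16+4 layout, the ~8KB request size and the ~8MB/s rate are the figures quoted above, the rest is plain arithmetic):

# Back-of-the-envelope replay of the heal traffic arithmetic above.
# Not GlusterFS code; the same numbers, parametrized so they can be
# re-checked for other node counts or EC layouts.

def heal_traffic(nodes=20, data_fragments=16):
    """Traffic per heal, in units of X (the fragment size), when every
    healthy SHD heals one file and the damaged brick's SHD is idle."""
    healthy = nodes - 1                       # 19 SHDs doing heals
    reads_total = healthy * data_fragments    # 19 * 16 = 304 fragment reads
    local = reads_total / healthy             # ~1/19 of reads are local: 16
    network_reads = reads_total - local       # 304 - 16 = 288 network requests
    per_server = network_reads / healthy      # ~15.2 sent and answered by each server
    return {
        "healthy server receives": per_server,    # ~15.2 * X
        "healthy server sends": per_server + 1,   # answers plus its write fragment, ~16.2 * X
        "damaged brick receives": healthy,        # 19 * X
    }

def brick_iops(rate_mb_per_s=8, request_kb=128 / 16):
    """Rough IOPS implied by ~8 MB/s of ~8 KB requests."""
    return rate_mb_per_s * 1024 / request_kb  # ~1000 IOPS

print(heal_traffic())  # ~15.2*X received, ~16.2*X sent, 19*X to the damaged brick
print(brick_iops())    # 1024.0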
> >>>
> >>> ________________________________
> >>> From: "Serkan Çoban" <[email protected]>
> >>> To: "Ashish Pandey" <[email protected]>
> >>> Cc: "Gluster Users" <[email protected]>
> >>> Sent: Monday, May 29, 2017 6:44:50 PM
> >>> Subject: Re: [Gluster-users] Heal operation detail of EC volumes
> >>>
> >>>>> Healing can be triggered on the client side (access of a file) or the server side (shd).
> >>>>> However, in both cases the actual heal starts from the "ec_heal_do" function.
> >>>
> >>> If I do a recursive getfattr operation from the clients, then all heal operations are done on the clients, right? The client reads the chunks, recalculates and writes the missing chunk. And if I don't access the files from a client, then the SHD daemons will start the heal and read, recalculate and write the missing chunks, right?
> >>>
> >>> In the first case the EC calculations take place in the client fuse process, and in the second case the EC calculations are made in the SHD process, right? Does the brick process have any role in the EC calculations?
> >>>
> >>> On Mon, May 29, 2017 at 3:32 PM, Ashish Pandey <[email protected]> wrote:
> >>>>
> >>>> ________________________________
> >>>> From: "Serkan Çoban" <[email protected]>
> >>>> To: "Gluster Users" <[email protected]>
> >>>> Sent: Monday, May 29, 2017 5:13:06 PM
> >>>> Subject: [Gluster-users] Heal operation detail of EC volumes
> >>>>
> >>>> Hi,
> >>>>
> >>>> When a brick fails in EC, what is the healing read/write data path? Which processes do the operations?
> >>>>
> >>>> Healing can be triggered on the client side (access of a file) or the server side (shd). However, in both cases the actual heal starts from the "ec_heal_do" function.
> >>>>
> >>>> Assume a 2GB file is being healed in a 16+4 EC configuration. I was thinking that the SHD daemon on the failed brick's host will read 2GB from the network, reconstruct its 100MB chunk and write it onto the brick. Is this right?
> >>>>
> >>>> You are correct about the read/write. The only point is that the SHD daemon on one of the good bricks will pick the index entry and heal it. The SHD daemon scans the .glusterfs/index directory and heals the entries. If the brick went down while IO was going on, the index will be present on the killed brick as well. However, if a brick was down and you then started writing to a file, the index entry would not be present on the killed brick. So even after that brick comes back up, the shd on it will not be able to find this index entry. The other bricks will have the entries, though, and their shd will heal it.
> >>>>
> >>>> Note: I am considering each brick to be on a different node.
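As a quick sanity check on the data volumes in the 2GB / 16+4 example above, here is a rough sketch (it ignores EC fragment headers and padding; note that with 16 data fragments the per-brick piece of a 2GB file is 2048 / 16 = 128MB, so the "100MB" above is a round figure):

# Rough data volumes for healing one brick's fragment of a 2GB file on a
# 16+4 dispersed volume. Ignores EC fragment headers/padding, so real
# on-disk sizes are slightly larger.

file_size_mb = 2048     # the 2GB file from the example
data_bricks = 16        # fragments needed to reconstruct the data
fragment_mb = file_size_mb / data_bricks   # 128 MB stored per brick

read_mb = data_bricks * fragment_mb        # 2048 MB read from the healthy bricks in total
write_mb = fragment_mb                     # 128 MB written to the healed brick

print(fragment_mb, read_mb, write_mb)      # 128.0 2048.0 128.0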
> >>>>
> >>>> Ashish
> >>>>
>
> --
> Pranith
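For reference, the two heal tunables discussed in the thread above combine as follows. This is only a sketch of the scaling described there (one SHD per node, heal window counted in 128KB blocks); the option names are the ones given in the thread, so verify them with 'gluster volume set help' on your Gluster version before relying on them:

# How the heal tunables discussed above scale, per the thread:
#   disperse.shd-max-threads -> parallel heals per SHD (one SHD per node)
#   self-heal-window-size    -> number of 128KB blocks per heal operation
# Option names are taken from the thread; verify them on your Gluster version.

def concurrent_heals(nodes=20, shd_max_threads=2):
    return nodes * shd_max_threads        # e.g. 20 * 2 = 40 files healing at once

def heal_block_kb(window_size=8):
    return window_size * 128              # e.g. 8 * 128KB = 1MB per heal operation

print(concurrent_heals())  # 40
print(heal_block_kb())     # 1024 (KB)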
_______________________________________________
Gluster-users mailing list
[email protected]
http://lists.gluster.org/mailman/listinfo/gluster-users
