[Gluster-devel] review request - Change the way client uuid is built
Hi all, [1] might have implications across different components in the stack. Your reviews are requested.

rpc: Change the way client uuid is built

Problem: Today the main users of the client uuid are the protocol layers, locks, and leases. The protocol layers require each client uuid to be unique, even across connects and disconnects. Locks and leases on the server side also use the same client uuid, which changes across graph switches and across file migrations; this makes graph switches and file migration tedious for locks and leases. As of today, lock migration across a graph switch is client driven, i.e. when a graph switches, the client reassociates all the locks (which were associated with the old graph's client uuid) with the new graph's client uuid. This means a flood of fops to get and set locks for each fd. File migration across bricks becomes even more difficult, as the client uuid for the same client is different on the other brick. The exact same set of issues exists for leases as well.

Hence the solution: Make the migration of locks and leases during graph switch and file migration server driven instead of client driven. This can be achieved by changing the format of the client uuid.

Client uuid currently: %s(ctx uuid)-%s(protocol client name)-%d(graph id)%s(setvolume count/reconnect count)

Proposed client uuid: "CTX_ID:%s-GRAPH_ID:%d-PID:%d-HOST:%s-PC_NAME:%s-RECON_NO:%s"
- CTX_ID: This will be constant per client.
- GRAPH_ID, PID, HOST, PC_NAME (protocol client name), RECON_NO (setvolume count) remain the same.

With this, the first part of the client uuid, CTX_ID+GRAPH_ID, remains constant across file migration, which makes the migration easier. Locks and leases store only the first part, CTX_ID+GRAPH_ID, as their client identification. This means that when the new graph connects, the locks and leases xlators should walk through their database to update the client id to have the new GRAPH_ID. Thus the graph switch is made server driven and saves a lot of network traffic.
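The proposal above can be sketched in a few lines of Python. This is an illustrative model only (function names and field values are hypothetical, not the actual Gluster code); it shows why locks and leases can key on the stable CTX_ID+GRAPH_ID prefix:

```python
# Hypothetical sketch of the proposed client-uuid format and of the
# stable prefix that locks/leases would store. Illustrative only.

def build_client_uid(ctx_id, graph_id, pid, host, pc_name, recon_no):
    """Proposed format: CTX_ID stays constant for the life of the client."""
    return ("CTX_ID:%s-GRAPH_ID:%d-PID:%d-HOST:%s-PC_NAME:%s-RECON_NO:%s"
            % (ctx_id, graph_id, pid, host, pc_name, recon_no))

def lock_owner_key(client_uid):
    """Locks/leases keep only CTX_ID + GRAPH_ID, so the key survives
    reconnects and file migration between bricks."""
    return client_uid.split("-PID:")[0]

# The same client talking to two bricks (different protocol/client names,
# different reconnect counts) still yields one lock-owner key:
uid_brick0 = build_client_uid("1234", 0, 4242, "node1", "vol-client-0", "2")
uid_brick1 = build_client_uid("1234", 0, 4242, "node1", "vol-client-1", "5")
assert lock_owner_key(uid_brick0) == lock_owner_key(uid_brick1) == "CTX_ID:1234-GRAPH_ID:0"

# On a graph switch only GRAPH_ID changes; the server can walk its lock
# table and rewrite this key itself, instead of the client re-sending
# every lock (the flood of fops described above):
uid_new_graph = build_client_uid("1234", 1, 4242, "node1", "vol-client-0", "3")
assert lock_owner_key(uid_new_graph) == "CTX_ID:1234-GRAPH_ID:1"
```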
Change-Id: Ia81d57a9693207cd325d7b26aee4593fcbd6482c BUG: 1369028 Signed-off-by: Poornima G Signed-off-by: Susant Palai [1] http://review.gluster.org/#/c/13901/10/ regards, Raghavendra ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] relative ordering of writes to same file from two different fds
Hi all, this mail is to figure out the behavior of writes to the same file from two different fds. As Ryan quotes in one of the comments:

I think it's not safe in this case: 1. P1 writes to F1 using FD1. 2. After P1's write finishes, P2 writes to the same place using FD2. Since the two writes do not conflict with each other now, the order in which they are sent to the underlying fs is not determined, so the final data may be P1's or P2's. These semantics are not the same as Linux buffered io: Linux buffered io makes the second write cover the first one, that is to say the final data is P2's. You can see it in Linux NFS (as we are all network filesystems): in fs/nfs/file.c:nfs_write_begin(), NFS flushes an 'incompatible' request first before another write begins. The way two requests are determined to be 'incompatible' is that they come from two different open fds. I think write-behind behaviour should stay the same as the Linux page cache.

However, my understanding is that filesystems need not maintain the relative order of writes (as received from vfs/kernel) on two different fds. Also, if we have to maintain the order it might come with increased latency, because "newer" writes have to wait on "older" ones. This wait can fill up the write-behind buffer and can eventually result in a full write-behind cache, and hence not being able to "write-back" newer writes.

* What does POSIX say about it?
* How do other filesystems behave in this scenario?

Also, the current write-behind implementation has the concept of "generation numbers". To quote from a comment:

uint64_t gen; /* Liability generation number. Represents the current 'state' of liability. Every new addition to the liability list bumps the generation number. A newly arrived request is only required to perform causal checks against the entries in the liability list which were present at the time of its addition. The generation number at the time of its addition is stored in the request and used during checks. The liability list can grow while the request waits in the todo list waiting for its dependent operations to complete. However, it is not of the request's concern to depend itself on those new entries which arrived after it arrived (i.e., those that have a liability generation higher than itself). */

So, if a single thread is doing writes on two different fds, generation numbers are sufficient to enforce the relative ordering. If
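The generation-number scheme described in that comment can be modeled with a small toy in Python. This is a conceptual sketch only, not the actual write-behind code; the class and method names are invented for illustration:

```python
# Toy model of write-behind "liability generation numbers": a newly
# arrived request orders itself only against liabilities that already
# existed at the time of its addition, never against later arrivals.

class WriteBehind:
    def __init__(self):
        self.gen = 0           # bumped on every new liability
        self.liabilities = []  # (gen_at_add, request) not yet written back

    def add_liability(self, req):
        self.gen += 1
        self.liabilities.append((self.gen, req))

    def deps_for(self, arrival_gen):
        """Return the liabilities a request arriving at arrival_gen must
        perform causal checks against (those with gen <= arrival_gen)."""
        return [r for (g, r) in self.liabilities if g <= arrival_gen]

wb = WriteBehind()
wb.add_liability("write-1 (fd1)")
snapshot = wb.gen               # a later write on fd2 records the gen it saw
wb.add_liability("write-2 (fd2)")

# write-2 depends on write-1; a request at the earlier snapshot does not
# depend on write-2, which arrived after it:
assert wb.deps_for(snapshot) == ["write-1 (fd1)"]
```

For a single thread issuing both writes, write-2 necessarily arrives after write-1 is on the liability list, so the check above is enough to preserve their relative order.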
Re: [Gluster-devel] Introducing Tendrl
On 09/21/2016 07:03 AM, Gerard Braad wrote: On Wed, Sep 21, 2016 at 11:58 AM, Dan Mick wrote: > Is it Tendrl or Tendryl? (or the actual word, which would be 'tendril' > and thus unambiguous and memorable)? So, I am not the only person being confused about this Sorry for injecting confusion around the spelling - I think it is just "tendrl": https://github.com/Tendrl/tendrl Regards, Ric
Re: [Gluster-devel] Introducing Tendrl
On Wed, Sep 21, 2016 at 11:58 AM, Dan Mick wrote: > Is it Tendrl or Tendryl? (or the actual word, which would be 'tendril' > and thus unambiguous and memorable)? So, I am not the only person being confused about this -- Gerard Braad | http://gbraad.nl [ Doing Open Source Matters ]
Re: [Gluster-devel] [Heketi] Mailing list
You are completely correct Jeff. We will move to a Google Group email list. I have updated Heketi site with the new information: https://github.com/heketi/heketi#community We will update gluster-devel when we continue working together, for example, on iSCSI and similar projects. Thanks all, - Luis - Original Message - From: "Jeff Darcy"To: "Luis Pabón" Cc: "gluster-devel" Sent: Tuesday, September 20, 2016 4:17:09 PM Subject: Re: [Gluster-devel] [Heketi] Mailing list > Hi gluster-devel, > At the Heketi project, we wanted to get better communication with the > GlusterFS community. We are a young project and didn't have our own > mailing list, so we asked if we could also be part gluster-devel mailing > list. The plan is to Heketi specific emails to gluster-devel using the > subject tag '[Heketi]'. This is what is done in OpenStack, where they > all share the same mailing list, and use the subject line tag for > separate projects. > I consider this a pilot, nothing is set in stone, but I wanted to ask > your opinion in the matter. Personally, I'd rather see Heketi get its own mailing list(s) forthwith. While it's fine for things that affect both projects to be crossposted, putting general (potentially non-Gluster-related) Heketi traffic on a Gluster mailing list has the following effects. * Gluster developers who have some interest in Heketi will have to "manually filter" which Heketi messages are actually relevant. * Gluster developers who have *no* interest in Heketi (yes, they exist) will have to set up more automatic filters. * Non-Gluster developers who want to follow Heketi will have to join a Gluster mailing list which has lots of stuff they couldn't care less about. * Searching for Heketi-related email gets weird, with lots of false positives on "Gluster" just because it's on our list. * Heketi developers might feel constrained in what they can say about Gluster, as compared to what they might say on a Heketi-specific list (even if public). 
IMO the best place for any project XYZ to have its discussions is on XYZ's own mailing list(s).
Re: [Gluster-devel] [Heketi] Mailing list
> Hi gluster-devel, > At the Heketi project, we wanted to get better communication with the > GlusterFS community. We are a young project and didn't have our own > mailing list, so we asked if we could also be part gluster-devel mailing > list. The plan is to Heketi specific emails to gluster-devel using the > subject tag '[Heketi]'. This is what is done in OpenStack, where they > all share the same mailing list, and use the subject line tag for > separate projects. > I consider this a pilot, nothing is set in stone, but I wanted to ask > your opinion in the matter. Personally, I'd rather see Heketi get its own mailing list(s) forthwith. While it's fine for things that affect both projects to be crossposted, putting general (potentially non-Gluster-related) Heketi traffic on a Gluster mailing list has the following effects. * Gluster developers who have some interest in Heketi will have to "manually filter" which Heketi messages are actually relevant. * Gluster developers who have *no* interest in Heketi (yes, they exist) will have to set up more automatic filters. * Non-Gluster developers who want to follow Heketi will have to join a Gluster mailing list which has lots of stuff they couldn't care less about. * Searching for Heketi-related email gets weird, with lots of false positives on "Gluster" just because it's on our list. * Heketi developers might feel constrained in what they can say about Gluster, as compared to what they might say on a Heketi-specific list (even if public). IMO the best place for any project XYZ to have its discussions is on XYZ's own mailing list(s). ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Introducing Tendrl
On 09/20/2016 08:09 PM, Joe Julian wrote: Does this compare to ViPR? I am not a ViPR expert, you would have to poke John Mark Walker for that :) My assumption is that they might want to use these modules (from tendryl down to the ceph/gluster bits) to add support for ceph and gluster. Regards, Ric On September 20, 2016 9:52:54 AM PDT, Ric Wheelerwrote: On 09/20/2016 10:23 AM, Gerard Braad wrote: Hi Mrugesh, On Tue, Sep 20, 2016 at 3:10 PM, Mrugesh Karnik wrote: I'd like to introduce the Tendrl project. Tendrl aims to build a management interface for Ceph. We've pushed some documentation to the On Tue, Sep 20, 2016 at 3:15 PM, Mrugesh Karnik wrote: I'd like to introduce the Tendrl project. Tendrl aims to build a management interface for Gluster. We've pushed some documentation to It might help to introduce Tendrl as the "Universal Storage Manager'" with a possibility to either manage Ceph and/or Gluster. I understand you want specific feedback, but a clear definition of the tool would be helpful. (Apologies for reposting my response - gmail injected html into what I thought was a text reply and it bounced from ceph-devel.) Hi Gerard, I see the goal differently. It is better to think of tendryl as one component of a whole management application stack. At the bottom, we will have ceph specific components (ceph-mgr) and gluster specific components (glusterd), as well as other local storage/file system components like libstoragemgt and so on. Tendryl is the next layer up from that, but it itself is meant to be consumed by presentation layers. For a stand alone thing that we hope to use at Red Hat, there will be a universal storage manager stack with everything I mentioned above in it, as well as the GUI code. Other projects will hopefully find this useful enough and plug some or all of the components into other management stacks. 
From my point of view, the job is to try to provide as many reusable components as possible that will be generically interesting to a wide variety of applications. It is definitely not about trying to make all storage stacks look the same and force artificial new names/concepts/etc on the users. Of course, any one application will tend to have a similar "skin" for UX elements to try and make it consistent for users. If we do it right, people passionate about Ceph but who don't care about Gluster will be able to avoid getting tied up in something outside their interest. The same goes the other way around for Gluster developers who don't care or know about Ceph. Over time, this might extend to other storage types like Samba or NFS Ganesha clusters, etc. Regards, Ric
[Gluster-devel] [Heketi] Mailing list
Hi gluster-devel, At the Heketi project, we wanted to get better communication with the GlusterFS community. We are a young project and didn't have our own mailing list, so we asked if we could also be part of the gluster-devel mailing list. The plan is to send Heketi-specific emails to gluster-devel using the subject tag '[Heketi]'. This is what is done in OpenStack, where they all share the same mailing list, and use the subject line tag for separate projects. I consider this a pilot, nothing is set in stone, but I wanted to ask your opinion on the matter. Regards, - Luis
Re: [Gluster-devel] Introducing Tendrl
Does this compare to ViPR? On September 20, 2016 9:52:54 AM PDT, Ric Wheelerwrote: >On 09/20/2016 10:23 AM, Gerard Braad wrote: >> Hi Mrugesh, >> >> On Tue, Sep 20, 2016 at 3:10 PM, Mrugesh Karnik >wrote: >>> I'd like to introduce the Tendrl project. Tendrl aims to build a >>> management interface for Ceph. We've pushed some documentation to >the >> On Tue, Sep 20, 2016 at 3:15 PM, Mrugesh Karnik >wrote: >>> I'd like to introduce the Tendrl project. Tendrl aims to build a >>> management interface for Gluster. We've pushed some documentation to >> It might help to introduce Tendrl as the "Universal Storage Manager'" >> with a possibility to either manage Ceph and/or Gluster. >> I understand you want specific feedback, but a clear definition of >the >> tool would be helpful. >> > >(Apologies for reposting my response - gmail injected html into what I >thought >was a text reply and it bounced from ceph-devel.) > >Hi Gerard, > >I see the goal differently. > >It is better to think of tendryl as one component of a whole management > >application stack. At the bottom, we will have ceph specific components > >(ceph-mgr) and gluster specific components (glusterd), as well as other >local >storage/file system components like libstoragemgt and so on. > >Tendryl is the next layer up from that, but it itself is meant to be >consumed by >presentation layers. For a stand alone thing that we hope to use at Red >Hat, >there will be a universal storage manager stack with everything I >mentioned >above in it, as well as the GUI code. > >Other projects will hopefully find this useful enough and plug some or >all of >the components into other management stacks. > >From my point of view, the job is to try to provide as much as possible > >re-usable components that will be generically interesting to a wide >variety of >applications. It is definitely not about trying to make all storage >stacks look >the same and force artificial new names/concepts/etc on the users. 
Of >course, >any one application will tend to have a similar "skin" for UX elements >to try >and make it consistent for users. > >If we do it right, people passionate about Ceph but who don't care >about Gluster >will be able to be avoid getting tied up in something out of their >interest. >Same going the other way around for Gluster developers who don't care >or know >about Ceph. Over time, this might extend to other storage types like >Samba or >NFS Ganesha clusters, etc. > >Regards, > >Ric > > > > >___ >Gluster-devel mailing list >Gluster-devel@gluster.org >http://www.gluster.org/mailman/listinfo/gluster-devel -- Sent from my Android device with K-9 Mail. Please excuse my brevity.___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Introducing Tendrl
On 09/20/2016 10:23 AM, Gerard Braad wrote: Hi Mrugesh, On Tue, Sep 20, 2016 at 3:10 PM, Mrugesh Karnikwrote: I'd like to introduce the Tendrl project. Tendrl aims to build a management interface for Ceph. We've pushed some documentation to the On Tue, Sep 20, 2016 at 3:15 PM, Mrugesh Karnik wrote: I'd like to introduce the Tendrl project. Tendrl aims to build a management interface for Gluster. We've pushed some documentation to It might help to introduce Tendrl as the "Universal Storage Manager'" with a possibility to either manage Ceph and/or Gluster. I understand you want specific feedback, but a clear definition of the tool would be helpful. (Apologies for reposting my response - gmail injected html into what I thought was a text reply and it bounced from ceph-devel.) Hi Gerard, I see the goal differently. It is better to think of tendryl as one component of a whole management application stack. At the bottom, we will have ceph specific components (ceph-mgr) and gluster specific components (glusterd), as well as other local storage/file system components like libstoragemgt and so on. Tendryl is the next layer up from that, but it itself is meant to be consumed by presentation layers. For a stand alone thing that we hope to use at Red Hat, there will be a universal storage manager stack with everything I mentioned above in it, as well as the GUI code. Other projects will hopefully find this useful enough and plug some or all of the components into other management stacks. From my point of view, the job is to try to provide as much as possible re-usable components that will be generically interesting to a wide variety of applications. It is definitely not about trying to make all storage stacks look the same and force artificial new names/concepts/etc on the users. Of course, any one application will tend to have a similar "skin" for UX elements to try and make it consistent for users. 
If we do it right, people passionate about Ceph but who don't care about Gluster will be able to avoid getting tied up in something outside their interest. The same goes the other way around for Gluster developers who don't care or know about Ceph. Over time, this might extend to other storage types like Samba or NFS Ganesha clusters, etc. Regards, Ric
Re: [Gluster-devel] Gluster and FreeBSD
> > We did get some regular contributions to have Gluster function on > > FreeBSD, but they seem to be more sporadic now. If nobody steps up, I > > would suggest to keep compiling on FreeBSD, but nothing more. Maybe at a > > later time someone shows more interest. > > > > NetBSD on the other hand already runs some of the regression tests. And > > it seems to hit valid problems in the code that we for whatever lucky > > reason do not hit (yet?) on Linux. I see some value in the NetBSD > > environment, and if the infra team with help from Manu can keep it > > up-to-date it would be good to have it running. > > +1 to this approach regarding BSDs. I'm ok with this approach. Can we then do the following? * Document the issues when deploying FreeBSD. * Remove any references to us officially supporting FreeBSD (are there any?) * Ask for contributions from the FreeBSD community, especially if someone wants to take over maintainership of the port. -- nigelb
Re: [Gluster-devel] [Heketi] How pushing to heketi happens - especially about squashing
- Original Message - > Hi Michael, > We have a new mailing list, it is gluster-devel with [Heketi] in the > subject. Hi Luis, I know, which is what triggered that other mail. :-) It is not really a new mailing-list but a [Heketi] tag to be used in an old mailing list, and it was also said to use it for 'gluster related' discussions in heketi and not for general heketi dev discussions, which kind of soft and unclear to me. I was suggesting a mailing list for *all* heketi development related discussions. This stuff here is not necessary interesting for the broader gluster development community. But since you pulled the mail over from the internal list, to gluster-devel, I'm going to reply here. :-) > On the concept of github. It is always interesting to compare what we > know and how we are used to something to something new. In Github, > we do not need to let Github squash at all. I was doing that as a 'pilot'. Yeah, my request was to drop that because it creates bad commit messages and may not be what the author intended. It fiddles with the author's patches, which a UI should never do imho. That's just scary. > The real method is for patches to be added to a PR, and if too many > patches are added, for the author to squash them, and send a new one. > This is documented in the Development Guide in the Wiki. Yeah, I am questioning that, because I think it is flawed: 1) "too many patches ==> squash" is imho the wrong decision point. A PR can have as many commits as the author wants, as long as each commit is logical and complete. I even generally encourage more atomic patches. So imho that rule 'many patches ==> squash' creates the wrong incentive for squashing. 2) After the squashing, the author should imho not create a new PR but update the existing one. It has all the context and history for the patchset. > The author should also note that their first patch/commit sent as > a PR is the information used as the PR. 
Lots of PRs are being sent > with almost no information, and I have let this happen because most > people are still ramping up. Note that the commit message is NOT the PR title. That only happens if you use github to squash... So if we don't use github to squash, the PR title and initial description are less important for the final result! > There is no reason why commit messages cannot be as detailed as > those from Gerrit. No idea what gerrit has to do with commit messages. Commit messages always come from the author, unless you let some sofware fiddle with it. ;-) > Here is an example: > https://github.com/heketi/heketi/pull/393 . I am a big fan of long commit messages, myself guilty of commits with a message much longer than the actual patch. And the PR title is by default only the message of the first commit in the series, I think. So one should add content describing the proposed patchset when creating the PR. Full agreement here. But I think we should not over-estimate the title/description of the PR. E.g. here are two examples of PRs that had a separate title/description and the actual more detailed info was in the commit messages of the patches: https://github.com/heketi/heketi/pull/477 https://github.com/heketi/heketi/pull/499 Those were mangled together by a github-squash-push. > The process to update changes is to update the forked > branch, and not to amend the same change. Amending makes it impossible > to determine the changes from patch to patch, and makes it extremely hard > on reviewers (me). Hmm. My request is exactly to do amends. It is just standard and good (imho) and even necessary git workflow. See below for more details on that under comments to point #3. > Here are my thoughts on your questions below: > > 1) The the review should not squash the authors commits unless >the author explicitly requests or approves that. > [lpabon] Absolutely. 
The pilot, although it worked well technically, > it confuses those who come from other source control systems. Sorry it was getting kinda jet-laggy late so my words were not the best... I wanted to say: The reviewer should not squash (or otherwise change) the author's commits unless explicitly requested or approved, irrespective of the tool used for doing the changes. > 2) We should avoid using github to merge because this creates >bad commit messages. > [lpabon] I'm not sure what you mean by this, but I would not > "avoid" github in any way. That is like saying "avoid Gerrit". I wanted to say: "In particular, and especially, we should not use Gerrit for doing squashes. It creates bad commit messages." And yeah, I do explicitly mean "Avoid that aspect of github." Like "stay away from any feature in github that change the original commits". (Not sure if there is more lurking. ;-) > 3) (As a consequence of the above,) If we push delta-patches >to update PRs, that can usually not be the final push, but >needs a final iteration of force-pushing an amended
Re: [Gluster-devel] Gluster and FreeBSD
On Tue, Sep 20, 2016 at 12:31 AM, Niels de Voswrote: > On Tue, Sep 20, 2016 at 09:16:54AM +0530, Nigel Babu wrote: >> On Fri, Sep 09, 2016 at 06:07:41PM +0530, Nithya Balachandran wrote: >> > Hi, >> > >> > I recently debugged a problem [1] where linkfiles were not created properly >> > a gluster volume created using bricks running UFS . Whenever a linkfile was >> > created, the sticky bit was not set on it causing the same file to be >> > listed twice. >> > >> > From https://www.freebsd.org/cgi/man.cgi?query=chmod=2 >> > >> > The FreeBSD VM system totally ignores the sticky bit (S_ISVTX) for >> > executables. On UFS-based file systems (FFS, LFS) the sticky bit may only >> > be set upon directories. >> > >> > Based on this I do not think we can support UFS bricks for gluster volumes. >> > However, I have not worked with FreeBSD so I would like folks who have to >> > let me know if this is correct or if there is something I am missing. >> > >> > I was able to force the sticky bit on a file using a testfile attached to >> > [1] but it is not straightforward and I am reluctant to propose this. >> > >> > Thanks, >> > Nithya >> > >> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1176011 >> >> Giving this thread a signal boost. We should think about this if we're going >> to >> continue to support *BSD. >> >> Emmanuel, I know you work on NetBSD, but do you have thoughts to add here? > > We did get some regular contributions to have Gluster function on > FreeBSD, but they seem to be more sporadic now. If nobody steps up, I > would suggest to keep compiling on FreeBSD, but nothing more. Maybe at a > later time someone shows more interest. > > NetBSD on the other hand already runs some of the regression tests. And > it seems to hit valid problems in the code that we for whatever lucky > reason do not hit (yet?) on Linux. I see some value in the NetBSD > environment, and if the infra team with help from Manu can keep it > up-to-date it would be good to have it running. 
+1 to this approach regarding BSDs. -Vijay
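The sticky-bit difference discussed in this thread is easy to observe. The sketch below (illustrative, not part of any Gluster test suite) sets the sticky bit on a regular file: on Linux this succeeds and the bit shows up in st_mode, whereas per the quoted chmod(2) man-page text, FreeBSD UFS permits S_ISVTX only on directories, which is why DHT linkfiles lose their marker there:

```python
# Set the sticky bit (S_ISVTX) on a regular file and read it back.
# On Linux this works; on FreeBSD UFS the bit may only be set on
# directories, per the chmod(2) man page quoted in the thread.
import os
import stat
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    os.chmod(path, 0o1000 | 0o644)      # sticky bit + rw-r--r--
    mode = os.stat(path).st_mode
    print(bool(mode & stat.S_ISVTX))    # True on Linux
finally:
    os.unlink(path)
```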
Re: [Gluster-devel] [Heketi] Block store related API design discussion
Awesome, Thanks guys. - Luis - Original Message - From: "Pranith Kumar Karampuri"To: "Niels de Vos" Cc: "Luis Pabón" , "gluster-devel" , "Stephen Watt" , "Ramakrishna Yekulla" , "Humble Chirammal" Sent: Tuesday, September 20, 2016 5:53:30 AM Subject: Re: [Gluster-devel] [Heketi] Block store related API design discussion On Mon, Sep 19, 2016 at 9:22 PM, Niels de Vos wrote: > On Mon, Sep 19, 2016 at 10:31:11AM -0400, Luis Pabón wrote: > > Using qemu is interesting, but the I/O should be using the IO path of > QEMU block API. If not, > > TCMU would not know how to work with QEMU dynamic QCOW2 files. > > > > Now, if TCMU already has this, then that would be great! > > It has a qcow2 header, maybe you guys are lucky! > https://github.com/open-iscsi/tcmu-runner/blob/master/qcow2.h Sent the earlier mail before seeing this mail :-). So yes, what we discussed is to see if this qemu in tcmu can internally use gfapi for doing the operations or not is something we are trying to find out. > > > Niels > > > > > - Luis > > > > - Original Message - > > From: "Prasanna Kalever" > > To: "Niels de Vos" > > Cc: "Luis Pabón" , "Stephen Watt" , > "gluster-devel" , "Ramakrishna Yekulla" < > rre...@redhat.com>, "Humble Chirammal" > > Sent: Monday, September 19, 2016 7:13:36 AM > > Subject: Re: [Gluster-devel] [Heketi] Block store related API design > discussion > > > > On Mon, Sep 19, 2016 at 4:09 PM, Niels de Vos wrote: > > > > > > On Mon, Sep 19, 2016 at 03:34:29PM +0530, Prasanna Kalever wrote: > > > > On Mon, Sep 19, 2016 at 10:13 AM, Niels de Vos > wrote: > > > > > On Tue, Sep 13, 2016 at 12:06:00PM -0400, Luis Pabón wrote: > > > > >> Very good points. Thanks Prasanna for putting this together. I > agree with > > > > >> your comments in that Heketi is the high level abstraction API > and it should have > > > > >> an API similar of what is described by Prasanna. 
> > > > >> > > > > >> I definitely do not think any File Api should be available in > Heketi, > > > > >> because that is an implementation of the Block API. The Heketi > API should > > > > >> be similar to something like OpenStack Cinder. > > > > >> > > > > >> I think that the actual management of the Volumes used for Block > storage > > > > >> and the files in them should be all managed by Heketi. How they > are > > > > >> actually created is still to be determined, but we could have > Heketi > > > > >> create them, or have helper programs do that. > > > > > > > > > > Maybe a tool like qemu-img? If whatever iscsi service understand > the > > > > > format (at the very least 'raw'), you could get functionality like > > > > > snapshots pretty simple. > > > > > > > > Niels, > > > > > > > > This is brilliant and subset of the Idea falls in one among my > > > > thoughts, only concern is about building dependencies of qemu with > > > > Heketi. > > > > But at an advantage of easy and cool snapshots solution. > > > > > > And well tested as I understand that oVirt is moving to use qemu-img as > > > well. Other tools are able to use the qcow2 format, maybe the iscsi > > > servce that gets used does so too. > > > > > > Has there already been a decision on what Heketi will configure as > iscsi > > > service? I am aware of the tgt [1] and LIO/TCMU [2] projects. > > > > Niels, > > > > yes we will be using TCMU (Kernel Module) and TCMU-runner (user space > > service) to expose file in Gluster volume as an iSCSI target. > > more at [1], [2] & [3] > > > > [1] https://pkalever.wordpress.com/2016/06/23/gluster- > solution-for-non-shared-persistent-storage-in-docker-container/ > > [2] https://pkalever.wordpress.com/2016/06/29/non-shared- > persistent-gluster-storage-with-kubernetes/ > > [3] https://pkalever.wordpress.com/2016/08/16/read-write- > once-persistent-storage-for-openshift-origin-using-gluster/ > > > > -- > > Prasanna > > > > > > > > Niels > > > > > > 1. 
http://stgt.sourceforge.net/ > > > 2. https://github.com/open-iscsi/tcmu-runner > > >http://blog.gluster.org/2016/04/using-lio-with-gluster/ > > > > > > > > > > > -- > > > > Prasanna > > > > > > > > > > > > > > Niels > > > > > > > > > > > > > > >> We also need to document the exact workflow to enable a file in > > > > >> a Gluster volume to be exposed as a block device. This will help > > > > >> determine where the creation of the file could take place. > > > > >> > > > > >> We can capture our decisions from these discussions in the > > > > >> following page: > > > > >> > > > > >> https://github.com/heketi/heketi/wiki/Proposed-Changes > > > > >> > > > > >> - Luis > > > > >> > > > > >> > > > > >> - Original Message - > > > > >> From: "Humble Chirammal"
[Gluster-devel] [Heketi] How pushing to heketi happens - especially about squashing
Hi Michael, We have a new mailing list, it is gluster-devel with [Heketi] in the subject. I probably will add this to the communications wiki page. On the concept of github. It is always interesting to compare what we know and how we are used to something to something new. In Github, we do not need to let Github squash at all. I was doing that as a 'pilot'. The real method is for patches to be added to a PR, and if too many patches are added, for the author to squash them, and send a new one. This is documented in the Development Guide in the Wiki. The author should also note that their first patch/commit sent as a PR is the information used as the PR. Lots of PRs are being sent with almost no information, and I have let this happen because most people are still ramping up. There is no reason why commit messages cannot be as detailed as those from Gerrit. Here is an example: https://github.com/heketi/heketi/pull/393 . The process to update changes is to update the forked branch, and not to amend the same change. Amending makes it impossible to determine the changes from patch to patch, and makes it extremely hard on reviewers (me). Here are my thoughts on your questions below: 1) The the review should not squash the authors commits unless the author explicitly requests or approves that. [lpabon] Absolutely. The pilot, although it worked well technically, it confuses those who come from other source control systems. 2) We should avoid using github to merge because this creates bad commit messages. [lpabon] I'm not sure what you mean by this, but I would not "avoid" github in any way. That is like saying "avoid Gerrit". 3) (As a consequence of the above,) If we push delta-patches to update PRs, that can usually not be the final push, but needs a final iteration of force-pushing an amended patchset. [lpabon] Do not amend patches. NOTE on amended patches. If I notice another one, I will *not* merge the change. 
Sorry to be a pain about that, but it makes it almost impossible to review. This is not Gerrit, this is Github, it is something new, but in my opinion, it is a more natural git workflow. - Luis - Original Message - From: "Michael Adam" To: "Luis Pabón" Sent: Tuesday, September 20, 2016 4:50:01 AM Subject: [RFC] [upstream] How pushing to heketi happens - especially about squashing Hi all, hi Luis, Since we do not have a real upstream ML yet (see my other mail), I want to use this list now to discuss the way patches are merged into heketi upstream. [ tl;dr ? --> look for "summing up" at the bottom... ;-) ] This is after a few weeks of working on the project with you all, especially with Luis, and seeing how he runs the project. And there have been a few surprises on both ends. While I still don't fully like or trust the github UI, it is for instance better than gerrit (But as José says: "That bar is really low..." ;-). One point where it is better is that it can deal with patchsets, i.e. multiple patches submitted as one PR. But github has the feature of squashing the patches instead of merging them as they are. This can be useful or even correct in some situations, but I think generally it should be avoided for reasons detailed below. So in this mail, I am sharing a few observations from the past few weeks, and a few concerns or problems I am having. I think it is important with the growing team to clearly formulate how both reviewers and patch submitters expect the process to work. At least when I propose a patchset, I propose it exactly the way I send it. Coming from Samba and Gluster development, for me as a contributor and as a reviewer, the content of the commits, i.e. the actual diffs as well as the layout into patches and the commit messages are 'sacred' in the sense that this is what the patch submitter proposed and signed-off on for pushing. Hence the reviewer should imho not change the general layout of patches (e.g. 
by squashing them) without consulting the author. Here are two examples where pull requests with two patches were squashed with this method in heketi: https://github.com/heketi/heketi/commit/bbc513ef214c5ec81b6cdb0a3a024944c9fe12ba https://github.com/heketi/heketi/commit/bccab2ee8f70f6862d9bfee3a8cbdf6e47b5a8bf You see what github does: it prints the title of the PR as main commit message and creates a bullet list of the original commit messages. Hence, it really creates pretty bad commits (A commit called "Two minor patches (#499)" - really??)... :-) This is not how these commits were intended by the authors, and it is the actual result of how they look in git after they have been merged. (Btw, I don't look at the git log / code in github: it is difficult to see the relevant things there. I look at it in a local git checkout in the shell. This is the "everlasting", "sacred" content.) So this tells me that: 1) Patches should not be squashed without consulting the author,
Re: [Gluster-devel] Multiplexing - good news, bad news, and a plea for help
> If I understood brick-multiplexing correctly, add-brick/remove-brick > add/remove graphs right? I don't think the graph-cleanup is in good > shape, i.e. it could lead to memory leaks etc. Did you get a chance > to think about it? I haven't tried to address memory leaks specifically, but most of my work has been fixing bugs that have been latent for ages but weren't biting us for one reason or another. For example: * Clients weren't reconnecting properly if setvolume failed (as opposed to the connection itself failing). * FUSE wasn't updating to use a new graph in the proper place, causing requests to be sent down to the old graph (where they'd get stuck). Those might sound simple, but each required hours of debugging to arrive at a diagnosis and fix - and there are several more. When I'm done fixing these kinds of preexisting functional problems, I'll look into preexisting memory leaks. Thanks for the heads-up. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Minutes: Gluster Community Bug Triage meeting (Today)
Hi all, Today's triage meeting has been postponed to next week. Meeting ended Tue Sep 20 12:06:01 2016 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . Minutes: https://meetbot.fedoraproject.org/gluster-meeting/2016-09-20/gluster_community_bug_triage_meeting.2016-09-20-12.00.html Minutes (text): https://meetbot.fedoraproject.org/gluster-meeting/2016-09-20/gluster_community_bug_triage_meeting.2016-09-20-12.00.txt Log: https://meetbot.fedoraproject.org/gluster-meeting/2016-09-20/gluster_community_bug_triage_meeting.2016-09-20-12.00.log.html - Forwarded Message - > From: "Hari Gowtham" > To: "gluster-devel" > Sent: Tuesday, September 20, 2016 3:13:19 PM > Subject: [Gluster-devel] REMINDER: Gluster Community Bug Triage meeting > (Today) > > Hi all, > > The weekly Gluster bug triage is about to take place in two hours > > Meeting details: > - location: #gluster-meeting on Freenode IRC > ( https://webchat.freenode.net/?channels=gluster-meeting ) > - date: every Tuesday > - time: 12:00 UTC > (in your terminal, run: date -d "12:00 UTC") > - agenda: https://public.pad.fsfe.org/p/gluster-bug-triage > > Currently the following items are listed: > * Roll Call > * Status of last week's action items > * Group Triage > * Open Floor > > Appreciate your participation. > > -- > Regards, > Hari. > > -- Regards, Hari. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Multiplexing - good news, bad news, and a plea for help
> That's weird, since the only purpose of the mem-pool was precisely to > improve performance of allocation of objects that are frequently > allocated/released. Very true, and I've long been an advocate of this approach. Unfortunately, for this to work our allocator has to be more efficient than the system's, and it's not - especially wrt locking. Overhead is high and contention is even higher, heavily outweighing any advantage. Unless/until we put in the work to make mem-pools perform better at high thread counts, avoiding them seems like the practical choice. > * Consider http://review.gluster.org/15036/. With all communications > going through the same socket, the problem this patch tries to solve > could become worse. I'll look into this. Thanks! > * We should consider the possibility of implementing a global thread > pool, which would replace io-threads, epoll threads and maybe others. > Synctasks should also rely on this thread pool. This has the benefit > of better controlling the total number of threads. Otherwise when we > have more threads than processor cores, we waste resources > unnecessarily and we won't get a real gain. Even worse, it could start > to degrade due to contention. Also a good idea, though perhaps too hard/complex to tackle in the short term. I did take a stab at making io-threads use a single global set of queues instead of per instance, to address a similar concern. To make a long story short, it didn't seem to make things any better for this test. I still think it's a good idea, though. > * There are *too many* mutexes in the code. Hear, hear. > We should drastically reduce their use. Sometimes by using better > structures that do not require blocking at all or even introducing RCU > and/or rwlocks. One case that I've always had doubts about is dict_t. Why > does it need locks ? 
One xlator should not modify a dict_t once it > has been passed to another xlator, and if we assume that a dict can > only be modified by a single xlator at a time, it's very unlikely that > it needs to modify it from multiple threads. I think in general you're right about dicts, but I also think it would be interesting to disable dict locking and see what breaks. I'll bet there's something *somewhere* that tries to access dicts concurrently. Callbacks for children of a cluster translator using the "fan out" pattern seem particularly suspect. What worries me is the classic problem with race conditions; it's easy to have something that *appears* to work when things aren't running in parallel enough to hit tiny timing windows, but it's a lot harder to be *sure* you're safe even when they do. I think I'd lean toward a more conservative approach of finding the particularly egregious high-contention cases, examining those particular code paths carefully, and changing them to use a lock-free dict variant or alternative. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
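The single-writer rule discussed above can be sketched in miniature (a Python stand-in for illustration only; this is not Gluster's actual dict_t API): if a dict becomes read-only the moment it is handed to the next xlator, the common case needs no lock at all, and any writer that breaks the rule fails loudly instead of racing.

```python
class HandoffDict:
    """Illustrative sketch, not Gluster's dict_t: mutable only until it is
    handed off to the next layer, after which any write raises."""

    def __init__(self):
        self._data = {}
        self._sealed = False

    def set(self, key, value):
        if self._sealed:
            # A racing writer surfaces as a hard error instead of silent corruption.
            raise RuntimeError("dict modified after handoff")
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def seal(self):
        """Hand off: freeze the dict and return the now read-only view."""
        self._sealed = True
        return self

d = HandoffDict()
d.set("glusterfs.open-fd-count", 1)   # key name is just an example
view = d.seal()                       # passed to the next xlator
print(view.get("glusterfs.open-fd-count"))  # → 1
```

Under this discipline concurrent readers need no synchronization either, since the data can no longer change; finding the code paths that violate it is exactly the "disable locking and see what breaks" experiment suggested above.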
Re: [Gluster-devel] Introducing Tendrl
Tendrl is a component that can be used as part of what would be a new management tool, but the project's scope is not the whole stack that would form a universal storage manager. Regards, Ric On Sep 20, 2016 10:23, "Gerard Braad" wrote: > Hi Mrugesh, > > On Tue, Sep 20, 2016 at 3:10 PM, Mrugesh Karnik > wrote: > > I'd like to introduce the Tendrl project. Tendrl aims to build a > > management interface for Ceph. We've pushed some documentation to the > > On Tue, Sep 20, 2016 at 3:15 PM, Mrugesh Karnik > wrote: > > I'd like to introduce the Tendrl project. Tendrl aims to build a > > management interface for Gluster. We've pushed some documentation to > > It might help to introduce Tendrl as the "Universal Storage Manager" > with a possibility to manage Ceph and/or Gluster. > I understand you want specific feedback, but a clear definition of the > tool would be helpful. > > > regards, > > > Gerard > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Heketi] Block store related API design discussion
On Mon, Sep 19, 2016 at 9:22 PM, Niels de Voswrote: > On Mon, Sep 19, 2016 at 10:31:11AM -0400, Luis Pabón wrote: > > Using qemu is interesting, but the I/O should be using the IO path of > QEMU block API. If not, > > TCMU would not know how to work with QEMU dynamic QCOW2 files. > > > > Now, if TCMU already has this, then that would be great! > > It has a qcow2 header, maybe you guys are lucky! > https://github.com/open-iscsi/tcmu-runner/blob/master/qcow2.h Sent the earlier mail before seeing this mail :-). So yes, what we discussed is to see if this qemu in tcmu can internally use gfapi for doing the operations or not is something we are trying to find out. > > > Niels > > > > > - Luis > > > > - Original Message - > > From: "Prasanna Kalever" > > To: "Niels de Vos" > > Cc: "Luis Pabón" , "Stephen Watt" , > "gluster-devel" , "Ramakrishna Yekulla" < > rre...@redhat.com>, "Humble Chirammal" > > Sent: Monday, September 19, 2016 7:13:36 AM > > Subject: Re: [Gluster-devel] [Heketi] Block store related API design > discussion > > > > On Mon, Sep 19, 2016 at 4:09 PM, Niels de Vos wrote: > > > > > > On Mon, Sep 19, 2016 at 03:34:29PM +0530, Prasanna Kalever wrote: > > > > On Mon, Sep 19, 2016 at 10:13 AM, Niels de Vos > wrote: > > > > > On Tue, Sep 13, 2016 at 12:06:00PM -0400, Luis Pabón wrote: > > > > >> Very good points. Thanks Prasanna for putting this together. I > agree with > > > > >> your comments in that Heketi is the high level abstraction API > and it should have > > > > >> an API similar of what is described by Prasanna. > > > > >> > > > > >> I definitely do not think any File Api should be available in > Heketi, > > > > >> because that is an implementation of the Block API. The Heketi > API should > > > > >> be similar to something like OpenStack Cinder. > > > > >> > > > > >> I think that the actual management of the Volumes used for Block > storage > > > > >> and the files in them should be all managed by Heketi. 
How they > are > > > > >> actually created is still to be determined, but we could have > Heketi > > > > >> create them, or have helper programs do that. > > > > > > > > > > Maybe a tool like qemu-img? If whatever iscsi service understand > the > > > > > format (at the very least 'raw'), you could get functionality like > > > > > snapshots pretty simple. > > > > > > > > Niels, > > > > > > > > This is brilliant and subset of the Idea falls in one among my > > > > thoughts, only concern is about building dependencies of qemu with > > > > Heketi. > > > > But at an advantage of easy and cool snapshots solution. > > > > > > And well tested as I understand that oVirt is moving to use qemu-img as > > > well. Other tools are able to use the qcow2 format, maybe the iscsi > > > servce that gets used does so too. > > > > > > Has there already been a decision on what Heketi will configure as > iscsi > > > service? I am aware of the tgt [1] and LIO/TCMU [2] projects. > > > > Niels, > > > > yes we will be using TCMU (Kernel Module) and TCMU-runner (user space > > service) to expose file in Gluster volume as an iSCSI target. > > more at [1], [2] & [3] > > > > [1] https://pkalever.wordpress.com/2016/06/23/gluster- > solution-for-non-shared-persistent-storage-in-docker-container/ > > [2] https://pkalever.wordpress.com/2016/06/29/non-shared- > persistent-gluster-storage-with-kubernetes/ > > [3] https://pkalever.wordpress.com/2016/08/16/read-write- > once-persistent-storage-for-openshift-origin-using-gluster/ > > > > -- > > Prasanna > > > > > > > > Niels > > > > > > 1. http://stgt.sourceforge.net/ > > > 2. https://github.com/open-iscsi/tcmu-runner > > >http://blog.gluster.org/2016/04/using-lio-with-gluster/ > > > > > > > > > > > -- > > > > Prasanna > > > > > > > > > > > > > > Niels > > > > > > > > > > > > > > >> We also need to document the exact workflow to enable a file in > > > > >> a Gluster volume to be exposed as a block device. 
This will help > > > > >> determine where the creation of the file could take place. > > > > >> > > > > >> We can capture our decisions from these discussions in the > > > > >> following page: > > > > >> > > > > >> https://github.com/heketi/heketi/wiki/Proposed-Changes > > > > >> > > > > >> - Luis > > > > >> > > > > >> > > > > >> - Original Message - > > > > >> From: "Humble Chirammal" > > > > >> To: "Raghavendra Talur" > > > > >> Cc: "Prasanna Kalever" , "gluster-devel" < > gluster-devel@gluster.org>, "Stephen Watt" , "Luis > Pabon" , "Michael Adam" , > "Ramakrishna Yekulla" , "Mohamed Ashiq Liyazudeen" < > mliya...@redhat.com> > > > > >> Sent: Tuesday, September 13, 2016 2:23:39 AM > > > > >> Subject: Re: [Gluster-devel]
Re: [Gluster-devel] [Heketi] Block store related API design discussion
On Mon, Sep 19, 2016 at 10:13 AM, Niels de Voswrote: > On Tue, Sep 13, 2016 at 12:06:00PM -0400, Luis Pabón wrote: > > Very good points. Thanks Prasanna for putting this together. I agree > with > > your comments in that Heketi is the high level abstraction API and it > should have > > an API similar of what is described by Prasanna. > > > > I definitely do not think any File Api should be available in Heketi, > > because that is an implementation of the Block API. The Heketi API > should > > be similar to something like OpenStack Cinder. > > > > I think that the actual management of the Volumes used for Block storage > > and the files in them should be all managed by Heketi. How they are > > actually created is still to be determined, but we could have Heketi > > create them, or have helper programs do that. > > Maybe a tool like qemu-img? If whatever iscsi service understand the > format (at the very least 'raw'), you could get functionality like > snapshots pretty simple. > Prasanna, Poornima and I just discussed about this. Prasanna is doing this experiment to see if we can use qcow from tcmu-runner to get this piece working. If yes, we definitely will get snapshots for free :-). Prasanna will confirm it based on his experiments. > > Niels > > > > We also need to document the exact workflow to enable a file in > > a Gluster volume to be exposed as a block device. This will help > > determine where the creation of the file could take place. 
> > > > We can capture our decisions from these discussions in the > > following page: > > > > https://github.com/heketi/heketi/wiki/Proposed-Changes > > > > - Luis > > > > > > - Original Message - > > From: "Humble Chirammal" > > To: "Raghavendra Talur" > > Cc: "Prasanna Kalever" , "gluster-devel" < > gluster-devel@gluster.org>, "Stephen Watt" , "Luis > Pabon" , "Michael Adam" , > "Ramakrishna Yekulla" , "Mohamed Ashiq Liyazudeen" < > mliya...@redhat.com> > > Sent: Tuesday, September 13, 2016 2:23:39 AM > > Subject: Re: [Gluster-devel] [Heketi] Block store related API design > discussion > > > > > > > > > > > > - Original Message - > > | From: "Raghavendra Talur" > > | To: "Prasanna Kalever" > > | Cc: "gluster-devel" , "Stephen Watt" < > sw...@redhat.com>, "Luis Pabon" , > > | "Michael Adam" , "Humble Chirammal" < > hchir...@redhat.com>, "Ramakrishna Yekulla" > > | , "Mohamed Ashiq Liyazudeen" > > | Sent: Tuesday, September 13, 2016 11:08:44 AM > > | Subject: Re: [Gluster-devel] [Heketi] Block store related API design > discussion > > | > > | On Mon, Sep 12, 2016 at 11:30 PM, Prasanna Kalever < > pkale...@redhat.com> > > | wrote: > > | > > | > Hi all, > > | > > > | > This mail is open for discussion on gluster block store integration > with > > | > heketi and its REST API interface design constraints. > > | > > > | > > > | > ___ Volume Request ... > > | > | > > | > | > > | > PVC claim -> Heketi --->| > > | > | > > | > | > > | > | > > | > | > > | > |__ BlockCreate > > | > | | > > | > | |__ BlockInfo > > | > | | > > | > |___ Block Request (APIS)-> |__ BlockResize > > | > | > > | > |__ BlockList > > | > | > > | > |__ BlockDelete > > | > > > | > Heketi will have block API and volume API, when user submit a > Persistent > > | > volume claim, Kubernetes provisioner based on the storage class(from > PVC) > > | > talks to heketi for storage, heketi intern calls block or volume > API's > > | > based on request. > > | > > > | > > | This is probably wrong. 
It won't be Heketi calling block or volume > APIs. It > > | would be Kubernetes calling block or volume API *of* Heketi. > > | > > | > > | > With my limited understanding, heketi currently creates clusters from > > | > provided nodes, creates volumes and handover them to the user. > > | > For block related API's, it has to deal with files right ? > > | > > > | > Here is how block API's look like in short- > > | > Create: heketi has to create file in the volume and export it as a > iscsi > > | > target device and hand it over to user. > > | > Info: show block stores information across all the clusters, >
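The block-request tree sketched above (BlockCreate, BlockInfo, BlockResize, BlockList, BlockDelete) can be summarized as a tiny request builder. This is a hypothetical sketch only: the operation names come from the diagram, but the field names and validation are invented for illustration, since the real Heketi block API was still under design at the time of this thread.

```python
# Proposed block operations, as listed in the API diagram above.
BLOCK_OPS = ("BlockCreate", "BlockInfo", "BlockResize", "BlockList", "BlockDelete")

def block_request(op, **params):
    """Build a request descriptor for one proposed block operation.
    Parameter names (e.g. size_gb, name) are illustrative, not the real API."""
    if op not in BLOCK_OPS:
        raise ValueError("unknown block op: %s" % op)
    return {"op": op, "params": params}

# e.g. what a PVC-driven Kubernetes provisioner might ask Heketi for:
req = block_request("BlockCreate", name="pvc-demo", size_gb=10)
print(req["op"], req["params"]["size_gb"])  # → BlockCreate 10
```

In the flow discussed above, Kubernetes would call such block (or volume) APIs *of* Heketi, and Heketi would then create the backing file in a Gluster volume and export it as an iSCSI target.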
[Gluster-devel] REMINDER: Gluster Community Bug Triage meeting (Today)
Hi all, The weekly Gluster bug triage is about to take place in two hours Meeting details: - location: #gluster-meeting on Freenode IRC ( https://webchat.freenode.net/?channels=gluster-meeting ) - date: every Tuesday - time: 12:00 UTC (in your terminal, run: date -d "12:00 UTC") - agenda: https://public.pad.fsfe.org/p/gluster-bug-triage Currently the following items are listed: * Roll Call * Status of last week's action items * Group Triage * Open Floor Appreciate your participation. -- Regards, Hari. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Multiplexing - good news, bad news, and a plea for help
Jeff, If I understood brick-multiplexing correctly, add-brick/remove-brick add/remove graphs right? I don't think the graph-cleanup is in good shape, i.e. it could lead to memory leaks etc. Did you get a chance to think about it? On Mon, Sep 19, 2016 at 6:56 PM, Jeff Darcy wrote: > I have brick multiplexing[1] functional to the point that it passes all > basic AFR, EC, and quota tests. There are still some issues with tiering, > and I wouldn't consider snapshots functional at all, but it seemed like a > good point to see how well it works. I ran some *very simple* tests with > 20 volumes, each 2x distribute on top of 2x replicate. > > First, the good news: it worked! Getting 80 bricks to come up in the same > process, and then run I/O correctly across all of those, is pretty cool. > Also, memory consumption is *way* down. RSS size went from 1.1GB before > (total across 80 processes) to about 400MB (one process) with > multiplexing. Each process seems to consume approximately 8MB globally > plus 5MB per brick, so (8+5)*80 = 1040 vs. 8+(5*80) = 408. Just > considering the amount of memory, this means we could support about three > times as many bricks as before. When memory *contention* is considered, > the difference is likely to be even greater. > > Bad news: some of our code doesn't scale very well in terms of CPU use. > To test performance I ran a test which would create 20,000 files across all > 20 volumes, then write and delete them, all using 100 client threads. This > is similar to what smallfile does, but deliberately constructed to use a > minimum of disk space - at any given time, only one file per thread (maximum) > actually has 4KB worth of data in it. This allows me to run it against > SSDs or even ramdisks even with high brick counts, to factor out slow disks > in a study of CPU/memory issues. Here are some results and observations. > > * On my first run, the multiplexed version of the test took 77% longer to > run than the non-multiplexed version (5:42 vs. 
3:13). And that was after > I'd done some hacking to use 16 epoll threads. There's something a bit > broken about trying to set that option normally, so that the value you set > doesn't actually make it to the place that tries to spawn the threads. > Bumping this up further to 32 threads didn't seem to help. > > * A little profiling showed me that we're spending almost all of our time > in pthread_spin_lock. I disabled the code to use spinlocks instead of > regular mutexes, which immediately improved performance and also reduced > CPU time by almost 50%. > > * The next round of profiling showed that a lot of the locking is in > mem-pool code, and a lot of that in turn is from dictionary code. Changing > the dict code to use malloc/free instead of mem_get/mem_put gave another > noticeable boost. > > At this point run time was down to 4:50, which is 20% better than where I > started but still far short of non-multiplexed performance. I can drive > that down still further by converting more things to use malloc/free. > There seems to be a significant opportunity here to improve performance - > even without multiplexing - by taking a more careful look at our > memory-management strategies: > > * Tune the mem-pool implementation to scale better with hundreds of > threads. > > * Use mem-pools more selectively, or even abandon them altogether. > > * Try a different memory allocator such as jemalloc. > > I'd certainly appreciate some help/collaboration in studying these options > further. It's a great opportunity to make a large impact on overall > performance without a lot of code or specialized knowledge. Even so, > however, I don't think memory management is our only internal scalability > problem. There must be something else limiting parallelism, and quite > severely at that. My first guess is io-threads, so I'll be looking into > that first, but if anybody else has any ideas please let me know. 
There's > no *good* reason why running many bricks in one process should be slower > than running them in separate processes. If it remains slower, then the > limit on the number of bricks and volumes we can support will remain > unreasonably low. Also, the problems I'm seeing here probably don't *only* > affect multiplexing. Excessive memory/CPU use and poor parallelism are > issues that we kind of need to address anyway, so if anybody has any ideas > please let me know. > > > > [1] http://review.gluster.org/#/c/14763/ > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > -- Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
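Jeff's memory figures above can be written as a two-line model (the 8 MB global and 5 MB per-brick numbers are his rough estimates from this test run, not general guarantees):

```python
def rss_mb(bricks, processes):
    """Approximate total RSS: each brick process costs ~8 MB of global
    allocations, plus ~5 MB for every brick hosted (estimates from the email)."""
    per_process_mb = 8
    per_brick_mb = 5
    return processes * per_process_mb + bricks * per_brick_mb

print(rss_mb(80, 80))  # one process per brick: (8+5)*80 = 1040 MB
print(rss_mb(80, 1))   # all 80 bricks multiplexed: 8 + 5*80 = 408 MB
```

The second figure matches the observed drop from roughly 1.1 GB across 80 processes to about 400 MB in one multiplexed process.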
Re: [Gluster-devel] Multiplexing - good news, bad news, and a plea for help
On 19/09/16 15:26, Jeff Darcy wrote: I have brick multiplexing[1] functional to the point that it passes all basic AFR, EC, and quota tests. There are still some issues with tiering, and I wouldn't consider snapshots functional at all, but it seemed like a good point to see how well it works. I ran some *very simple* tests with 20 volumes, each 2x distribute on top of 2x replicate. First, the good news: it worked! Getting 80 bricks to come up in the same process, and then run I/O correctly across all of those, is pretty cool. Also, memory consumption is *way* down. RSS size went from 1.1GB before (total across 80 processes) to about 400MB (one process) with multiplexing. Each process seems to consume approximately 8MB globally plus 5MB per brick, so (8+5)*80 = 1040 vs. 8+(5*80) = 408. Just considering the amount of memory, this means we could support about three times as many bricks as before. When memory *contention* is considered, the difference is likely to be even greater. Bad news: some of our code doesn't scale very well in terms of CPU use. To test performance I ran a test which would create 20,000 files across all 20 volumes, then write and delete them, all using 100 client threads. This is similar to what smallfile does, but deliberately constructed to use a minimum of disk space - at any given time, only one file per thread (maximum) actually has 4KB worth of data in it. This allows me to run it against SSDs or even ramdisks even with high brick counts, to factor out slow disks in a study of CPU/memory issues. Here are some results and observations. * On my first run, the multiplexed version of the test took 77% longer to run than the non-multiplexed version (5:42 vs. 3:13). And that was after I'd done some hacking to use 16 epoll threads. There's something a bit broken about trying to set that option normally, so that the value you set doesn't actually make it to the place that tries to spawn the threads. 
Bumping this up further to 32 threads didn't seem to help. * A little profiling showed me that we're spending almost all of our time in pthread_spin_lock. I disabled the code to use spinlocks instead of regular mutexes, which immediately improved performance and also reduced CPU time by almost 50%. * The next round of profiling showed that a lot of the locking is in mem-pool code, and a lot of that in turn is from dictionary code. Changing the dict code to use malloc/free instead of mem_get/mem_put gave another noticeable boost. That's weird, since the only purpose of the mem-pool was precisely to improve performance of allocation of objects that are frequently allocated/released. At this point run time was down to 4:50, which is 20% better than where I started but still far short of non-multiplexed performance. I can drive that down still further by converting more things to use malloc/free. There seems to be a significant opportunity here to improve performance - even without multiplexing - by taking a more careful look at our memory-management strategies: * Tune the mem-pool implementation to scale better with hundreds of threads. * Use mem-pools more selectively, or even abandon them altogether. * Try a different memory allocator such as jemalloc. I'd certainly appreciate some help/collaboration in studying these options further. It's a great opportunity to make a large impact on overall performance without a lot of code or specialized knowledge. Even so, however, I don't think memory management is our only internal scalability problem. There must be something else limiting parallelism, and quite severely at that. My first guess is io-threads, so I'll be looking into that first, but if anybody else has any ideas please let me know. There's no *good* reason why running many bricks in one process should be slower than running them in separate processes. 
If it remains slower, then the limit on the number of bricks and volumes we can support will remain unreasonably low. Also, the problems I'm seeing here probably don't *only* affect multiplexing. Excessive memory/CPU use and poor parallelism are issues that we kind of need to address anyway, so if anybody has any ideas please let me know. You have done a really good job :) Some points I would look into: * Consider http://review.gluster.org/15036/. With all communications going through the same socket, the problem this patch tries to solve could become worse. * We should consider the possibility of implementing a global thread pool, which would replace io-threads, epoll threads and maybe others. Synctasks should also rely on this thread pool. This has the benefit of better controlling the total number of threads. Otherwise when we have more threads than processor cores, we waste resources unnecessarily and we won't get a real gain. Even worse, it could start to degrade due to contention. * There are *too many* mutexes in the code. We should
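The global-thread-pool suggestion above can be sketched as follows (a Python stand-in; Gluster itself is C, so this only illustrates the shape of the idea): io-threads, epoll handlers and synctasks would all submit work to one pool capped near the core count, instead of each subsystem spawning its own threads.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# One shared pool for the whole process, sized to the machine so threads
# are not oversubscribed across subsystems.
GLOBAL_POOL = ThreadPoolExecutor(max_workers=os.cpu_count() or 4)

def submit_work(fn, *args):
    """Every subsystem (io-threads, epoll, synctask) routes work here."""
    return GLOBAL_POOL.submit(fn, *args)

futures = [submit_work(pow, 2, n) for n in range(8)]
print([f.result() for f in futures])  # → [1, 2, 4, 8, 16, 32, 64, 128]
```

The design point is the cap: with one pool the total thread count can be held near the number of cores, which is what avoids the oversubscription and contention described above.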
Re: [Gluster-devel] Jenkins Jobs on Gerrit
On Mon, Sep 12, 2016 at 08:04:37AM +0200, Niels de Vos wrote: > Ah, ok, so the repository in GitHub will not be a mirror of the one in > Gerrit that contains the JJB files? Do you plan to have the new > repository (in Gerrit) also push to a repository on GitHub? I do not at this point intend to push this repo to Github. Part of this is trying to use Gerrit as the single source of truth for our infra-related repos. This is a commitment I made during the Gerrit upgrade. Part of the gaps to fill up will be covered by cgit so that you can visually see the repo through the web without cloning it. > It seems you posted an incomplete URL in the 1st email, the project in > Gerrit would be http://review.gluster.org/#/admin/projects/build-jobs Indeed, thank you for pointing to the right one! -- nigelb ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Gluster and FreeBSD
On Tue, Sep 20, 2016 at 07:51:56AM +, Emmanuel Dreyfus wrote: > > An attempt to clarify some apparent confusion: Despite their very similar > names, *BSD are not different distributions of the same software like > Linux distributions are. NetBSD and FreeBSD are distinct operating systems, > with their own kernels and userlands that diverged from a common ancestor > 23 years ago. > > This is why you should not take FreeBSD behaviors for granted on NetBSD, > and vice-versa. Noted. I was making sure it didn't affect FreeBSD for sure. -- nigelb ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Gluster and FreeBSD
On Tue, Sep 20, 2016 at 09:16:54AM +0530, Nigel Babu wrote: > Giving this thread a signal boost. We should think about this if we're going > to > continue to support *BSD. An attempt to clarify some apparent confusion: Despite their very similar names, *BSD are not different distributions of the same software like Linux distributions are. NetBSD and FreeBSD are distinct operating systems, with their own kernels and userlands that diverged from a common ancestor 23 years ago. This is why you should not take FreeBSD behaviors for granted on NetBSD, and vice-versa. -- Emmanuel Dreyfus m...@netbsd.org ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Introducing Tendrl
Hi Mrugesh, On Tue, Sep 20, 2016 at 3:10 PM, Mrugesh Karnik wrote: > I'd like to introduce the Tendrl project. Tendrl aims to build a > management interface for Ceph. We've pushed some documentation to the On Tue, Sep 20, 2016 at 3:15 PM, Mrugesh Karnik wrote: > I'd like to introduce the Tendrl project. Tendrl aims to build a > management interface for Gluster. We've pushed some documentation to It might help to introduce Tendrl as the "Universal Storage Manager" with a possibility to manage Ceph and/or Gluster. I understand you want specific feedback, but a clear definition of the tool would be helpful. regards, Gerard ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Introducing Tendrl
Hi all, I'd like to introduce the Tendrl project. Tendrl aims to build a management interface for Gluster. We've pushed some documentation to the documentation repository[1]. The documentation should provide an understanding of the architecture and the components therein. This is still a work in progress. So please feel free to ask questions and make suggestions via the mailing list[2] and Github Issues[3]. There's an IRC channel[4] as well. Thanks. [1] https://github.com/Tendrl/documentation [2] https://www.redhat.com/mailman/listinfo/tendrl-devel [3] https://github.com/Tendrl/documentation/issues [4] #tendrl-devel on Freenode -- Mrugesh ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel