On Thu, Mar 21, 2019 at 5:43 PM Yaniv Kaul <[email protected]> wrote:
>
> On Thu, Mar 21, 2019 at 5:23 PM Nithya Balachandran <[email protected]> wrote:
>>
>> On Thu, 21 Mar 2019 at 16:16, Atin Mukherjee <[email protected]> wrote:
>>>
>>> All,
>>>
>>> In the last few releases of glusterfs, with stability as a primary theme, a lot of code-optimization changes have been made in the expectation that they would give gluster better performance. While many of these changes do help, lately we have started seeing some adverse effects from them, one in particular being the calloc-to-malloc conversions. I understand that malloc eliminates the extra memset overhead that calloc bears, but with recent kernels and strong in-built compiler optimizations I am not sure that makes any significant difference. As I mentioned earlier, if this isn't done carefully it can potentially introduce a lot of bugs, and I'm writing this email to share one such experience.
>>>
>>> Sanju & I spent the last two days trying to figure out why https://review.gluster.org/#/c/glusterfs/+/22388/ wasn't working on Sanju's system while the same fix ran without problems in my gluster containers. After spending a significant amount of time, we figured out that a malloc call [1] (which was a calloc earlier) is the culprit. As you can all see, in this function we allocate txn_id and copy event->txn_id into it through gf_uuid_copy(). But when we stepped through this in gdb, txn_id was not an exact copy of event->txn_id and contained some junk values, which caused glusterd_clear_txn_opinfo to be invoked later with a wrong txn_id, so the leaks the fix was meant to plug remained.
>>>
>>> This was quite painful to debug and we had to spend some time to figure it out. Considering we have converted many such calls in the past, I'd urge that we review all such conversions and see if there are any side effects. Otherwise we might end up running into many memory-related bugs later on. OTOH, going forward I'd request every patch owner/maintainer to pay special attention to these conversions and verify that they are really beneficial and error free. IMO, the general guideline should be: for bigger buffers, malloc makes better sense but has to be done carefully; for smaller sizes, we should stick with calloc.
>>>
>>> What do others think about it?
>>
>> I believe that replacing calloc with malloc everywhere without adequate testing and review is not safe, and I am against doing so for the following reasons:
>
> No patch should get in without adequate testing and thorough review.
>
>> 1. Most of these patches have not been tested, especially the error paths. I have seen some that introduced issues in error scenarios with pointers being non-null.
>
> You raise an interesting issue. Why are freed memory pointers not NULLed? Why does FREE() set ptr = (void *)0xeeeeeeee and not NULL? This is a potential cause of failures. A recurring FREE(NULL) is harmless. A FREE(0xeeeeeeee) is a bit more problematic.
>
>> 2. As Raghavendra said, the code assumes that certain elements will be initialized to null/zero, and changing that can have consequences which are not immediately obvious. I think it might be worthwhile to go through the already merged calloc->malloc patches to check error paths and so on to see if they are safe.
>
> Agreed.
>
>> 3. Making such changes to the libglusterfs code while we are currently working to stabilize the product is not a good idea.
>> The patches take time to review, and any errors introduced in the core pieces affect all processes and require significant effort to debug.
>
> Let me know when we consider the project stable. I'd argue the way to stabilize it is not to stop improving it, but to improve its testing: from more tests to cover more code, to more static-analysis coverage, to ensuring CI is rock solid (inc. random errors that pop up from time to time). Not accepting patches to master is not the right approach, unless it's time-boxed somehow. If it is, then it means we don't trust our CI enough, btw.
>
>> Yaniv, while the example you provided might make sense to change to malloc, a lot of the other changes, in my opinion, are not worth the effort required. For performance testing, smallfile might be a useful tool to see if any of the changes make a difference. That said, I am reluctant to take in patches that change core code significantly without their being tested or providing proof of benefits.
>
> Is smallfile part of CI? I am happy to see it documented at https://docs.gluster.org/en/latest/Administrator%20Guide/Performance%20Testing/#smallfile-distributed-io-benchmark , so at least one can know how to execute it manually.
Following up the above link to the smallfile repo leads to a 404 (I'm assuming we don't have a link checker running on our documentation, so it can break from time to time?). I assume it's https://github.com/distributed-system-analysis/smallfile ?
Y.

>> We need better performance and scalability, but that is going to need changes in our algorithms and fop handling, and that is what we need to be working on. Such changes, when done right, will provide more benefits than the micro-optimizations. I think it unlikely that the micro-optimizations will provide much benefit, but I am willing to be proven wrong if you have numbers that show otherwise.
>
> I call them nano-optimizations. I believe they really shave only that much; a microsecond saved would be great [looking at https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html , I'm in that ballpark figure]. Unless we get enough cache misses (btw, it would be nice to add likely()/unlikely() to our code, to give the compiler a hint about code paths). Not sure that helps much either, of course; same figures, I estimate.
>
> But I do like the fact that they remove the false assumption that everything is NULL and we don't need to worry about it; it means we need to look at our structures (and, while at it, align them!) and make sure we know what we initialize.
>
> But I certainly understand the resistance and will abandon the patches if they are not welcome.
> Y.
>
>> Regards,
>> Nithya
>>
>>> [1] https://github.com/gluster/glusterfs/blob/master/xlators/mgmt/glusterd/src/glusterd-op-sm.c#L5681
>>>
>>> _______________________________________________
>>> maintainers mailing list
>>> [email protected]
>>> https://lists.gluster.org/mailman/listinfo/maintainers
>>
>> _______________________________________________
>> Gluster-devel mailing list
>> [email protected]
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
_______________________________________________
maintainers mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/maintainers
