On Sat, Jan 26, 2019 at 8:03 AM Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
> On Fri, Jan 11, 2019 at 8:09 PM Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>
>> Here is an update on the progress till now:
>>
>> * The client profile attached till now shows that tuple creation is dominated by writes and fstats. Note that the fstats are a side-effect of the writes, as writes invalidate the attributes of the file in the kernel attribute cache.
>> * The rest of the init phase (which is marked by the messages "setting primary key" and "vacuum") is dominated by reads. The next bigger set of operations are writes, followed by fstats.
>>
>> So, writes, reads and fstats are the only operations we need to optimize to reduce the init-time latency. As mentioned in my previous mail, I did the following tunings:
>>
>> * Enabled only write-behind, md-cache and open-behind.
>>   - write-behind was configured with a cache-size/window-size of 20MB.
>>   - open-behind was configured with read-after-open yes.
>>   - md-cache was loaded as a child of write-behind in the xlator graph. As a parent of write-behind, stats in md-cache would be invalidated by the responses of writes cached in write-behind; loaded as a child of write-behind, this problem is not there. Note that in both cases fstat passes through write-behind (in the former case because md-cache has no valid stats), but in the latter case fstats can be served by md-cache.
>>   - md-cache used to aggressively invalidate inodes. For the purpose of this test, I just commented out the inode-invalidate code in md-cache. We need to fine-tune the invalidation logic.
>>   - Set group metadata-cache to on, but turned off upcall notifications. Note that this workload accesses all its data through a single mount point, so no files are shared across mounts and hence it is safe to turn off invalidations.
>> * Applied the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1648781
>>
>> With the above set of tunings I could reduce the init time of scale 8000 from 16.6 hrs to 11.4 hrs - an improvement in the range of 25% to 30%.
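For anyone trying to reproduce this setup, the tunings above map roughly to the volume-set commands below. This is a sketch from memory - option names and defaults can differ between gluster versions - and the two remaining changes (loading md-cache as a child of write-behind and commenting out its inode invalidation) were manual volfile/source edits that the CLI cannot express:

  gluster volume set nvme-r3 performance.write-behind on
  gluster volume set nvme-r3 performance.write-behind-window-size 20MB
  gluster volume set nvme-r3 performance.open-behind on
  gluster volume set nvme-r3 performance.read-after-open yes
  gluster volume set nvme-r3 group metadata-cache
  gluster volume set nvme-r3 features.cache-invalidation off
  gluster volume set nvme-r3 performance.io-cache off
  gluster volume set nvme-r3 performance.read-ahead off

In the hand-edited client volfile the reordering just swaps the parent/child relationship of the two translators, roughly like this (volume names are illustrative):

  volume nvme-r3-md-cache
      type performance/md-cache
      subvolumes nvme-r3-open-behind
  end-volume

  volume nvme-r3-write-behind
      type performance/write-behind
      subvolumes nvme-r3-md-cache
  end-volume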
>> Since the workload is dominated by reads, we think a good read cache, where reads of regions just written are served from the cache, would greatly improve performance. The kernel page cache already provides that functionality, along with read-ahead (which is more intelligent and serves more read patterns than Glusterfs read-ahead), so we wanted to try that. But Manoj found a bug where reads following writes are not served from the page cache [5]. I am currently waiting for the resolution of this bug. As an alternative, I can modify io-cache to serve reads from the data just written, but that change comes with its own challenges, so I would like a resolution on [5] (either positive or negative) before proceeding with modifications to io-cache.
>>
>> As to the rpc latency, Krutika had identified long back that reading a single rpc message involves at least 4 reads on the socket. That many reads are needed to discover the structure of the message on the fly. The reason we wanted to discover the structure of the rpc message was to identify the part containing the read or write payload and make sure that the payload is read directly into a buffer different from the one holding the rest of the rpc message. This strategy makes sure payloads are not copied again when buffers are moved across caches (read-ahead, io-cache etc.), and also that the rest of the rpc message can be freed even though the payload outlives it (when payloads are cached). However, we can experiment with an approach where we either do away with the zero-copy requirement or let the entire buffer containing the rpc message and payload live in the cache.
>>
>> From my observations and discussions with Manoj and Xavi, this workload is very sensitive to latency (more than to concurrency). So, I am hopeful the above approaches will give positive results.
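To make the zero-copy idea concrete, here is a toy sketch of the receive path in C (not gluster's actual rpc-transport code: the header length is assumed to be known up front instead of decoded from the message, and error handling is mostly trimmed). The only point is that the payload lands in its own allocation, so caches can hold it without copying and without pinning the header memory:

/* Read one rpc fragment so that the read/write payload lands in a buffer
 * separate from the rpc/program headers. Assumed layout: 4-byte record
 * marker (fragment length), headers, payload. */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>

struct rpc_msg {
    char   *hdr;         /* rpc + program headers, can be freed early     */
    size_t  hdr_len;
    char   *payload;     /* read/write payload, may outlive hdr in caches */
    size_t  payload_len;
};

static int read_full(int fd, void *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t ret = read(fd, (char *)buf + done, len - done);
        if (ret <= 0)
            return -1;
        done += ret;
    }
    return 0;
}

int read_rpc_msg(int sock, size_t hdr_len, struct rpc_msg *msg)
{
    uint32_t marker;

    /* 1. record marker: total fragment length (top bit = last-fragment flag) */
    if (read_full(sock, &marker, sizeof(marker)) < 0)
        return -1;
    size_t frag_len = ntohl(marker) & 0x7fffffff;

    /* 2. headers into a small, short-lived buffer */
    msg->hdr_len = hdr_len;
    msg->hdr = malloc(hdr_len);
    if (!msg->hdr || read_full(sock, msg->hdr, hdr_len) < 0)
        return -1;

    /* 3. payload straight into its own buffer, so read-ahead/io-cache can
     *    keep it while the header buffer is freed */
    msg->payload_len = frag_len - hdr_len;
    msg->payload = malloc(msg->payload_len);
    if (!msg->payload)
        return -1;

    return read_full(sock, msg->payload, msg->payload_len);
}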
> Manoj, Csaba and I figured out that invalidations by md-cache and Fuse auto-invalidation were dropping the kernel page cache (more details in [5]). Changes to stats by writes from the same client (local writes) were triggering both these codepaths, dropping the cache. Since all the I/O done by this workload goes through the caches of a single client, the invalidations are not necessary, so I made code changes to fuse-bridge to disable auto-invalidation completely and commented out inode invalidation in md-cache. Note that this doesn't regress the consistency/coherency of the data seen in the caches, as it is a single-client use case. With these two changes, coupled with the earlier optimizations (client-io-threads=on, server/client-event-threads=4, md-cache as a child of write-behind in the xlator graph, performance.md-cache-timeout=600), pgbench init of scale 8000 on a volume with NVMe backend completed in 54m25s. This is a whopping 94% improvement over the time we started out with (59280s vs 3360s).

These numbers were taken from the latest run I had scheduled. However, I didn't notice that the test had failed midway. From another test that had completed successfully, the number is 139m7s. That would be an improvement of 86% (59280s vs 8340s; 1 - 8340/59280 ≈ 0.86). I've scheduled another run just to be sure. So the improvement is 86%, not 94%.

> [root@shakthi4 ~]# gluster volume info
>
> Volume Name: nvme-r3
> Type: Replicate
> Volume ID: d1490bcc-bcf1-4e09-91e8-ab01d9781263
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: shakthi4:/gluster/nvme0n1/bricks/nvme-r3-1
> Brick2: shakthi4:/gluster/nvme0n1/bricks/nvme-r3-2
> Brick3: shakthi4:/gluster/nvme0n1/bricks/nvme-r3-3
> Options Reconfigured:
> server.event-threads: 4
> client.event-threads: 4
> diagnostics.client-log-level: INFO
> performance.md-cache-timeout: 600
> performance.io-cache: off
> performance.read-ahead: off
> diagnostics.count-fop-hits: on
> diagnostics.latency-measurement: on
> transport.address-family: inet
> nfs.disable: on
> performance.client-io-threads: on
> performance.stat-prefetch: on
>
> I'll be concentrating on how to disable fuse auto-invalidation without regressing on the consistency model we've been providing till now. The consistency model Glusterfs has been providing till now is close-to-open consistency, similar to what NFS provides [6][7].
>
> But the initial thought is that, at least for the pgbench test case, there is no harm in totally disabling fuse auto-invalidation and md-cache invalidations: this workload runs entirely on a single mount point, so invalidations themselves are not necessary, as all I/O goes through the caches and hence the caches are in sync with the state of the file on the backend.
>
> [6] http://nfs.sourceforge.net/#faq_a8
> [7] https://lists.gluster.org/pipermail/gluster-users/2013-March/012805.html
>
>> [5] https://bugzilla.redhat.com/show_bug.cgi?id=1664934
>>
>> regards,
>> Raghavendra
>>
>> On Fri, Dec 28, 2018 at 12:44 PM Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>>
>>> On Mon, Dec 24, 2018 at 6:05 PM Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>>>
>>>> On Mon, Dec 24, 2018 at 3:40 PM Sankarshan Mukhopadhyay <sankarshan.mukhopadh...@gmail.com> wrote:
>>>>
>>>>> [pulling the conclusions up to enable better in-line]
>>>>>
>>>>> > Conclusions:
>>>>> >
>>>>> > We should never have a volume with caching-related xlators disabled. The price we pay for it is too high. We need to make them work consistently and aggressively to avoid as many requests as we can.
>>>>>
>>>>> Are there current issues in terms of behavior which are known/observed when these are enabled?
>>>>
>>>> We did have issues with pgbench in the past, but they have been fixed. Please refer to bz [1] for details. On 5.1, it runs successfully with all caching-related xlators enabled. Having said that, the only performance xlators which gave improved performance were open-behind and write-behind [2] (write-behind had some issues, which will be fixed by [3], and we'll have to measure performance again with the fix for [3]).
>>>
>>> One quick update: enabling write-behind and md-cache with the fix for [3] reduced the total time taken for the pgbench init phase by roughly 20%-25% (from 12.5 min to 9.75 min for a scale of 100). This is still a huge time (around 12 hrs for a db of scale 8000). I'll follow up with a detailed report once my experiments are complete. Currently trying to optimize the read path.
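For reference, the init phase measured throughout this thread is essentially the following (database name and connection details are illustrative; the Postgres data directory lives on the glusterfs mount, and scale 100 was used for the quicker runs mentioned above):

  createdb pgbench
  time pgbench -i -s 8000 pgbench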
>>>> For some reason, read-side caching didn't improve transactions per second. I am working on this problem currently. Note that these bugs measure the transaction phase of pgbench, but what Xavi measured in his mail is the init phase. Nevertheless, evaluation of read caching (metadata/data) will still be relevant for the init phase too.
>>>>
>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1512691
>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1629589#c4
>>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1648781
>>>>
>>>>> > We need to analyze client/server xlators deeper to see if we can avoid some delays. However, optimizing something that is already at the microsecond level can be very hard.
>>>>>
>>>>> That is true - are there any significant gains which can be accrued by putting effort here, or should this be a lower priority?
>>>>
>>>> The problem identified by Xavi is also the one we (Manoj, Krutika, Milind and I) had encountered in the past [4]. The solution we used was to have multiple rpc connections between a single brick and client, and it did fix the bottleneck. So, there is definitely work involved here - either to fix the single-connection model or to go with a multiple-connection model. It is preferable to improve the single connection and resort to multiple connections only if the bottlenecks in a single connection are not fixable. Personally I think this is high priority, along with having appropriate client-side caching.
>>>>
>>>> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1467614#c52
>>>>
>>>>> > We need to determine what causes the fluctuations on the brick side and avoid them.
>>>>> > This scenario is very similar to a smallfile/metadata workload, so this is probably one important cause of its bad performance.
>>>>>
>>>>> What kind of instrumentation is required to enable the determination?
>>>>>
>>>>> On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez <xhernan...@redhat.com> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > I've done some tracing of the latency that the network layer introduces in gluster. I've made the analysis as part of the pgbench performance issue (in particular the initialization and scaling phase), so I decided to look at READV for this particular workload, but I think the results can be extrapolated to other operations that also have small latency (cached data from the FS, for example).
>>>>> >
>>>>> > Note that measuring latencies introduces some latency. It consists of a call to clock_gettime() for each probe point, so the real latency will be a bit lower, but still proportional to these numbers.
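To put the probe overhead mentioned above in perspective: each probe point is essentially a clock_gettime() pair plus a subtraction. A minimal sketch of such a probe (illustrative, not the actual instrumentation used for these traces):

/* Minimal latency probe: wrap the code path of interest with two
 * clock_gettime() calls and report the elapsed time in microseconds. */
#include <stdio.h>
#include <time.h>

static long elapsed_us(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1000000L + (b->tv_nsec - a->tv_nsec) / 1000L;
}

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    /* ... code path being measured, e.g. a READV round trip ... */
    clock_gettime(CLOCK_MONOTONIC, &end);

    printf("latency: %ld us\n", elapsed_us(&start, &end));
    return 0;
}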
>>>>>
>>>>> [snip]
>>>>> _______________________________________________
>>>>> Gluster-devel mailing list
>>>>> Gluster-devel@gluster.org
>>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel

_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel