[Gluster-devel] [DHT] serialized readdir(p) across subvols and effect on performance

2018-12-31 Thread Raghavendra Gowdappa
All,

As many of us are aware, readdir(p)s are serialized across DHT subvols. One
of the intuitive first reactions to this algorithm is that readdir(p) is
going to be slow.

However, this is only partly true. Reading the contents of a directory is
normally split into multiple readdir(p) calls, and most of the time a single
readdir(p) request is served from a single subvolume (or two subvolumes in
the worst case), so a single readdir(p) is not serialized across all
subvolumes. This holds whenever the directory is large enough that the
dentries and inode data on each subvolume exceed a typical readdir(p) buffer
size (128KB when readdir-ahead is enabled, 4KB on fuse when readdir-ahead is
disabled).

Having said that, there are definitely cases where a single readdir(p)
request is serialized across many subvolumes. The best example is a
readdir(p) request on an empty directory. Other relevant examples are
directories which don't have enough dentries to fill a single readdir(p)
buffer on each subvolume of DHT. This is where performance.parallel-readdir
helps. Note that this is also the reason why making the cache-size of each
readdir-ahead instance (loaded as a parent of each DHT subvolume when
performance.parallel-readdir is enabled) much bigger than a single
readdir(p) buffer size won't really improve performance in proportion to
the cache-size.
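
To make the above concrete, here is a small, standalone C simulation (this
is not DHT code; the subvolume count, average entry size and directory
sizes are illustrative assumptions) of the serialized traversal: a single
readdir(p) keeps moving to the next subvolume only while its reply buffer
still has room.

/*
 * Standalone simulation (not GlusterFS/DHT code) of DHT's serialized
 * readdir(p): one request winds to the next subvolume only while the
 * reply buffer still has room.  Entry size, subvolume count and the
 * directory sizes below are illustrative assumptions.
 */
#include <stdio.h>

#define SUBVOLS    4
#define BUF_SIZE   (128 * 1024)  /* typical readdirp buffer with readdir-ahead */
#define ENTRY_SIZE 256           /* assumed average dentry + iatt footprint    */

/* Returns how many subvolumes one readdir(p) request had to visit. */
static int
one_readdirp(const int entries_per_subvol[SUBVOLS])
{
    size_t filled = 0;
    int visited = 0;

    for (int s = 0; s < SUBVOLS; s++) {
        visited++;
        filled += (size_t)entries_per_subvol[s] * ENTRY_SIZE;
        if (filled >= BUF_SIZE)
            break;  /* buffer full: reply without touching more subvols */
    }
    return visited;
}

int
main(void)
{
    int large_dir[SUBVOLS]  = {5000, 5000, 5000, 5000}; /* plenty per subvol */
    int empty_dir[SUBVOLS]  = {0, 0, 0, 0};
    int sparse_dir[SUBVOLS] = {10, 10, 10, 10};         /* too few to fill buffer */

    printf("large dir : 1 readdirp visited %d subvol(s)\n", one_readdirp(large_dir));
    printf("empty dir : 1 readdirp visited %d subvol(s)\n", one_readdirp(empty_dir));
    printf("sparse dir: 1 readdirp visited %d subvol(s)\n", one_readdirp(sparse_dir));
    return 0;
}

Compiled and run, this reports that the large directory is served by a
single subvolume per request, while the empty and sparse directories force
the same request to visit every subvolume - exactly the cases where
performance.parallel-readdir helps.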

Though this is not a new observation [1] (I stumbled upon [1] after
independently realizing the above while working on
performance.parallel-readdir), I feel this is a common misconception (I ran
into a similar argument while recently explaining the DHT architecture to
someone new to Glusterfs), and hence thought of writing a mail to clarify
it.

Wish you a happy new year 2019 :).

[1] https://lists.gnu.org/archive/html/gluster-devel/2013-09/msg00034.html

regards,
Raghavendra
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench

2018-12-31 Thread Sankarshan Mukhopadhyay
On Fri 28 Dec, 2018, 12:44 Raghavendra Gowdappa 
>
> On Mon, Dec 24, 2018 at 6:05 PM Raghavendra Gowdappa 
> wrote:
>
>>
>>
>> On Mon, Dec 24, 2018 at 3:40 PM Sankarshan Mukhopadhyay <
>> sankarshan.mukhopadh...@gmail.com> wrote:
>>
>>> [pulling the conclusions up to enable better in-line]
>>>
>>> > Conclusions:
>>> >
>>> > We should never have a volume with caching-related xlators disabled.
>>> The price we pay for it is too high. We need to make them work consistently
>>> and aggressively to avoid as many requests as we can.
>>>
>>> Are there current issues in terms of behavior which are known/observed
>>> when these are enabled?
>>>
>>
>> We did have issues with pgbench in the past, but they have been fixed.
>> Please refer to bz [1] for details. On 5.1, it runs successfully with all
>> caching-related xlators enabled. Having said that, the only performance
>> xlators which gave improved performance were open-behind and write-behind
>> [2] (write-behind had some issues, which will be fixed by [3], and we'll
>> have to measure performance again with the fix for [3]).
>>
>
> One quick update. Enabling write-behind and md-cache with the fix for [3]
> reduced the total time taken for the pgbench init phase by roughly 20%-25%
> (from 12.5 min to 9.75 min at a scale of 100). This is still a huge amount
> of time, though (around 12 hrs for a db of scale 8000). I'll follow up with
> a detailed report once my experiments are complete. Currently I am trying
> to optimize the read path.
>
>
>> For some reason, read-side caching didn't improve transactions per
>> second. I am working on this problem currently. Note that these bugs
>> measure the transaction phase of pgbench, whereas what Xavi measured in
>> his mail is the init phase. Nevertheless, evaluation of read caching
>> (metadata/data) will still be relevant for the init phase too.
>>
>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1512691
>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1629589#c4
>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1648781
>>
>
I think that what I am looking forward to is a well-defined set of next
steps, and periodic updates to this list, that eventually result in a
formal and recorded procedure to ensure that Gluster performs best for
these application workloads.


>>
>>
>>> > We need to analyze the client/server xlators more deeply to see if we
>>> can avoid some delays. However, optimizing something that is already at
>>> the microsecond level can be very hard.
>>>
>>> That is true - are there any significant gains which can be accrued by
>>> putting effort here, or should this be a lower priority?
>>>
>>
>> The problem identified by Xavi is also the one we (Manoj, Krutika, Milind
>> and I) had encountered in the past [4]. The solution we used was to have
>> multiple rpc connections between a single brick and client, and it indeed
>> fixed the bottleneck. So, there is definitely work involved here - either
>> to fix the single-connection model or to go with the multiple-connection
>> model. It's preferable to improve the single connection and resort to
>> multiple connections only if the bottlenecks in a single connection are
>> not fixable. Personally, I think this is high priority, along with having
>> appropriate client-side caching.
>>
>> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1467614#c52
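
As an aside, here is a minimal sketch of the multiple-connection idea from
[4]: spread requests destined for the same brick across N transports (e.g.
round-robin) so that a single socket and its polling thread stop being the
bottleneck. This is purely illustrative and not rpc-lib code; conn_pool_t
and pick_conn() are made-up names for the example.

/*
 * Illustrative sketch only -- not GlusterFS rpc code.  Requests to the
 * same brick are distributed round-robin over several connections.
 */
#include <stdatomic.h>
#include <stdio.h>

#define NUM_CONNS 4

typedef struct {
    int         sock[NUM_CONNS]; /* one fd per connection to the same brick */
    atomic_uint next;            /* round-robin cursor */
} conn_pool_t;

/* Pick the connection the next request should be submitted on. */
static int
pick_conn(conn_pool_t *pool)
{
    unsigned idx = atomic_fetch_add(&pool->next, 1) % NUM_CONNS;
    return pool->sock[idx];
}

int
main(void)
{
    conn_pool_t pool = {.sock = {10, 11, 12, 13}, .next = 0};

    /* Eight requests get spread evenly over the four connections. */
    for (int req = 0; req < 8; req++)
        printf("request %d -> fd %d\n", req, pick_conn(&pool));
    return 0;
}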
>>
>>
>>> > We need to determine what causes the fluctuations on the brick side
>>> and avoid them.
>>> > This scenario is very similar to a smallfile/metadata workload, so
>>> this is probably one important cause of its bad performance.
>>>
>>> What kind of instrumentation is required to enable the determination?
>>>
>>> On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez 
>>> wrote:
>>> >
>>> > Hi,
>>> >
>>> > I've done some tracing of the latency that the network layer introduces
>>> in gluster. I've made the analysis as part of the pgbench performance issue
>>> (in particular the initialization and scaling phase), so I decided to look
>>> at READV for this particular workload, but I think the results can be
>>> extrapolated to other operations that also have small latency (cached data
>>> from the FS, for example).
>>> >
>>> > Note that measuring latencies introduces some latency. It consists of
>>> a call to clock_gettime() for each probe point, so the real latency will
>>> be a bit lower, but still proportional to these numbers.
>>> >
>>>
>>> [snip]
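
For reference, here is a minimal example of the kind of probe described
above: two clock_gettime() calls bracket the instrumented section, which is
also why each probe point adds a small, roughly constant overhead to the
measured numbers. traced_work() is just a placeholder for the code path
being measured; the real probes live inside gluster's xlators.

/*
 * Minimal latency probe: clock_gettime() before and after the measured
 * code path.  traced_work() is a placeholder for the instrumented code.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static void
traced_work(void)
{
    usleep(100); /* stand-in for e.g. a READV being processed */
}

int
main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0); /* probe: enter */
    traced_work();
    clock_gettime(CLOCK_MONOTONIC, &t1); /* probe: exit */

    long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
    printf("latency: %.3f us\n", ns / 1000.0);
    return 0;
}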
>>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel