Re: [Gluster-devel] I/O performance

2019-01-31 Thread Vijay Bellur
On Thu, Jan 31, 2019 at 11:12 PM Xavi Hernandez 
wrote:

> On Fri, Feb 1, 2019 at 7:54 AM Vijay Bellur  wrote:
>
>>
>>
>> On Thu, Jan 31, 2019 at 10:01 AM Xavi Hernandez 
>> wrote:
>>
>>> Hi,
>>>
>>> I've been doing some tests with the global thread pool [1], and I've
>>> observed one important thing:
>>>
>>> Since this new thread pool has very low contention (apparently), it
>>> exposes other problems when the number of threads grows. What I've seen
>>> is that some workloads use all available threads on bricks to do I/O,
>>> causing avgload to grow rapidly and saturating the machine (or so it
>>> seems), which really makes everything slower. Reducing the maximum number
>>> of threads actually improves performance. Other workloads, though, do
>>> little I/O (probably most is locking or smallfile operations). In this
>>> case, limiting the number of threads to a small value causes a
>>> performance reduction. To increase performance we need more threads.
>>>
>>> So this is making me think that maybe we should implement some sort of
>>> I/O queue with a maximum I/O depth for each brick (or disk, if bricks
>>> share the same disk). This way we can limit the number of requests
>>> physically accessing the underlying FS concurrently, without actually
>>> limiting the number of threads that can be doing other things on each
>>> brick. I think this could improve performance.
>>>
>>
>> Perhaps we could throttle both aspects - number of I/O requests per disk
>> and the number of threads too?  That way we will have the ability to behave
>> well when there is bursty I/O to the same disk and when there are multiple
>> concurrent requests to different disks. Do you have a reason to not limit
>> the number of threads?
>>
>
> No, in fact the global thread pool does have a limit on the number of
> threads. I'm not saying we should replace the thread limit with I/O depth
> control; I think we need both. We need to clearly identify which threads
> are doing I/O and limit them, even if there are more threads available.
> The reason is simple: suppose we have a fixed number of threads. If heavy
> load is sent in parallel, it's quite possible that all threads get blocked
> doing some I/O. This has two consequences:
>
>    1. There are no more threads left to do other things, like sending
>    answers to the client or starting to process new incoming requests, so
>    the CPU is underutilized.
>    2. Massive parallel access to an FS actually decreases performance.
>
> This means that we can do less work and the work takes more time, which
> is bad.
>
> If we limit the number of threads that can actually be doing FS I/O, it's
> easy to keep the FS responsive and we'll still have threads available to do
> other work.
>


Got it, thx.


>
>
>>
>>> Maybe this approach could also be useful on the client side, but I think
>>> it's not so critical there.
>>>
>>
>> Agree, rate limiting on the server side would be more appropriate.
>>
>
> The only thing to consider here is that if we limit the rate on servers but
> clients can generate more requests without limit, we may require lots of
> memory to track all ongoing requests. Anyway, I think this is not the most
> important thing now, so if we solve the server-side problem, we can then
> check whether this is really needed (it could happen that client
> applications limit themselves automatically because they will be waiting
> for answers from the server before sending more requests, unless the number
> of applications running concurrently is really huge).
>

We could enable throttling in the rpc layer to deal with a client performing
aggressive I/O. RPC throttling should be able to handle the scenario
described above.
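
As a rough illustration of the idea (not GlusterFS code; the structure and
names below are made up for this sketch), a per-connection token bucket
could decide whether to keep reading requests from a client:

/*
 * Hypothetical per-connection token bucket for rpc throttling.
 * Illustrative only; not taken from the GlusterFS rpc layer.
 */
#include <stdbool.h>
#include <time.h>

struct rpc_bucket {
        double tokens;          /* currently available request credits */
        double rate;            /* credits added per second */
        double burst;           /* maximum credits (burst size) */
        struct timespec last;   /* time of the last refill */
};

static double elapsed_sec(const struct timespec *a, const struct timespec *b)
{
        return (b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) / 1e9;
}

/* Returns true if the incoming request may be processed now. */
bool rpc_bucket_allow(struct rpc_bucket *b)
{
        struct timespec now;

        clock_gettime(CLOCK_MONOTONIC, &now);
        b->tokens += b->rate * elapsed_sec(&b->last, &now);
        if (b->tokens > b->burst)
                b->tokens = b->burst;
        b->last = now;

        if (b->tokens < 1.0)
                return false;   /* caller can stop reading from the socket */

        b->tokens -= 1.0;
        return true;
}

When rpc_bucket_allow() returns false, the transport could simply stop
polling that connection until enough credits accumulate, which naturally
pushes back on an aggressive client.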

-Vijay
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] I/O performance

2019-01-31 Thread Xavi Hernandez
On Fri, Feb 1, 2019 at 7:54 AM Vijay Bellur  wrote:

>
>
> On Thu, Jan 31, 2019 at 10:01 AM Xavi Hernandez 
> wrote:
>
>> Hi,
>>
>> I've been doing some tests with the global thread pool [1], and I've
>> observed one important thing:
>>
>> Since this new thread pool has very low contention (apparently), it
>> exposes other problems when the number of threads grows. What I've seen is
>> that some workloads use all available threads on bricks to do I/O, causing
>> avgload to grow rapidly and saturating the machine (or so it seems), which
>> really makes everything slower. Reducing the maximum number of threads
>> actually improves performance. Other workloads, though, do little I/O
>> (probably most is locking or smallfile operations). In this case, limiting
>> the number of threads to a small value causes a performance reduction. To
>> increase performance we need more threads.
>>
>> So this is making me think that maybe we should implement some sort of
>> I/O queue with a maximum I/O depth for each brick (or disk, if bricks
>> share the same disk). This way we can limit the number of requests
>> physically accessing the underlying FS concurrently, without actually
>> limiting the number of threads that can be doing other things on each
>> brick. I think this could improve performance.
>>
>
> Perhaps we could throttle both aspects - number of I/O requests per disk
> and the number of threads too?  That way we will have the ability to behave
> well when there is bursty I/O to the same disk and when there are multiple
> concurrent requests to different disks. Do you have a reason to not limit
> the number of threads?
>

No, in fact the global thread pool does have a limit on the number of
threads. I'm not saying we should replace the thread limit with I/O depth
control; I think we need both. We need to clearly identify which threads are
doing I/O and limit them, even if there are more threads available. The
reason is simple: suppose we have a fixed number of threads. If heavy load
is sent in parallel, it's quite possible that all threads get blocked doing
some I/O. This has two consequences:

   1. There are no more threads left to do other things, like sending
   answers to the client or starting to process new incoming requests, so
   the CPU is underutilized.
   2. Massive parallel access to an FS actually decreases performance.

This means that we can do less work and the work takes more time, which is
bad.

If we limit the number of threads that can actually be doing FS I/O, it's
easy to keep the FS responsive and we'll still have threads available to do
other work.
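
Just to make the gating idea concrete, the simplest expression of it would
be a counting semaphore around the FS calls, so that only a limited number
of threads can be inside a filesystem syscall at any time. This is a
hypothetical sketch (MAX_IO_DEPTH, io_gate and gated_pread are invented
names); a real implementation would probably queue the request instead of
blocking the thread, along the lines of the I/O queue proposed in my first
email:

/*
 * Illustrative only: cap the number of threads that can be inside an FS
 * syscall at once, without capping the total number of threads.
 */
#include <semaphore.h>
#include <unistd.h>

#define MAX_IO_DEPTH 8          /* per-brick (or per-disk) limit, tunable */

static sem_t io_gate;

void io_gate_init(void)
{
        sem_init(&io_gate, 0, MAX_IO_DEPTH);
}

/* Wrap the actual FS access so at most MAX_IO_DEPTH threads hit the disk. */
ssize_t gated_pread(int fd, void *buf, size_t count, off_t offset)
{
        ssize_t ret;

        sem_wait(&io_gate);     /* blocks only when the disk is saturated */
        ret = pread(fd, buf, count, offset);
        sem_post(&io_gate);

        return ret;
}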


>
>> Maybe this approach could also be useful on the client side, but I think
>> it's not so critical there.
>>
>
> Agree, rate limiting on the server side would be more appropriate.
>

The only thing to consider here is that if we limit the rate on servers but
clients can generate more requests without limit, we may require lots of
memory to track all ongoing requests. Anyway, I think this is not the most
important thing now, so if we solve the server-side problem, we can then
check whether this is really needed (it could happen that client
applications limit themselves automatically because they will be waiting for
answers from the server before sending more requests, unless the number of
applications running concurrently is really huge).

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] I/O performance

2019-01-31 Thread Vijay Bellur
On Thu, Jan 31, 2019 at 10:01 AM Xavi Hernandez 
wrote:

> Hi,
>
> I've been doing some tests with the global thread pool [1], and I've
> observed one important thing:
>
> Since this new thread pool has very low contention (apparently), it
> exposes other problems when the number of threads grows. What I've seen is
> that some workloads use all available threads on bricks to do I/O, causing
> avgload to grow rapidly and saturating the machine (or so it seems), which
> really makes everything slower. Reducing the maximum number of threads
> actually improves performance. Other workloads, though, do little I/O
> (probably most is locking or smallfile operations). In this case, limiting
> the number of threads to a small value causes a performance reduction. To
> increase performance we need more threads.
>
> So this is making me think that maybe we should implement some sort of I/O
> queue with a maximum I/O depth for each brick (or disk, if bricks share the
> same disk). This way we can limit the number of requests physically
> accessing the underlying FS concurrently, without actually limiting the
> number of threads that can be doing other things on each brick. I think
> this could improve performance.
>

Perhaps we could throttle both aspects - number of I/O requests per disk
and the number of threads too?  That way we will have the ability to behave
well when there is bursty I/O to the same disk and when there are multiple
concurrent requests to different disks. Do you have a reason to not limit
the number of threads?


> Maybe this approach could also be useful on the client side, but I think
> it's not so critical there.
>

Agree, rate limiting on the server side would be more appropriate.


Thanks,
Vijay
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] I/O performance

2019-01-31 Thread Xavi Hernandez
Hi,

I've been doing some tests with the global thread pool [1], and I've
observed one important thing:

Since this new thread pool has very low contention (apparently), it exposes
other problems when the number of threads grows. What I've seen is that
some workloads use all available threads on bricks to do I/O, causing
avgload to grow rapidly and saturating the machine (or so it seems), which
really makes everything slower. Reducing the maximum number of threads
actually improves performance. Other workloads, though, do little I/O
(probably most is locking or smallfile operations). In this case, limiting
the number of threads to a small value causes a performance reduction. To
increase performance we need more threads.

So this is making me think that maybe we should implement some sort of I/O
queue with a maximum I/O depth for each brick (or disk, if bricks share the
same disk). This way we can limit the number of requests physically
accessing the underlying FS concurrently, without actually limiting the
number of threads that can be doing other things on each brick. I think
this could improve performance.
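
To make this more concrete, a per-brick I/O queue with a maximum depth could
look roughly like the sketch below: at most max_depth requests touch the
underlying FS at once, and excess requests wait in a FIFO without tying up a
thread. This is only an illustration of the idea, not code from Gluster, and
all the names are invented:

/* Hypothetical per-brick I/O queue with a configurable maximum depth. */
#include <pthread.h>

struct io_request {
        struct io_request *next;
        void (*dispatch)(struct io_request *);  /* does the actual FS access */
};

struct brick_io_queue {
        pthread_mutex_t lock;
        unsigned int in_flight;         /* requests currently inside the FS */
        unsigned int max_depth;         /* configurable per brick or disk */
        struct io_request *head, *tail; /* pending requests, FIFO order */
};

/* Called when a new I/O request arrives for this brick. */
void brick_io_submit(struct brick_io_queue *q, struct io_request *req)
{
        int run = 0;

        pthread_mutex_lock(&q->lock);
        if (q->in_flight < q->max_depth) {
                q->in_flight++;
                run = 1;                /* dispatch immediately */
        } else {
                req->next = NULL;       /* defer it; the thread stays free */
                if (q->tail)
                        q->tail->next = req;
                else
                        q->head = req;
                q->tail = req;
        }
        pthread_mutex_unlock(&q->lock);

        if (run)
                req->dispatch(req);
}

/* Called when a dispatched request completes its FS access. */
void brick_io_done(struct brick_io_queue *q)
{
        struct io_request *next = NULL;

        pthread_mutex_lock(&q->lock);
        if (q->head) {
                next = q->head;         /* hand the free slot to a queued one */
                q->head = next->next;
                if (!q->head)
                        q->tail = NULL;
        } else {
                q->in_flight--;
        }
        pthread_mutex_unlock(&q->lock);

        if (next)
                next->dispatch(next);
}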

Maybe this approach could also be useful on the client side, but I think
it's not so critical there.

What do you think ?

Xavi

[1] https://review.gluster.org/c/glusterfs/+/20636
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Performance improvements

2019-01-31 Thread Xavi Hernandez
On Sun, Jan 27, 2019 at 8:03 AM Xavi Hernandez 
wrote:

> On Fri, 25 Jan 2019, 08:53 Vijay Bellur 
>> Thank you for the detailed update, Xavi! This looks very interesting.
>>
>> On Thu, Jan 24, 2019 at 7:50 AM Xavi Hernandez 
>> wrote:
>>
>>> Hi all,
>>>
>>> I've just updated a patch [1] that implements a new thread pool based on
>>> a wait-free queue provided by the userspace-rcu library. The patch also
>>> includes an auto-scaling mechanism that keeps only as many threads
>>> running as the current workload needs.
>>>
>>> This new approach has some advantages:
>>>
>>>- It's provided globally inside libglusterfs instead of inside an
>>>xlator
>>>
>>> This makes it possible for the fuse thread and epoll threads to transfer
>>> the received request to another thread sooner, wasting less CPU and
>>> reacting sooner to other incoming requests.
>>>
>>>
>>>- Adding jobs to the queue used by the thread pool only requires an
>>>atomic operation
>>>
>>> This makes the producer side of the queue really fast, with almost no
>>> delay.
>>>
>>>
>>>- Contention is reduced
>>>
>>> The producer side has negligible contention thanks to the wait-free
>>> enqueue operation based on an atomic access. The consumer side requires a
>>> mutex, but it is held for a very short time, and the scaling mechanism
>>> makes sure that no more threads than needed contend for the mutex.
>>>
>>>
>>> This change disables io-threads, since it replaces part of its
>>> functionality. However, there are two things from io-threads that could
>>> still be needed:
>>>
>>>- Prioritization of fops
>>>
>>> Currently, io-threads assigns priorities to each fop, so that some fops
>>> are handled before others.
>>>
>>>
>>>- Fair distribution of execution slots between clients
>>>
>>> Currently, io-threads processes requests from each client in round-robin
>>> order.
>>>
>>>
>>> These features are not implemented right now. If they are needed,
>>> probably the best thing to do would be to keep them inside io-threads, but
>>> change its implementation so that it uses the global threads from the
>>> thread pool instead of its own threads.
>>>
>>
>>
>> These features are indeed useful to have and hence modifying the
>> implementation of io-threads to provide this behavior would be welcome.
>>
>>
>>
>>>
>>>
>>> These tests have shown that the limiting factor has been the disk in
>>> most cases, so it's hard to tell whether the change has really improved
>>> things. There is only one clear exception: self-heal on a dispersed
>>> volume completes 12.7% faster. CPU utilization has also dropped
>>> drastically:
>>>
>>> Old implementation: 12.30 user, 41.78 sys, 43.16 idle,  0.73 wait
>>>
>>> New implementation: 4.91 user,  5.52 sys, 81.60 idle,  5.91 wait
>>>
>>>
>>> Now I'm running some more tests on NVMe to try to see the effects of the
>>> change when the disk is not limiting performance. I'll update once I have
>>> more data.
>>>
>>>
>> Will look forward to these numbers.
>>
>
> I have identified an issue that limits the number of active threads when
> load is high, causing some regressions. I'll fix it and rerun the tests on
> Monday.
>

Once the issue was solved, the fix caused high load averages for some
workloads, which actually resulted in a regression (too much I/O, I guess)
instead of improving performance. So I added a configurable maximum number
of threads and made the whole implementation optional, so that it can be
used safely when required.

I did some tests and was able to at least match the performance we had
before this patch in all cases, and in some cases improve on it. But each
test needed manual configuration of the number of threads.

I need to work on a way to automatically compute the maximum so that it can
be used easily in any workload (or even combined workloads).

I uploaded the latest version of the patch.
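
In case it helps to follow the discussion, here is a rough sketch of the
queueing pattern described earlier in this thread (wait-free enqueue on the
producer side, a short lock only on the consumer side), using the
userspace-rcu wfcqueue API. It is not the code from the patch, and the names
(struct job, pool_submit, pool_take) are made up:

#include <stdlib.h>
#include <urcu/wfcqueue.h>
#include <urcu/compiler.h>      /* caa_container_of */

struct job {
        struct cds_wfcq_node node;
        void (*fn)(void *);
        void *arg;
};

static struct cds_wfcq_head queue_head;
static struct cds_wfcq_tail queue_tail;

void pool_init(void)
{
        cds_wfcq_init(&queue_head, &queue_tail);
}

/* Producer side: a single wait-free enqueue, no mutex involved. */
int pool_submit(void (*fn)(void *), void *arg)
{
        struct job *job = malloc(sizeof(*job));

        if (!job)
                return -1;
        cds_wfcq_node_init(&job->node);
        job->fn = fn;
        job->arg = arg;
        cds_wfcq_enqueue(&queue_head, &queue_tail, &job->node);
        return 0;
}

/* Consumer side: the dequeue takes a short internal lock, so only worker
 * threads contend, and only briefly. Returns NULL if the queue is empty;
 * a real pool would sleep and be woken up instead of polling. */
struct job *pool_take(void)
{
        struct cds_wfcq_node *node;

        node = cds_wfcq_dequeue_blocking(&queue_head, &queue_tail);
        return node ? caa_container_of(node, struct job, node) : NULL;
}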

Xavi


> Xavi
>
>
>>
>> Regards,
>> Vijay
>>
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel