Re: [Gluster-devel] [Gluster-users] Error in gluster v11

2023-05-17 Thread Xavi Hernandez
On Tue, May 16, 2023 at 4:00 PM Gilberto Ferreira <
gilberto.nune...@gmail.com> wrote:

> Hi again
> I just noticed that there is some updates from glusterd
>
> apt list --upgradable
> Listing... Done
> glusterfs-client/unknown 11.0-2 amd64 [upgradable from: 11.0-1]
> glusterfs-common/unknown 11.0-2 amd64 [upgradable from: 11.0-1]
> glusterfs-server/unknown 11.0-2 amd64 [upgradable from: 11.0-1]
> libgfapi0/unknown 11.0-2 amd64 [upgradable from: 11.0-1]
> libgfchangelog0/unknown 11.0-2 amd64 [upgradable from: 11.0-1]
> libgfrpc0/unknown 11.0-2 amd64 [upgradable from: 11.0-1]
> libgfxdr0/unknown 11.0-2 amd64 [upgradable from: 11.0-1]
> libglusterfs0/unknown 11.0-2 amd64 [upgradable from: 11.0-1]
>
> Perhaps this could fix the issue?
>

No. I think this update just fixes a packaging problem in the latest version.
The patch won't be included in any official version until it has been properly
tested and merged into the main code. Hopefully the reporter of the GitHub
issue will be able to test it so that it can be verified and included in
the next 11.x release.

Regards,

Xavi

---
> Gilberto Nunes Ferreira
> (47) 99676-7530 - Whatsapp / Telegram
>
>
>
>
>
>
> On Tue, May 16, 2023 at 09:31, Gilberto Ferreira <
> gilberto.nune...@gmail.com> wrote:
>
>> Ok. No problem. I can test it in a virtual environment.
>> Send me the path.
>> Oh, by the way, I don't compile gluster from scratch.
>> I used the deb files from
>> https://download.gluster.org/pub/gluster/glusterfs/LATEST/Debian/
>>
>> ---
>> Gilberto Nunes Ferreira
>> (47) 99676-7530 - Whatsapp / Telegram
>>
>>
>>
>>
>>
>>
>> On Tue, May 16, 2023 at 09:21, Xavi Hernandez
>> wrote:
>>
>>> Hi Gilberto,
>>>
>>> On Tue, May 16, 2023 at 12:56 PM Gilberto Ferreira <
>>> gilberto.nune...@gmail.com> wrote:
>>>
>>>> Hi Xavi
>>>> That's depend. Is it safe? I have this env production you know???
>>>>
>>>
>>> It should be safe, but I wouldn't test it on production. Can't you try
>>> it in any test environment before ?
>>>
>>> Xavi
>>>
>>>
>>>>
>>>> ---
>>>> Gilberto Nunes Ferreira
>>>> (47) 99676-7530 - Whatsapp / Telegram
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, May 16, 2023 at 07:45, Xavi Hernandez <
>>>> jaher...@redhat.com> wrote:
>>>>
>>>>> The referenced GitHub issue now has a potential patch that could fix
>>>>> the problem, though it will need to be verified. Could you try to apply 
>>>>> the
>>>>> patch and check if the problem persists ?
>>>>>
>>>>> On Mon, May 15, 2023 at 2:10 AM Gilberto Ferreira <
>>>>> gilberto.nune...@gmail.com> wrote:
>>>>>
>>>>>> Hi there, anyone in the Gluster Devel list.
>>>>>>
>>>>>> Any fix about this issue?
>>>>>>
>>>>>> May 14 07:05:39 srv01 vms[9404]: [2023-05-14 10:05:39.618424 +] C
>>>>>> [gf-io-uring.c:612:gf_io_uring_cq_process_some] (-->/lib/x86_64
>>>>>> -linux-gnu/libglusterfs.so.0(+0x849ae) [0x7fb4ebace9ae]
>>>>>> -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8a2e5)
>>>>>> [0x7fb4ebad42e5] -->/lib
>>>>>> /x86_64-linux-gnu/libglusterfs.so.0(+0x8a1a5) [0x7fb4ebad41a5] ) 0-:
>>>>>> Assertion failed:
>>>>>> May 14 07:05:39 srv01 vms[9404]: patchset: git://
>>>>>> git.gluster.org/glusterfs.git
>>>>>> May 14 07:05:39 srv01 vms[9404]: package-string: glusterfs 11.0
>>>>>> ---
>>>>>> Gilberto Nunes Ferreira
>>>>>> (47) 99676-7530 - Whatsapp / Telegram
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, May 14, 2023 at 16:53, Strahil Nikolov <
>>>>>> hunter86...@yahoo.com> wrote:
>>>>>>
>>>>>>> Looks similar to https://github.com/gluster/glusterfs/issues/4104
>>>>>>> I don’t see any progress there.
>>>>>>> Maybe asking in gluster-devel (in CC) could help.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Strahil Nikolov
>>>>>>>
>>>>>

Re: [Gluster-devel] [Gluster-users] Error in gluster v11

2023-05-16 Thread Xavi Hernandez
Hi Gilberto,

On Tue, May 16, 2023 at 12:56 PM Gilberto Ferreira <
gilberto.nune...@gmail.com> wrote:

> Hi Xavi
> That depends. Is it safe? I have this env in production, you know???
>

It should be safe, but I wouldn't test it in production. Can't you try it
in a test environment first?

Xavi


>
> ---
> Gilberto Nunes Ferreira
> (47) 99676-7530 - Whatsapp / Telegram
>
>
>
>
>
>
> On Tue, May 16, 2023 at 07:45, Xavi Hernandez
> wrote:
>
>> The referenced GitHub issue now has a potential patch that could fix the
>> problem, though it will need to be verified. Could you try to apply the
>> patch and check if the problem persists ?
>>
>> On Mon, May 15, 2023 at 2:10 AM Gilberto Ferreira <
>> gilberto.nune...@gmail.com> wrote:
>>
>>> Hi there, anyone in the Gluster Devel list.
>>>
>>> Any fix about this issue?
>>>
>>> May 14 07:05:39 srv01 vms[9404]: [2023-05-14 10:05:39.618424 +] C
>>> [gf-io-uring.c:612:gf_io_uring_cq_process_some] (-->/lib/x86_64
>>> -linux-gnu/libglusterfs.so.0(+0x849ae) [0x7fb4ebace9ae]
>>> -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8a2e5) [0x7fb4ebad42e5]
>>> -->/lib
>>> /x86_64-linux-gnu/libglusterfs.so.0(+0x8a1a5) [0x7fb4ebad41a5] ) 0-:
>>> Assertion failed:
>>> May 14 07:05:39 srv01 vms[9404]: patchset: git://
>>> git.gluster.org/glusterfs.git
>>> May 14 07:05:39 srv01 vms[9404]: package-string: glusterfs 11.0
>>> ---
>>> Gilberto Nunes Ferreira
>>> (47) 99676-7530 - Whatsapp / Telegram
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, May 14, 2023 at 16:53, Strahil Nikolov <
>>> hunter86...@yahoo.com> wrote:
>>>
>>>> Looks similar to https://github.com/gluster/glusterfs/issues/4104
>>>> I don’t see any progress there.
>>>> Maybe asking in gluster-devel (in CC) could help.
>>>>
>>>> Best Regards,
>>>> Strahil Nikolov
>>>>
>>>>
>>>> On Sunday, May 14, 2023, 5:28 PM, Gilberto Ferreira <
>>>> gilberto.nune...@gmail.com> wrote:
>>>>
>>>> Anybody also has this error?
>>>>
>>>> May 14 07:05:39 srv01 vms[9404]: [2023-05-14 10:05:39.618424 +] C
>>>> [gf-io-uring.c:612:gf_io_uring_cq_process_some] (-->/lib/x86_64
>>>> -linux-gnu/libglusterfs.so.0(+0x849ae) [0x7fb4ebace9ae]
>>>> -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8a2e5) [0x7fb4ebad42e5]
>>>> -->/lib
>>>> /x86_64-linux-gnu/libglusterfs.so.0(+0x8a1a5) [0x7fb4ebad41a5] ) 0-:
>>>> Assertion failed:
>>>> May 14 07:05:39 srv01 vms[9404]: patchset: git://
>>>> git.gluster.org/glusterfs.git
>>>> May 14 07:05:39 srv01 vms[9404]: package-string: glusterfs 11.0
>>>>
>>>> ---
>>>> Gilberto Nunes Ferreira
>>>> (47) 99676-7530 - Whatsapp / Telegram
>>>>
>>>>
>>>>
>>>>
>>>> 
>>>>
>>>>
>>>>
>>>> Community Meeting Calendar:
>>>>
>>>> Schedule -
>>>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>>>> Bridge: https://meet.google.com/cpu-eiue-hvk
>>>> Gluster-users mailing list
>>>> gluster-us...@gluster.org
>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>>
>>>> ---
>>>
>>> Community Meeting Calendar:
>>> Schedule -
>>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>>> Bridge: https://meet.google.com/cpu-eiue-hvk
>>>
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>>
---

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] [Gluster-users] Error in gluster v11

2023-05-16 Thread Xavi Hernandez
The referenced GitHub issue now has a potential patch that could fix the
problem, though it will need to be verified. Could you try to apply the
patch and check if the problem persists?
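
In case it helps, this is roughly how the patch could be applied on top of
the 11.0 sources (just an untested sketch; the patch file name is a
placeholder, take the real one from the GitHub issue):

# rough sketch; proposed-fix.patch is a placeholder for the patch from the issue
git clone https://github.com/gluster/glusterfs.git
cd glusterfs
git checkout v11.0
patch -p1 < /path/to/proposed-fix.patch
./autogen.sh && ./configure && make -j$(nproc)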

On Mon, May 15, 2023 at 2:10 AM Gilberto Ferreira <
gilberto.nune...@gmail.com> wrote:

> Hi there, anyone in the Gluster Devel list.
>
> Any fix about this issue?
>
> May 14 07:05:39 srv01 vms[9404]: [2023-05-14 10:05:39.618424 +] C
> [gf-io-uring.c:612:gf_io_uring_cq_process_some] (-->/lib/x86_64
> -linux-gnu/libglusterfs.so.0(+0x849ae) [0x7fb4ebace9ae]
> -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8a2e5) [0x7fb4ebad42e5]
> -->/lib
> /x86_64-linux-gnu/libglusterfs.so.0(+0x8a1a5) [0x7fb4ebad41a5] ) 0-:
> Assertion failed:
> May 14 07:05:39 srv01 vms[9404]: patchset: git://
> git.gluster.org/glusterfs.git
> May 14 07:05:39 srv01 vms[9404]: package-string: glusterfs 11.0
> ---
> Gilberto Nunes Ferreira
> (47) 99676-7530 - Whatsapp / Telegram
>
>
>
>
>
>
> On Sun, May 14, 2023 at 16:53, Strahil Nikolov <
> hunter86...@yahoo.com> wrote:
>
>> Looks similar to https://github.com/gluster/glusterfs/issues/4104
>> I don’t see any progress there.
>> Maybe asking in gluster-devel (in CC) could help.
>>
>> Best Regards,
>> Strahil Nikolov
>>
>>
>> On Sunday, May 14, 2023, 5:28 PM, Gilberto Ferreira <
>> gilberto.nune...@gmail.com> wrote:
>>
>> Anybody also has this error?
>>
>> May 14 07:05:39 srv01 vms[9404]: [2023-05-14 10:05:39.618424 +] C
>> [gf-io-uring.c:612:gf_io_uring_cq_process_some] (-->/lib/x86_64
>> -linux-gnu/libglusterfs.so.0(+0x849ae) [0x7fb4ebace9ae]
>> -->/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8a2e5) [0x7fb4ebad42e5]
>> -->/lib
>> /x86_64-linux-gnu/libglusterfs.so.0(+0x8a1a5) [0x7fb4ebad41a5] ) 0-:
>> Assertion failed:
>> May 14 07:05:39 srv01 vms[9404]: patchset: git://
>> git.gluster.org/glusterfs.git
>> May 14 07:05:39 srv01 vms[9404]: package-string: glusterfs 11.0
>>
>> ---
>> Gilberto Nunes Ferreira
>> (47) 99676-7530 - Whatsapp / Telegram
>>
>>
>>
>>
>> 
>>
>>
>>
>> Community Meeting Calendar:
>>
>> Schedule -
>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>> Bridge: https://meet.google.com/cpu-eiue-hvk
>> Gluster-users mailing list
>> gluster-us...@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>
>> ---
>
> Community Meeting Calendar:
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://meet.google.com/cpu-eiue-hvk
>
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
---

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] [Gluster-Maintainers] Release 11: Revisting our proposed timeline and features

2022-10-17 Thread Xavi Hernandez
On Mon, Oct 17, 2022 at 10:40 AM Yaniv Kaul  wrote:

>
>
> On Mon, Oct 17, 2022 at 8:41 AM Xavi Hernandez 
> wrote:
>
>> On Mon, Oct 17, 2022 at 4:03 AM Amar Tumballi  wrote:
>>
>>> Here is my honest take on this one.
>>>
>>> On Tue, Oct 11, 2022 at 3:06 PM Shwetha Acharya 
>>> wrote:
>>>
>>>> It is time to evaluate the fulfillment of our committed
>>>> features/improvements and the feasibility of the proposed deadlines as per 
>>>> Release
>>>> 11 tracker <https://github.com/gluster/glusterfs/issues/3023>.
>>>>
>>>>
>>>> Currently our timeline is as follows:
>>>>
>>>> Code Freeze: 31-Oct-2022
>>>> RC : 30-Nov-2022
>>>> GA : 10-JAN-2023
>>>>
>>>> *Please evaluate the following and reply to this thread if you want to
>>>> convey anything important:*
>>>>
>>>> - Can we ensure to fulfill all the proposed requirements by the Code
>>>> Freeze?
>>>> - Do we need to add any more changes to accommodate any shortcomings or
>>>> improvements?
>>>> - Are we all good to go with the proposed timeline?
>>>>
>>>>
>>> We have delayed the release already by more than 1year, and that is a
>>> significant delay for any project. If the changes we work on is not getting
>>> released frequently, the feedback loop for the project is delayed and hence
>>> the further improvements. So, regardless of any pending promised things, we
>>> should go ahead with the code-freeze and release on these dates.
>>>
>>> It is crucial for any projects / companies dependent on the project to
>>> plan accordingly. There may be already few others who would have planned
>>> their product release around these dates. Lets keep the same dates, and try
>>> to achieve the tasks we have planned in these dates.
>>>
>>
>> I agree. Pending changes will need to be added to next release. Doing it
>> at last time is not safe for stability.
>>
>
> Generally, +1.
>
> - Some info on my in-flight PRs:
>
> I have multiple independent patches for the flexible array member
> conversion of different variables that are pending:
> https://github.com/gluster/glusterfs/pull/3873
> https://github.com/gluster/glusterfs/pull/3872
> https://github.com/gluster/glusterfs/pull/3868  (this one is particularly
> interesting, I hope it works!)
> https://github.com/gluster/glusterfs/pull/3861
> https://github.com/gluster/glusterfs/pull/3870 (already in review,
> perhaps it can get it soon?)
>

I'm already looking at these and I expect they can be merged before the
current code-freeze date.


> I have this for one for inode related code, which got some attention
> recently:
> https://github.com/gluster/glusterfs/pull/3226
>

I'll try to review this one before code-freeze, but it requires much more
care. Any help will be appreciated.


>
> I think this one is worthwhile looking at:
> https://github.com/gluster/glusterfs/pull/3854
>

I'll try to take a look at this one also.


> I wish we could get rid of old, unsupported versions:
> https://github.com/gluster/glusterfs/pull/3544
> (there's more to do, in different patches, but it's a start)
>

This one is mostly ok, but I think we can't release a new version without
an explicit check for unsupported versions at least at the beginning, to
avoid problems when users upgrade directly from 3.x to 11.x.


> None of them is critical for release 11, though I'm unsure if I'll have
> the ability to complete them later.
>
>
> - The lack of EL9 official support (inc. testing infra.) is regrettable,
> and I think something worth fixing *before* release 11 - adding sanity on
> newer OS releases, which will use io_uring for example, is something we
> should definitely consider.
>
> Lastly, I thought zstandard compression to the CDC xlator is interesting
> for 11 (https://github.com/gluster/glusterfs/pull/3841) - unsure if it's
> ready for inclusion, but overall impact for stability should be low,
> considered this is not a fully supported xlator anyway (and then
> https://github.com/gluster/glusterfs/pull/3835 should / could be
> considered as well).
>

I already started the review, but I'm not very familiar with cdc and the
compression libraries, so I'll need some more time.


>
> Last though:
> If we are just time-based - sure, there's value in going forward and
> releasing it - there are hundreds (or more) of great patches already
> merged, I think there's value here.
> If we wish to look at features and impactful changes to the users 

Re: [Gluster-devel] [Gluster-Maintainers] Release 11: Revisting our proposed timeline and features

2022-10-16 Thread Xavi Hernandez
On Mon, Oct 17, 2022 at 4:03 AM Amar Tumballi  wrote:

> Here is my honest take on this one.
>
> On Tue, Oct 11, 2022 at 3:06 PM Shwetha Acharya 
> wrote:
>
>> It is time to evaluate the fulfillment of our committed
>> features/improvements and the feasibility of the proposed deadlines as per 
>> Release
>> 11 tracker .
>>
>>
>> Currently our timeline is as follows:
>>
>> Code Freeze: 31-Oct-2022
>> RC : 30-Nov-2022
>> GA : 10-JAN-2023
>>
>> *Please evaluate the following and reply to this thread if you want to
>> convey anything important:*
>>
>> - Can we ensure to fulfill all the proposed requirements by the Code
>> Freeze?
>> - Do we need to add any more changes to accommodate any shortcomings or
>> improvements?
>> - Are we all good to go with the proposed timeline?
>>
>>
> We have delayed the release already by more than 1year, and that is a
> significant delay for any project. If the changes we work on is not getting
> released frequently, the feedback loop for the project is delayed and hence
> the further improvements. So, regardless of any pending promised things, we
> should go ahead with the code-freeze and release on these dates.
>
> It is crucial for any projects / companies dependent on the project to
> plan accordingly. There may be already few others who would have planned
> their product release around these dates. Lets keep the same dates, and try
> to achieve the tasks we have planned in these dates.
>

I agree. Pending changes will need to be added to the next release. Doing it
at the last minute is not safe for stability.

Xavi
---

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] New logging interface

2022-03-24 Thread Xavi Hernandez
Hi Strahil,

On Thu, Mar 24, 2022 at 8:26 PM Strahil Nikolov 
wrote:

> Hey Xavi,
>
> Did anyone measure performance behavior before and after the changes?
>

I haven't tested performance for this change, but I don't expect any
appreciable variation. The main reason to do it is to provide a simpler way
to create and use log messages that makes them more flexible and
consistent. It's especially useful when used with an editor that supports
code completion.

Given that I've rewritten a significant part of the code, I've taken the
opportunity to include some things that could have a minimal performance
benefit, but it's not the main reason.

Best regards,

Xavi


> Best Regards,
> Strahil Nikolov
>
> On Thu, Mar 24, 2022 at 20:33, Xavi Hernandez
>  wrote:
> Hi all,
>
> I've just posted a proposal for a new logging interface here:
> https://github.com/gluster/glusterfs/pull/3342
>
> There are many comments and the documentation is updated in the PR itself,
> so I won't duplicate all the info here. Please check it if you are
> interested in the details.
>
> As a summary, I think that the new interface is easier to use, more
> powerful, more flexible and more robust.
>
> Since it affects an interface used by every single component of Gluster I
> would like to have some more feedback before deciding whether we merge it
> or not. Feel free to comment here or in the PR itself.
>
> Thank you very much,
>
> Xavi
> ---
>
> Community Meeting Calendar:
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://meet.google.com/cpu-eiue-hvk
>
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
---

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



[Gluster-devel] New logging interface

2022-03-24 Thread Xavi Hernandez
Hi all,

I've just posted a proposal for a new logging interface here:
https://github.com/gluster/glusterfs/pull/3342

There are many comments and the documentation is updated in the PR itself,
so I won't duplicate all the info here. Please check it if you are
interested in the details.

As a summary, I think that the new interface is easier to use, more
powerful, more flexible and more robust.

Since it affects an interface used by every single component of Gluster I
would like to have some more feedback before deciding whether we merge it
or not. Feel free to comment here or in the PR itself.

Thank you very much,

Xavi
---

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] [Gluster-users] Fw: Distributed-Disperse Shard Behavior

2022-02-09 Thread Xavi Hernandez
Hi,

this problem is most likely caused by the XFS speculative preallocation (
https://linux-xfs.oss.sgi.narkive.com/jjjfnyI1/faq-xfs-speculative-preallocation
)
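
A quick way to check it on a brick is to compare the apparent file size with
the space actually allocated, right after the write and again a few minutes
later (untested sketch; the brick path and shard name are placeholders):

ls -l /bricks/brick1/.shard/<gfid>.1        # apparent size of one shard
du -k /bricks/brick1/.shard/<gfid>.1        # allocated size, includes XFS preallocation
xfs_bmap -v /bricks/brick1/.shard/<gfid>.1  # extent list; preallocated space shows up here
# if needed, preallocation can be bounded with an explicit allocsize mount option, e.g.:
#   mount -o allocsize=64k /dev/<brick-device> /bricks/brick1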

Regards,

Xavi

On Sat, Feb 5, 2022 at 10:19 AM Strahil Nikolov 
wrote:

> It seems quite odd.
> I'm adding the devel list,as it looks like a bug - but it could be a
> feature ;)
>
> Best Regards,
> Strahil Nikolov
>
>
> - Forwarded message -
> *From:* Fox 
> *To:* Gluster Users 
> *Sent:* Saturday, February 5, 2022, 05:39:36 GMT+2
> *Subject:* Re: [Gluster-users] Distributed-Disperse Shard Behavior
>
> I tried setting the shard size to 512MB. It slightly improved the space
> utilization during creation - not quite double space utilization. And I
> didn't run out of space creating a file that occupied 6gb of the 8gb volume
> (and I even tried 7168MB just fine). See attached command line log.
>
> On Fri, Feb 4, 2022 at 6:59 PM Strahil Nikolov 
> wrote:
>
> It sounds like a bug to me.
> In virtualization sharding is quite common (yet, on replica volumes) and I
> have never observed such behavior.
> Can you increase the shard size to 512M and check if the situation is
> better ?
> Also, share the volume info.
>
> Best Regards,
> Strahil Nikolov
>
> On Fri, Feb 4, 2022 at 22:32, Fox
>  wrote:
> Using gluster v10.1 and creating a Distributed-Dispersed volume with
> sharding enabled.
>
> I create a 2gb file on the volume using the 'dd' tool. The file size shows
> 2gb with 'ls'. However, 'df' shows 4gb of space utilized on the volume.
> After several minutes the volume utilization drops to the 2gb I would
> expect.
>
> This is repeatable for different large file sizes and different
> disperse/redundancy brick configurations.
>
> I've also encountered a situation, as configured above, where I utilize
> close to full disk capacity and am momentarily unable to delete the file.
>
> I have attached a command line log of an example of above using a set of
> test VMs setup in a glusterfs cluster.
>
> Is this initial 2x space utilization anticipated behavior for sharding?
>
> It would mean that I can never create a file bigger than half my volume
> size as I get an I/O error with no space left on disk.
> 
>
>
>
> Community Meeting Calendar:
>
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://meet.google.com/cpu-eiue-hvk
> Gluster-users mailing list
> gluster-us...@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
> 
>
>
>
> Community Meeting Calendar:
>
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://meet.google.com/cpu-eiue-hvk
> Gluster-users mailing list
> gluster-us...@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
> 
>
>
>
> Community Meeting Calendar:
>
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://meet.google.com/cpu-eiue-hvk
> Gluster-users mailing list
> gluster-us...@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
---

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] [Smallfile-Replica-3] Performance report for Gluster Upstream - 30/12/2021 Test Status: FAIL (-7.91%)

2021-12-29 Thread Xavi Hernandez
On Thu, Dec 30, 2021 at 5:50 AM Amar Tumballi  wrote:

> Any PR to suspect here?
>

The previous execution that passed was based on commit 12b44fe. This one is
based on commit b8e32c3. The only commit between them is b8e32c3, but it
seems unlikely that it affects non-SSL connections.

It seems more like an issue during the execution.


>
> On Thu, Dec 30, 2021 at 6:25 AM Gluster-jenkins <
> gluster-jenk...@redhat.com> wrote:
>
>> *Test details:*
>> RPM Location: Upstream
>> OS Version: Red-Hat-Enterprise-Linux 8.4-(Ootpa)
>> Baseline Gluster version: glusterfs-10.0-1
>> Current Gluster version: glusterfs-20211228.b8e32c3-0.0
>> Intermediate Gluster version: No intermediate baseline
>> Test type: Smallfile
>> Tool: smallfile
>> Volume type: Replica-3
>> Volume Option: No volume options configured
>> FOPs             Baseline   DailyBuild   Baseline vs DailyBuild (%)
>> create           15586      15791        1
>> ls-l             229506     228038       0
>> chmod            24545      20445        -16
>> stat             35376      25572        -27
>> read             29000      23453        -19
>> append           13850      10562        -23
>> rename           958        980          2
>> delete-renamed   22512      21956        -2
>> mkdir            3212       3204         0
>> rmdir            2691       2676         0
>> cleanup          9564       9181         -3
>> ---
>>
>> Community Meeting Calendar:
>> Schedule -
>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>> Bridge: https://meet.google.com/cpu-eiue-hvk
>>
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>>
>
> --
> --
> https://kadalu.io
> Container Storage made easy!
>
> ---
>
> Community Meeting Calendar:
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://meet.google.com/cpu-eiue-hvk
>
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
---

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] [PATCH] timer: fix ctx->timer memleak

2021-07-19 Thread Xavi Hernandez
Thanks for the patch. Could you send it to GitHub so that it can be
reviewed and merged using the regular procedure?

You can find more information about contributing to the project here:
https://docs.gluster.org/en/latest/Developer-guide/Developers-Index/
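
The usual flow is roughly this (only a sketch; the fork name, branch name and
patch file name are placeholders):

# sketch of the regular GitHub workflow; <user> is your fork
git clone git@github.com:<user>/glusterfs.git
cd glusterfs
git checkout -b timer-memleak-fix
git am 0001-timer-fix-ctx-timer-memleak.patch   # keeps your Signed-off-by
git push origin timer-memleak-fix
# then open a pull request against gluster/glusterfs and reference the related issue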

Xavi

On Fri, Jul 16, 2021 at 10:43 AM  wrote:

> From: Zqiang 
>
> If create timer thread failed, the 'ctx->timer' need
> to be released.
>
> Signed-off-by: Zqiang 
> ---
>  libglusterfs/src/timer.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/libglusterfs/src/timer.c b/libglusterfs/src/timer.c
> index 66c861b04c..2684d39667 100644
> --- a/libglusterfs/src/timer.c
> +++ b/libglusterfs/src/timer.c
> @@ -213,6 +213,12 @@ gf_timer_registry_init(glusterfs_ctx_t *ctx)
>  if (ret) {
>  gf_msg(THIS->name, GF_LOG_ERROR, ret, LG_MSG_PTHREAD_FAILED,
> "Thread creation failed");
> +        LOCK(&ctx->lock);
> +        reg = ctx->timer;
> +        ctx->timer = NULL;
> +        UNLOCK(&ctx->lock);
> +        GF_FREE(reg);
> +        reg = NULL;
>  }
>
>  out:
> --
> 2.25.1
>
>
---

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] [Gluster-users] high load when copy directory with many files

2021-04-21 Thread Xavi Hernandez
Hi Marco,

sorry for the late reply.

I've run some tests and I don't see any big difference between ls, stat and
getfattr. Can you provide more details about the tests you ran?

It would also help to provide profile info for each test:

To start profiling: gluster volume profile <volname> start
Before each test: gluster volume profile <volname> info clear
After the test: gluster volume profile <volname> info > /some/file
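
For example, for the getfattr case you described, something like this
(untested sketch; 'myvol' and the mount path are placeholders, and the loop
just mimics the 30-way parallel access):

gluster volume profile myvol start
gluster volume profile myvol info clear
for i in $(seq 30); do getfattr -d -m . /mnt/myvol/nonexistent-file & done; wait
gluster volume profile myvol info > /tmp/profile-getfattr.txt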

Regards,

Xavi

On Mon, Apr 12, 2021 at 9:01 AM Xavi Hernandez  wrote:

> On Sun, Apr 11, 2021 at 10:29 AM Amar Tumballi  wrote:
>
>> Hi Marco, this is really good test/info. Thanks.
>>
>> One more thing to observe is you are running such tests is 'gluster
>> profile info', so the bottleneck fop is listed.
>>
>> Mohit, Xavi, in this parallel operations, the load may be high due to
>> inodelk used in mds xattr update in dht? Or you guys suspect something else?
>>
>
> A profile info would be very useful to know which fop gets more requests.
> I think inodelk by itself shouldn't be an issue (I guess we are setting mds
> only once, right ?). In theory we shouldn't be sending any operation on an
> inode without a previous successful lookup, and in this case lookups should
> fail, so I don't clearly see what's the difference compared to an stat.
>
> We should investigate this. I'll try to do some experiments (not sure if
> this week, though).
>
> Regards,
>
> Xavi
>
>
>> Regards
>> Amar
>>
>> On Sat, 10 Apr, 2021, 11:45 pm Marco Lerda - FOREACH S.R.L., <
>> marco.le...@foreach.it> wrote:
>>
>>> hi,
>>> we have isolated the problem (meanwhile some hardware upgrade and code
>>> optimization helped to limit the problem).
>>> it happens when many request (HTTP over apache) comes to a non existent
>>> file.
>>> With 30 concurrent request to the same non existing file cause the load
>>> go high without limit.
>>> Same requests on existing files works fine.
>>> I have tried to simulate che apache access to file excluding apache with
>>> repeated command on files with the same parallelism (30):
>>> - with ls works fine, file exists or not
>>> - with stat works fine, file exists or not
>>> - with xattr load go up, file exists or not
>>>
>>> thank you
>>>
>>>
>>> Il 05/10/2020 19.45, Marco Lerda - FOREACH S.R.L. ha scritto:
>>> > hi,
>>> > we use glusterfs on a php application that have many small php files
>>> > images etc...
>>> > We use glusterfs in replication mode.
>>> > We have 2 nodes connected in fiber with 100MBps and less than 1 ms
>>> > latency.
>>> > We have also an arbiter on slower network (but the issue is there also
>>> > without the arbiter).
>>> > When we copy a directory (cp command) with many files, cpu usage and
>>> > load explode raplidly,
>>> > our application become inaccessible until the copy ends.
>>> >
>>> > I wonder if is that normal or we have done something wrong.
>>> > I know that glusterfs is not indicated with many small files, and I
>>> > know that it slow down,
>>> > but I want to avoid that a simple copy of a directory will put down
>>> > out application.
>>> >
>>> > Any suggestion?
>>> >
>>> > Thanks a lot
>>> >
>>> >
>>> >
>>> > 
>>> >
>>> >
>>> >
>>> > Community Meeting Calendar:
>>> >
>>> > Schedule -
>>> > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>>> > Bridge: https://bluejeans.com/441850968
>>> >
>>> > Gluster-users mailing list
>>> > gluster-us...@gluster.org
>>> > https://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>> --
>>>
>>> --
>>> Marco Lerda
>>> FOREACH S.R.L.
>>> Via Laghi di Avigliana 115, 12022 - Busca (CN)
>>> Telefono: 0171-1984102
>>> Centralino/Fax: 0171-1984100
>>> Email:  marco.le...@foreach.it
>>> Web: http://www.foreach.it
>>>
>>> 
>>>
>>>
>>>
>>> Community Meeting Calendar:
>>>
>>> Schedule -
>>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>>> Bridge: https://meet.google.com/cpu-eiue-hvk
>>> Gluster-users mailing list
>>> gluster-us...@gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>> ---
>>
>> Community Meeting Calendar:
>> Schedule -
>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>> Bridge: https://meet.google.com/cpu-eiue-hvk
>>
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>>
---

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] [Gluster-users] high load when copy directory with many files

2021-04-12 Thread Xavi Hernandez
On Sun, Apr 11, 2021 at 10:29 AM Amar Tumballi  wrote:

> Hi Marco, this is really good test/info. Thanks.
>
> One more thing to observe is you are running such tests is 'gluster
> profile info', so the bottleneck fop is listed.
>
> Mohit, Xavi, in this parallel operations, the load may be high due to
> inodelk used in mds xattr update in dht? Or you guys suspect something else?
>

The profile info would be very useful to know which fop gets the most
requests. I think inodelk by itself shouldn't be an issue (I guess we are
setting mds only once, right?). In theory we shouldn't be sending any
operation on an inode without a previous successful lookup, and in this case
lookups should fail, so I don't clearly see the difference compared to a stat.

We should investigate this. I'll try to do some experiments (not sure if
this week, though).

Regards,

Xavi


> Regards
> Amar
>
> On Sat, 10 Apr, 2021, 11:45 pm Marco Lerda - FOREACH S.R.L., <
> marco.le...@foreach.it> wrote:
>
>> hi,
>> we have isolated the problem (meanwhile some hardware upgrade and code
>> optimization helped to limit the problem).
>> it happens when many request (HTTP over apache) comes to a non existent
>> file.
>> With 30 concurrent request to the same non existing file cause the load
>> go high without limit.
>> Same requests on existing files works fine.
>> I have tried to simulate che apache access to file excluding apache with
>> repeated command on files with the same parallelism (30):
>> - with ls works fine, file exists or not
>> - with stat works fine, file exists or not
>> - with xattr load go up, file exists or not
>>
>> thank you
>>
>>
>> Il 05/10/2020 19.45, Marco Lerda - FOREACH S.R.L. ha scritto:
>> > hi,
>> > we use glusterfs on a php application that have many small php files
>> > images etc...
>> > We use glusterfs in replication mode.
>> > We have 2 nodes connected in fiber with 100MBps and less than 1 ms
>> > latency.
>> > We have also an arbiter on slower network (but the issue is there also
>> > without the arbiter).
>> > When we copy a directory (cp command) with many files, cpu usage and
>> > load explode raplidly,
>> > our application become inaccessible until the copy ends.
>> >
>> > I wonder if is that normal or we have done something wrong.
>> > I know that glusterfs is not indicated with many small files, and I
>> > know that it slow down,
>> > but I want to avoid that a simple copy of a directory will put down
>> > out application.
>> >
>> > Any suggestion?
>> >
>> > Thanks a lot
>> >
>> >
>> >
>> > 
>> >
>> >
>> >
>> > Community Meeting Calendar:
>> >
>> > Schedule -
>> > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>> > Bridge: https://bluejeans.com/441850968
>> >
>> > Gluster-users mailing list
>> > gluster-us...@gluster.org
>> > https://lists.gluster.org/mailman/listinfo/gluster-users
>>
>> --
>>
>> --
>> Marco Lerda
>> FOREACH S.R.L.
>> Via Laghi di Avigliana 115, 12022 - Busca (CN)
>> Telefono: 0171-1984102
>> Centralino/Fax: 0171-1984100
>> Email:  marco.le...@foreach.it
>> Web: http://www.foreach.it
>>
>> 
>>
>>
>>
>> Community Meeting Calendar:
>>
>> Schedule -
>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>> Bridge: https://meet.google.com/cpu-eiue-hvk
>> Gluster-users mailing list
>> gluster-us...@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>
> ---
>
> Community Meeting Calendar:
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://meet.google.com/cpu-eiue-hvk
>
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
---

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] Automatic clang-format for GitHub PRs

2021-02-14 Thread Xavi Hernandez
On Thu, Feb 11, 2021 at 5:50 PM Yaniv Kaul  wrote:

>
>
> On Thu, Feb 11, 2021 at 5:54 PM Amar Tumballi  wrote:
>
>>
>>
>> On Thu, 11 Feb, 2021, 9:19 pm Xavi Hernandez, 
>> wrote:
>>
>>> On Wed, Feb 10, 2021 at 1:33 PM Amar Tumballi  wrote:
>>>
>>>>
>>>>
>>>> On Wed, Feb 10, 2021 at 3:29 PM Xavi Hernandez 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm wondering if enforcing clang-format for all patches is a good
>>>>> idea...
>>>>>
>>>>> I've recently seen patches where clang-format is doing changes on
>>>>> parts of the code that have not been touched by the patch. Given that all
>>>>> files were already formatted by clang-format long ago, this shouldn't
>>>>> happen.
>>>>>
>>>>> This means that as the clang-format version evolves, the formatting
>>>>> with the same configuration is not the same. This introduces unnecessary
>>>>> noise to the file history that I think it should be avoided.
>>>>>
>>>>> Additionally, I've also seen some cases where some constructs are
>>>>> reformatted in an uglier or less clear way. I think it's very hard to come
>>>>> up with a set of rules that formats everything in the best possible way.
>>>>>
>>>>> For all these reasons, I would say we shouldn't enforce clang-format
>>>>> to accept a PR. I think it's a good test to run to catch some clear
>>>>> formatting issues, but it shouldn't vote for patch acceptance.
>>>>>
>>>>> What do you think ?
>>>>>
>>>>>
>>>> One thing I have noticed is, as long as some test is 'skipped', no one
>>>> bothers to check. It would be great if the whole diff (in case of failure)
>>>> is posted as a comment, so we can consider that while merging. I would
>>>> request one to invest time on posting the failure message as a comment back
>>>> into issue from jenkins if possible, and later implement skip behavior.
>>>> Otherwise, considering we have >10 people having ability to merge patches,
>>>> many people may miss having a look on clang-format issues.
>>>>
>>>
>>> I agree that it could be hard to enforce some rules, but what I'm seeing
>>> lately is that the clang-format version from Fedora 33 doesn't format the
>>> code the same way as a previous version with the same configuration in some
>>> cases (this also seems to happen with much older versions). This causes
>>> failures in the clang check that need manual modifications to update the
>>> patches.
>>>
>>
>> Ok, let's get moving with actual work than syntaxes. Ok with skipping!
>>
>
Before skipping I think it would be interesting to see if your idea of
posting the result of the clang-format test as a review comment with the
suggested changes is possible. It would be very easy then to check if they
make sense or not before merging.

@Deepshikha Khandelwal, do you know if it's possible?


> If we could run a specific version within a container...
>

Even if we run it inside a container, sooner or later that container will
need to be upgraded to newer versions of software and libraries. When
clang-format is upgraded, patches will start modifying things that the
author didn't really touch, adding unnecessary and undocumented changes to
the gluster history. Additionally, it's not possible to automatically
format everything in the best possible way because in some cases one format
will be better than another (for example for readability), but in some
other cases the same code structure will be better represented in another
way.

We have the possibility of disabling clang-format in specific parts of the
code via a special comment, but I'm not sure if it's the right solution
either.

Regards,

Xavi

Y.
>
>>
>>
>>> Xavi
>>>
>> ---
>>
>> Community Meeting Calendar:
>> Schedule -
>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>> Bridge: https://meet.google.com/cpu-eiue-hvk
>>
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>>
---

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] Automatic clang-format for GitHub PRs

2021-02-11 Thread Xavi Hernandez
On Wed, Feb 10, 2021 at 1:33 PM Amar Tumballi  wrote:

>
>
> On Wed, Feb 10, 2021 at 3:29 PM Xavi Hernandez 
> wrote:
>
>> Hi all,
>>
>> I'm wondering if enforcing clang-format for all patches is a good idea...
>>
>> I've recently seen patches where clang-format is doing changes on parts
>> of the code that have not been touched by the patch. Given that all files
>> were already formatted by clang-format long ago, this shouldn't happen.
>>
>> This means that as the clang-format version evolves, the formatting with
>> the same configuration is not the same. This introduces unnecessary noise
>> to the file history that I think it should be avoided.
>>
>> Additionally, I've also seen some cases where some constructs are
>> reformatted in an uglier or less clear way. I think it's very hard to come
>> up with a set of rules that formats everything in the best possible way.
>>
>> For all these reasons, I would say we shouldn't enforce clang-format to
>> accept a PR. I think it's a good test to run to catch some clear formatting
>> issues, but it shouldn't vote for patch acceptance.
>>
>> What do you think ?
>>
>>
> One thing I have noticed is, as long as some test is 'skipped', no one
> bothers to check. It would be great if the whole diff (in case of failure)
> is posted as a comment, so we can consider that while merging. I would
> request one to invest time on posting the failure message as a comment back
> into issue from jenkins if possible, and later implement skip behavior.
> Otherwise, considering we have >10 people having ability to merge patches,
> many people may miss having a look on clang-format issues.
>

I agree that it could be hard to enforce some rules, but what I'm seeing
lately is that the clang-format version from Fedora 33 doesn't format the
code the same way as a previous version with the same configuration in some
cases (this also seems to happen with much older versions). This causes
failures in the clang check that need manual modifications to update the
patches.

Xavi
---

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



[Gluster-devel] Automatic clang-format for GitHub PRs

2021-02-10 Thread Xavi Hernandez
Hi all,

I'm wondering if enforcing clang-format for all patches is a good idea...

I've recently seen patches where clang-format is doing changes on parts of
the code that have not been touched by the patch. Given that all files were
already formatted by clang-format long ago, this shouldn't happen.

This means that as the clang-format version evolves, the formatting with
the same configuration is not the same. This introduces unnecessary noise
to the file history that I think it should be avoided.

Additionally, I've also seen some cases where some constructs are
reformatted in an uglier or less clear way. I think it's very hard to come
up with a set of rules that formats everything in the best possible way.

For all these reasons, I would say we shouldn't enforce clang-format to
accept a PR. I think it's a good test to run to catch some clear formatting
issues, but it shouldn't vote for patch acceptance.

What do you think ?

Regards,

Xavi
---

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] Pull Request review workflow

2020-10-15 Thread Xavi Hernandez
If everyone agrees, I'll prepare a PR with the changes in rfc.sh and
documentation to implement this change.

Xavi

On Thu, Oct 15, 2020 at 1:27 PM Ravishankar N 
wrote:

>
> On 15/10/20 4:36 pm, Sheetal Pamecha wrote:
>
>
> +1
> Just a note to the maintainers who are merging PRs to have patience and
> check the commit message when there are more than 1 commits in PR.
>
> Makes sense.
>
>
>>
>> Another thing to consider is that rfc.sh script always does a rebase
>>> before pushing changes. This rewrites history and changes all commits of a
>>> PR. I think we shouldn't do a rebase in rfc.sh. Only if there are
>>> conflicts, I would do a manual rebase and push the changes.
>>>
>>>
>>
>> I think we would also need to rebase if say some .t failure was fixed and
> we need to submit the PR on top of that, unless "run regression" always
> applies your PR on the latest HEAD in the concerned branch and triggers the
> regression.
>
>
> Actually True, Since the migration to github. I have not been using
> ./rfc.sh and For me it's easier and cleaner.
>
>
> Me as well :)
> -Ravi
> ___
>
> Community Meeting Calendar:
>
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://bluejeans.com/441850968
>
>
>
>
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
___

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968




Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] Pull Request review workflow

2020-10-15 Thread Xavi Hernandez
Hi Ravi,

On Thu, Oct 15, 2020 at 1:27 PM Ravishankar N 
wrote:

>
> On 15/10/20 4:36 pm, Sheetal Pamecha wrote:
>
>
> +1
> Just a note to the maintainers who are merging PRs to have patience and
> check the commit message when there are more than 1 commits in PR.
>
> Makes sense.
>
>
>>
>> Another thing to consider is that rfc.sh script always does a rebase
>>> before pushing changes. This rewrites history and changes all commits of a
>>> PR. I think we shouldn't do a rebase in rfc.sh. Only if there are
>>> conflicts, I would do a manual rebase and push the changes.
>>>
>>>
>>
>> I think we would also need to rebase if say some .t failure was fixed and
> we need to submit the PR on top of that, unless "run regression" always
> applies your PR on the latest HEAD in the concerned branch and triggers the
> regression.
>

Yes, I agree that sometimes we need a rebase, but I would do that only if
necessary by running a manual 'git rebase'.

I don't think we can do an automatic rebase before running a regression,
because there could be conflicts that cannot be fixed automatically.

Xavi

>
>
> Actually True, Since the migration to github. I have not been using
> ./rfc.sh and For me it's easier and cleaner.
>
>
> Me as well :)
> -Ravi
> ___
>
> Community Meeting Calendar:
>
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://bluejeans.com/441850968
>
>
>
>
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
___

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968




Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



[Gluster-devel] Pull Request review workflow

2020-10-15 Thread Xavi Hernandez
Hi all,

after the recent switch to GitHub, I've seen that reviews that require
multiple iterations are hard to follow using the old workflow we were using
in Gerrit.

Till now we basically amended the commit and pushed it again. Gerrit had a
feature to calculate diffs between versions of the patch, so it was
relatively easy to follow the changes between iterations (unless there was
a big change in the base branch and the patch was rebased).

In GitHub we don't have this feature (at least I haven't seen it). So I'm
proposing to change this workflow.

The idea is to create a PR with the initial commit. When a modification
needs to be done as a result of the review, instead of amending the
existing commit, we should create a new commit. From the review tool in
GitHub it's very easy to check individual commits.

Once the review is finished, the patch will be merged with the "Squash and
Merge" option, that will combine all the commits into a single one before
merging, so the end result will be exactly the same we had with Gerrit.
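
In practice the difference for the author is small; something like this
(just a sketch, the branch name is only an example):

# old workflow (Gerrit style): rewrite the same commit on every iteration
git commit --amend
git push -f origin my-change

# proposed workflow: keep review fixes as separate commits
git commit -m "address review comments"
git push origin my-change
# the PR is merged with "Squash and Merge", so the final history still gets a single commit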

Another thing to consider is that rfc.sh script always does a rebase before
pushing changes. This rewrites history and changes all commits of a PR. I
think we shouldn't do a rebase in rfc.sh. Only if there are conflicts, I
would do a manual rebase and push the changes.

What do you think ?

Regards,

Xavi
___

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968




Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] Weird full heal on Distributed-Disperse volume with sharding

2020-09-30 Thread Xavi Hernandez
Hi Dmitry,

On Wed, Sep 30, 2020 at 9:21 AM Dmitry Antipov  wrote:

> On 9/30/20 8:58 AM, Xavi Hernandez wrote:
>
> > This is normal. A dispersed volume writes encoded fragments of each
> block in each brick. In this case it's a 2+1 configuration, so each block
> is divided into 2 fragments. A third fragment is generated
> > for redundancy and stored on the third brick.
>
> OK. But for Distributed-Replicate 2 x 3 setup and 64K shards, 4M file
> should be split into (4096 / 64) * 3 = 192 shards, not 189. So why 189?
>

In fact, there aren't 189 shards. There are 63 shards replicated 3 times
each. The shard 0 is not inside the .shard directory. It's placed in the
directory where the file was created. So there are a total of 64 chunks of
64 KiB = 4 MiB.
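
You can verify it on your test setup with something like this (untested
sketch, using the /tmp/ram paths from your mail):

find /tmp/ram/{0..5}/.shard -type f -printf '%f\n' | sort -u | wc -l   # distinct shards: 63
ls -l /tmp/ram/*/file0 2>/dev/null   # shard 0 is the base file itself, outside .shard
echo $((4096 / 64))                  # 64 chunks of 64 KiB make up the 4 MiB file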


> And if all bricks are considered equal and has enough amount of free
> space, shards distribution {24, 24, 24, 39, 39, 39} looks suboptimal.
>

Shards are distributed exactly the same way as regular files. This means that
they are balanced based on a random distribution (with some correction when
free space is not equal, but that is irrelevant now). Random distributions
tend to balance the number of files very well, but only with a big number of
files. Statistics on a small number of files may be biased.

If you keep adding new files to the volume, the balance will improve.


> Why not {31, 32, 31, 32, 31, 32}? Isn't it a bug?
>

This can't happen. When you create a 2 x 3 replicated volume, you are
creating 2 independent replica 3 subvolumes. The first replica set is
composed of the first 3 bricks, and the second of the last 3. The
distribution layer chooses on which replica set to put each file.

It's not a bug. It's by design. Gluster can work with multiple clients
creating files simultaneously. To force a perfect distribution, all of them
would have to synchronize to decide where to create each file. This would
have a significant performance impact. Instead of that, distribution is
done randomly, which allows each client to work independently and it will
balance files pretty well in the long term.


> > This is not right. A disperse 2+1 configuration only supports a single
> failure. Wiping 2 fragments from the same file makes the file
> unrecoverable. Disperse works using the Reed-Solomon erasure code,
> > which requires at least 2 healthy fragments to recover the data (in a
> 2+1 configuration).
>
> It seems that I missed the point that all bricks are considered equal,
> regardless of the physical host they're attached to.
>

All bricks are considered equal inside a single replica/disperse set. A 2 x
(2 + 1) configuration has 2 independent disperse sets, so only one brick
from each of them may fail without data loss. If you want to support any 2
brick failures, you need to use a 1 x (4 + 2) configuration. In this case
there's a single disperse set which tolerates up to 2 brick failures.


>
> So, for the Distributed-Disperse 2 x (2 + 1) setup with 3 hosts, 2 bricks
> per each, and two files, A and B, it's possible to have
> the following layout:
>
> Host0:  Host1:  Host2:
> |- Brick0: A0 B0|- Brick0: A1   |- Brick0: A2
> |- Brick1: B1   |- Brick1: B2   |- Brick1:
>

No, this won't happen. A single file will go either to brick0 of all hosts
or brick1 of all hosts. They won't be mixed.


> This setup can tolerate single brick failure but not single host failure
> because if Host0 is down, two fragments of B will be lost
> and so B becomes unrecoverable (but A is not).
>
> If this is so, is it possible/hard to enforce 'one fragment per *host*'
> behavior? If we can guarantee the following:
>
> Host0:  Host1:  Host2:
> |- Brick0: A0   |- Brick0: A1   |- Brick0: A2
> |- Brick1: B1   |- Brick1: B2   |- Brick1: B0
>

This is how it currently works. You only need to take care of creating the
volume with the bricks in the right order. In this case the order should be
H0/B0, H1/B0, H2/B0, H0/B1, H1/B1, H2/B1. Anyway, if you create the volume
using an incorrect order and two bricks of the same disperse set are placed
on the same host, the operation will complain about it. This will only be
accepted by gluster if you create the volume with the 'force' option.
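
For example, a 2 x (2 + 1) volume with one fragment per host in each set
would be created with something like this (sketch only; hostnames and brick
paths are placeholders, and gluster groups each consecutive set of 3 bricks
into one disperse set):

gluster volume create test disperse 3 redundancy 1 \
    host0:/bricks/b0 host1:/bricks/b0 host2:/bricks/b0 \
    host0:/bricks/b1 host1:/bricks/b1 host2:/bricks/b1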

Regards,

Xavi


> this setup can tolerate both single brick and single host failures.
>
> Dmitry
>
>
___

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968




Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] Weird full heal on Distributed-Disperse volume with sharding

2020-09-30 Thread Xavi Hernandez
Hi Dmitry,

my comments below...

On Tue, Sep 29, 2020 at 11:19 AM Dmitry Antipov  wrote:

> For the testing purposes, I've set up a localhost-only setup with 6x16M
> ramdisks (formatted as ext4) mounted (with '-o user_xattr') at
> /tmp/ram/{0,1,2,3,4,5} and SHARD_MIN_BLOCK_SIZE lowered to 4K. Finally
> the volume is:
>
> Volume Name: test
> Type: Distributed-Replicate
> Volume ID: 241d6679-7cd7-48b4-bdc5-8bc1c9940ac3
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2 x 3 = 6
> Transport-type: tcp
> Bricks:
> Brick1: [local-ip]:/tmp/ram/0
> Brick2: [local-ip]:/tmp/ram/1
> Brick3: [local-ip]:/tmp/ram/2
> Brick4: [local-ip]:/tmp/ram/3
> Brick5: [local-ip]:/tmp/ram/4
> Brick6: [local-ip]:/tmp/ram/5
> Options Reconfigured:
> features.shard-block-size: 64KB
> features.shard: on
> storage.fips-mode-rchecksum: on
> transport.address-family: inet
> nfs.disable: on
> performance.client-io-threads: off
>
> Then I mount it under /mnt/test:
>
> # mount -t glusterfs [local-ip]:/test /mnt/test
>
> and create 4M file on it:
>
> # dd if=/dev/random of=/mnt/test/file0 bs=1M count=4
>
> This creates 189 shards of 64K each, in /tmp/ram/?/.shard:
>
> /tmp/ram/0/.shard: 24
> /tmp/ram/1/.shard: 24
> /tmp/ram/2/.shard: 24
> /tmp/ram/3/.shard: 39
> /tmp/ram/4/.shard: 39
> /tmp/ram/5/.shard: 39
>
> To simulate data loss I just remove 2 arbitrary .shard directories,
> for example:
>
> # rm -rfv /tmp/ram/0/.shard /tmp/ram/5/.shard
>
> Finally, I do full heal:
>
> # gluster volume heal test full
>
> and successfully got all shards under /tmp/ram/{0,5}.shard back.
>
> But the things seems going weird for the following volume:
>
> Volume Name: test
> Type: Distributed-Disperse
> Volume ID: aa621c7e-1693-427a-9fd5-d7b38c27035e
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2 x (2 + 1) = 6
> Transport-type: tcp
> Bricks:
> Brick1: [local-ip]:/tmp/ram/0
> Brick2: [local-ip]:/tmp/ram/1
> Brick3: [local-ip]:/tmp/ram/2
> Brick4: [local-ip]:/tmp/ram/3
> Brick5: [local-ip]:/tmp/ram/4
> Brick6: [local-ip]:/tmp/ram/5
> Options Reconfigured:
> features.shard: on
> features.shard-block-size: 64KB
> storage.fips-mode-rchecksum: on
> transport.address-family: inet
> nfs.disable: on
>
> After creating 4M file as before, I've got the same 189 shards
> but 32K each.


This is normal. A dispersed volume writes an encoded fragment of each block
to each brick. In this case it's a 2+1 configuration, so each block is
divided into 2 fragments. A third fragment is generated for redundancy and
stored on the third brick.


> After deleting /tmp/ram/{0,5}/.shard and full heal,
> I was able to get all shards back. But, after deleting
> /tmp/ram/{3,4}/.shard and full heal, I've ended up with the following:
>

This is not right. A disperse 2+1 configuration only supports a single
failure. Wiping 2 fragments from the same file makes the file
unrecoverable. Disperse works using the Reed-Solomon erasure code, which
requires at least 2 healthy fragments to recover the data (in a 2+1
configuration).

If you want to be able to recover from 2 disk failures, you need to create
a 4+2 configuration.

To make it more clear: a 2+1 configuration is like a traditional RAID5 with
3 disks. If you lose 2 disks, data is lost. A 4+2 is similar to a RAID6.
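
With the same six ramdisk bricks from your test, the 4+2 equivalent would be
created with something like this (just a sketch; 'force' is needed because
all bricks are on the same node):

gluster volume create test disperse 6 redundancy 2 \
    [local-ip]:/tmp/ram/0 [local-ip]:/tmp/ram/1 [local-ip]:/tmp/ram/2 \
    [local-ip]:/tmp/ram/3 [local-ip]:/tmp/ram/4 [local-ip]:/tmp/ram/5 force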

Regards,

Xavi


> /tmp/ram/0/.shard:
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.10
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.11
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.12
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.13
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.14
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.15
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.16
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.17
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.2
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.22
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.23
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.27
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.28
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.3
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.31
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.34
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.35
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.37
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 951d7c52-7230-420b-b8bb-da887fffd41e.39
> -rw-r--r-- 2 root root 32768 Sep 29 12:01
> 

Re: [Gluster-devel] heal info output

2020-07-06 Thread Xavi Hernandez
Hi Emmanuel,

On Thu, Jul 2, 2020 at 3:05 AM Emmanuel Dreyfus  wrote:

> Hello
>
> gluster volume heal info show me questionable entries. I wonder if these
> are bugs, or if I shoud handle them and how.
>
> bidon# gluster volume heal gfs info
> Brick bidon:/export/wd0e_tmp
> Status: Connected
> Number of entries: 0
>
> Brick baril:/export/wd0e
> /.attribute/system
> 
> Status: Connected
> Number of entries: 2
>
> (...)
> Brick bidon:/export/wd2e
> 
> 
> /owncloud/data
> 
> 
> 
> 
>
> There are three cases:
> 1) /.attribute directory is special on NetBSD, it is where extended
> attributes are stored for the filesystem. The posix xlator takes care of
> screening it, but there must be some other softrware component that
> should learn it must disregeard it. Hints are welcome about where I
> should look at.
>

Is the '.attribute' directory only present in the root directory of a
filesystem? If so, I strongly recommend never using the root of a
filesystem to place bricks. Always place the brick in a subdirectory.


> 2) /owncloud/data  is a directory. mode, owner and groups are the same
> on bricks. Why is it listed here?
>

If files or subdirectories have been created or removed from that directory
and the operation failed on some brick (or the brick was down), the
directory itself is also marked as needing heal. You should also compare
its contents across bricks, not just its mode, owner and group.
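
A quick way to compare the contents is to look directly at the bricks
(paths here are taken from your output; the getfattr call is only a hint
of where the replication metadata, the trusted.afr.* keys, can be seen):

# ls -la /export/wd2e/owncloud/data          (run on each brick of the replica)
# getfattr -d -m . -e hex /export/wd2e/owncloud/data

If an entry is missing on some brick, or the trusted.afr.* values differ,
that explains why the directory shows up in heal info.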


> 3)  What should I do with this?
>

These are files or directories whose real path is not known. If the
gfid2path feature is enabled, you can check the
trusted.gfid2path.xx xattr on the gfid entry. It contains the gfid of the
parent directory and the file name. The full path can then be retrieved by
following the directory symlinks or by using the gfid-to-dirname.sh script
in the extras directory.

If gfid2path is not enabled, I fear that finding them will need to be done
by brute force:

1. Get the inode number of one of the gfid entries on one brick.
2. Run 'find <brick root> -inum <inode number>'

Once you find the entries, if you do a 'stat' on them through the mount
point of the volume, the next "gluster volume heal info" should show the
real path of the files instead of their gfid.
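
A minimal sketch of that brute force approach (the brick path and the gfid
are just hypothetical examples; .glusterfs/<xx>/<yy>/<gfid> is where bricks
normally keep the gfid hard links):

# gfid=11111111-2222-3333-4444-555555555555
# ino=$(stat -c %i /export/wd2e/.glusterfs/11/11/$gfid)
# find /export/wd2e -inum "$ino" -not -path '*/.glusterfs/*'

For directories the .glusterfs entry is a symlink instead of a hard link,
but stat follows it by default, so the same trick should work.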

Regards,

Xavi

-- 
> Emmanuel Dreyfus
> http://hcpnet.free.fr/pubz
> m...@netbsd.org
> ___
>
> Community Meeting Calendar:
>
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://bluejeans.com/441850968
>
>
>
>
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
___

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968




Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] [Gluster-users] Minutes of Gluster Community Meeting [12th May 2020]

2020-05-18 Thread Xavi Hernandez
Hi Sankarshan,

On Sat, May 16, 2020 at 9:15 AM sankarshan  wrote:

> On Fri, 15 May 2020 at 10:59, Hari Gowtham  wrote:
>
> > ### User stories
> > * [Hari] users are hesitant to upgrade. A good number of issues in
> release-7 (crashes, flooding of logs, self heal) Need to look into this.
> > * [Sunil] Increase in inode size
> https://lists.gluster.org/pipermail/gluster-users/2020-May/038196.html
> Looks like it can have perf benefit.
> >
>
> Is there work underway to ascertain if there are indeed any
> performance related benefits? What are the kind of tests which would
> be appropriate?
>

Rinku has done some tests downstream to validate that the change doesn't
cause any performance regression. Initial results don't show any regression
at all, and the change even provides a significant benefit for 'ls -l' and
'unlink' workloads. I'm not sure yet why this happens, as the xattrs for
these tests should already fit inside 512-byte inodes, so no significant
differences were expected.

The real benefit would be with volumes that use at least geo-replication or
quotas. In that case the xattrs may not fit inside 512-byte inodes, so
1024-byte inodes will reduce the number of disk requests when xattr data
is not cached (and it's not always cached, even if the inode is in cache).
This testing is pending.

From the functional point of view, we also need to test that bigger inodes
don't cause weird inode allocation problems when available space is small.
XFS allocates inodes in contiguous chunks on disk, so it could happen that
even though there's apparently enough space on disk, an inode cannot be
allocated due to fragmentation. Given that the inode size is bigger, the
required chunk will also be bigger, which could make this problem worse. We
should try to fill a volume with small files (with fsync per file and
without it) and see if we get ENOSPC errors much earlier than expected.
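
A rough sketch of such a test (device, mount point and file size are
assumptions that need to be adapted to the real environment):

# mkfs.xfs -f -i size=1024 /dev/sdX1
# mount /dev/sdX1 /bricks/test
  (create a volume using /bricks/test as brick and mount it on /mnt/test)
# i=0
# while dd if=/dev/zero of=/mnt/test/file$i bs=4k count=1 conv=fsync; do
      i=$((i+1))
  done
# df -h /mnt/test

The interesting part is how much free space df still reports when the dd
loop starts failing with ENOSPC. Running it with and without conv=fsync
covers both variants mentioned above.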

Any help validating our results or doing the remaining tests would be
appreciated.

Regards,

Xavi


>
> > * Any release updates?
> > * 6.9 is done and announced
> > * [Sunil]can we take this in for release-8:
> https://review.gluster.org/#/c/glusterfs/+/24396/
> > * [Rinku]Yes, we need to ask the patch owners to port this to
> release8 post merging it to master. Till the time we tag release8 this is
> possible post this it will be difficult, after which we can put it in
> release8.1
> > * [Csaba] This is necessary as well
> https://review.gluster.org/#/c/glusterfs/+/24415/
> > * [Rinku] We need release notes to be reviewed and merged release8
> is blocked due to this. https://review.gluster.org/#/c/glusterfs/+/24372/
>
> Have the small set of questions on the notes been addressed? Also, do
> we have plans to move this workflow over to GitHub issues? In other
> words, how long are we planning to continue to work with dual systems?
>
>
> > ### RoundTable
> > * [Sunil] Do we support cento8 and gluster?
> > * [sankarshan] Please highlight the concerns on the mailing list.
> The developers who do the manual testing can review and provide their
> assessment on where the project stands
> > * We do have packages, how are we testing it?
> > * [Sunil] Centos8 regression is having issues and are not being used for
> regression testing.
> > * [Hari] For packages, Shwetha and Sheetal are manually testing the bits
> with centos8. Basics works fine. But this testing isn't enough
> > * send out a mail to sort this out
>
> I am guessing that this was on Sunil to send out the note to the list.
> Will be looking forward to that.
>
> > * [Amar] Kadalu 0.7 release based on GlusterFS 7.5 has been recently
> released (Release Blog: https://kadalu.io/blog/kadalu-storage-0.7)
> > * [Rinku] How to test
> > * [Aravinda]
> https://kadalu.io/docs/k8s-storage/latest/quick-start
>
>
>
>
> --
> sankars...@kadalu.io | TZ: UTC+0530 | +91 99606 03294
> kadalu.io : Making it easy to provision storage in k8s!
> ___
>
> Community Meeting Calendar:
>
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://bluejeans.com/441850968
>
>
>
>
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
___

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968




Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] What do extra_free and extrastd_free params do in the dictionary object?

2020-01-13 Thread Xavi Hernandez
On Mon, Jan 13, 2020 at 9:41 AM Yaniv Kaul  wrote:

> To conclude and ensure I understood both:
> - I've sent a patch to remove extra_free parameter (and the associated
> gf_memdup() in 2 places) -
> https://review.gluster.org/#/c/glusterfs/+/23999/ (waiting for CI to
> finish working on it).
> - I'll send a patch to remove extra_stdfree - but this one is slightly
> bigger:
> 1. There are more places it's set (18).
> 2. Somehow, we need to release that memory. I kinda assume the sooner the
> better?
> So perhaps I can just replace:
> dict->extra_stdfree = cli_req.dict.dict_val;
> with:
> free(cli_req.dict.dict_val) ?
>

Yes. I think this is the best and easiest approach.


> 3. In some places, there was actually a call to GF_FREE, which is kind of
> confusing. For example, in __server_get_snap_info() :
> if (snap_info_rsp.dict.dict_val) {
> GF_FREE(snap_info_rsp.dict.dict_val);
> }
>

This seems like a bug. Additionally, this memory should be released using
free() instead of GF_FREE().


>
> I think I should remove that and stick to freeing right after
> unserialization?
>

Yes. I agree.

Regards,

Xavi


>
> On Thu, Jan 9, 2020 at 12:42 PM Xavi Hernandez 
> wrote:
>
>> On Thu, Jan 9, 2020 at 11:11 AM Yaniv Kaul  wrote:
>>
>>>
>>>
>>> On Thu, Jan 9, 2020 at 11:35 AM Xavi Hernandez 
>>> wrote:
>>>
>>>> On Thu, Jan 9, 2020 at 10:22 AM Amar Tumballi  wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 9, 2020 at 2:33 PM Xavi Hernandez 
>>>>> wrote:
>>>>>
>>>>>> On Thu, Jan 9, 2020 at 9:44 AM Amar Tumballi 
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 9, 2020 at 1:38 PM Xavi Hernandez 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Sun, Dec 22, 2019 at 4:56 PM Yaniv Kaul 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I could not find a relevant use for them. Can anyone enlighten me?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I'm not sure why they are needed. They seem to be used to keep the
>>>>>>>> unserialized version of a dict around until the dict is destroyed. I
>>>>>>>> thought this could be because we were using pointers to the 
>>>>>>>> unserialized
>>>>>>>> data inside dict, but that's not the case currently. However, checking 
>>>>>>>> very
>>>>>>>> old versions (pre 3.2), I see that dict values were not allocated, but 
>>>>>>>> a
>>>>>>>> pointer to the unserialized data was used.
>>>>>>>>
>>>>>>>
>>>>>>> Xavi,
>>>>>>>
>>>>>>> While you are right about the intent, it is used still, at least
>>>>>>> when I grepped latest repo to keep a reference in protocol layer.
>>>>>>>
>>>>>>> This is done to reduce a copy after the dictionary's binary content
>>>>>>> is received from RPC. The 'extra_free' flag is used when we use a
>>>>>>> GF_*ALLOC()'d buffer in protocol to receive dictionary, and 
>>>>>>> extra_stdfree
>>>>>>> is used when RPC itself allocates the buffer and hence uses 'free()' to
>>>>>>> free the buffer.
>>>>>>>
>>>>>>
>>>>>> I don't see it. When dict_unserialize() is called, key and value are
>>>>>> allocated and copied, so  why do we need to keep the raw data after that 
>>>>>> ?
>>>>>>
>>>>>> In 3.1 the value was simply a pointer to the unserialized data, but
>>>>>> starting with 3.2, value is memdup'ed. Key is always copied. I don't see
>>>>>> any other reference to the unserialized data right now. I think that
>>>>>> instead of assigning the raw data to extra_(std)free, we should simply
>>>>>> release that memory and remove those fields.
>>>>>>
>>>>>> Am I missing something else ?
>>>>>>
>>>>>
>>>>> I did grep on 'extra_stdfree' and 'extra_free' and saw that many
>>>>> handshake/ and protocol code seemed to use it. Haven't gone deeper to 
>>>>> check
>>>&

Re: [Gluster-devel] What do extra_free and extrastd_free params do in the dictionary object?

2020-01-09 Thread Xavi Hernandez
On Thu, Jan 9, 2020 at 11:11 AM Yaniv Kaul  wrote:

>
>
> On Thu, Jan 9, 2020 at 11:35 AM Xavi Hernandez 
> wrote:
>
>> On Thu, Jan 9, 2020 at 10:22 AM Amar Tumballi  wrote:
>>
>>>
>>>
>>> On Thu, Jan 9, 2020 at 2:33 PM Xavi Hernandez 
>>> wrote:
>>>
>>>> On Thu, Jan 9, 2020 at 9:44 AM Amar Tumballi  wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 9, 2020 at 1:38 PM Xavi Hernandez 
>>>>> wrote:
>>>>>
>>>>>> On Sun, Dec 22, 2019 at 4:56 PM Yaniv Kaul  wrote:
>>>>>>
>>>>>>> I could not find a relevant use for them. Can anyone enlighten me?
>>>>>>>
>>>>>>
>>>>>> I'm not sure why they are needed. They seem to be used to keep the
>>>>>> unserialized version of a dict around until the dict is destroyed. I
>>>>>> thought this could be because we were using pointers to the unserialized
>>>>>> data inside dict, but that's not the case currently. However, checking 
>>>>>> very
>>>>>> old versions (pre 3.2), I see that dict values were not allocated, but a
>>>>>> pointer to the unserialized data was used.
>>>>>>
>>>>>
>>>>> Xavi,
>>>>>
>>>>> While you are right about the intent, it is used still, at least when
>>>>> I grepped latest repo to keep a reference in protocol layer.
>>>>>
>>>>> This is done to reduce a copy after the dictionary's binary content is
>>>>> received from RPC. The 'extra_free' flag is used when we use a
>>>>> GF_*ALLOC()'d buffer in protocol to receive dictionary, and extra_stdfree
>>>>> is used when RPC itself allocates the buffer and hence uses 'free()' to
>>>>> free the buffer.
>>>>>
>>>>
>>>> I don't see it. When dict_unserialize() is called, key and value are
>>>> allocated and copied, so  why do we need to keep the raw data after that ?
>>>>
>>>> In 3.1 the value was simply a pointer to the unserialized data, but
>>>> starting with 3.2, value is memdup'ed. Key is always copied. I don't see
>>>> any other reference to the unserialized data right now. I think that
>>>> instead of assigning the raw data to extra_(std)free, we should simply
>>>> release that memory and remove those fields.
>>>>
>>>> Am I missing something else ?
>>>>
>>>
>>> I did grep on 'extra_stdfree' and 'extra_free' and saw that many
>>> handshake/ and protocol code seemed to use it. Haven't gone deeper to check
>>> which part.
>>>
>>> [amar@kadalu glusterfs]$ git grep extra_stdfree | wc -l
>>> 40
>>> [amar@kadalu glusterfs]$ git grep extra_free | wc -l
>>> 5
>>>
>>
>> Yes, they call dict_unserialize() and then store the unserialized data
>> into those variables. That's what I'm saying it's not necessary.
>>
>
> In at least 2 cases, there's even something stranger I could not
> understand (see in bold - from server_setvolume() function) :
> *params = dict_new();*
> reply = dict_new();
> ret = xdr_to_generic(req->msg[0], ,
> (xdrproc_t)xdr_gf_setvolume_req);
> if (ret < 0) {
> // failed to decode msg;
> req->rpc_err = GARBAGE_ARGS;
> goto fail;
> }
> ctx = THIS->ctx;
>
> this = req->svc->xl;
> /* this is to ensure config_params is populated with the first brick
>  * details at first place if brick multiplexing is enabled
>  */
> config_params = dict_copy_with_ref(this->options, NULL);
>
> *buf = gf_memdup(args.dict.dict_val, args.dict.dict_len);*
>

This is probably unnecessary if we can really remove extra_free. Probably
it's here because args.dict.dict_val will be destroyed later.


> if (buf == NULL) {
> op_ret = -1;
> op_errno = ENOMEM;
> goto fail;
> }
>
> *ret = dict_unserialize(buf, args.dict.dict_len, );*
> if (ret < 0) {
> ret = dict_set_str(reply, "ERROR",
>"Internal error: failed to unserialize "
>"request dictionary");
> if (ret < 0)
> gf_msg_debug(this->name, 0,
>  "failed to set error "
>  "msg \"%s\"",
>  "Int

Re: [Gluster-devel] What do extra_free and extrastd_free params do in the dictionary object?

2020-01-09 Thread Xavi Hernandez
On Thu, Jan 9, 2020 at 10:22 AM Amar Tumballi  wrote:

>
>
> On Thu, Jan 9, 2020 at 2:33 PM Xavi Hernandez  wrote:
>
>> On Thu, Jan 9, 2020 at 9:44 AM Amar Tumballi  wrote:
>>
>>>
>>>
>>> On Thu, Jan 9, 2020 at 1:38 PM Xavi Hernandez 
>>> wrote:
>>>
>>>> On Sun, Dec 22, 2019 at 4:56 PM Yaniv Kaul  wrote:
>>>>
>>>>> I could not find a relevant use for them. Can anyone enlighten me?
>>>>>
>>>>
>>>> I'm not sure why they are needed. They seem to be used to keep the
>>>> unserialized version of a dict around until the dict is destroyed. I
>>>> thought this could be because we were using pointers to the unserialized
>>>> data inside dict, but that's not the case currently. However, checking very
>>>> old versions (pre 3.2), I see that dict values were not allocated, but a
>>>> pointer to the unserialized data was used.
>>>>
>>>
>>> Xavi,
>>>
>>> While you are right about the intent, it is used still, at least when I
>>> grepped latest repo to keep a reference in protocol layer.
>>>
>>> This is done to reduce a copy after the dictionary's binary content is
>>> received from RPC. The 'extra_free' flag is used when we use a
>>> GF_*ALLOC()'d buffer in protocol to receive dictionary, and extra_stdfree
>>> is used when RPC itself allocates the buffer and hence uses 'free()' to
>>> free the buffer.
>>>
>>
>> I don't see it. When dict_unserialize() is called, key and value are
>> allocated and copied, so  why do we need to keep the raw data after that ?
>>
>> In 3.1 the value was simply a pointer to the unserialized data, but
>> starting with 3.2, value is memdup'ed. Key is always copied. I don't see
>> any other reference to the unserialized data right now. I think that
>> instead of assigning the raw data to extra_(std)free, we should simply
>> release that memory and remove those fields.
>>
>> Am I missing something else ?
>>
>
> I did grep on 'extra_stdfree' and 'extra_free' and saw that many
> handshake/ and protocol code seemed to use it. Haven't gone deeper to check
> which part.
>
> [amar@kadalu glusterfs]$ git grep extra_stdfree | wc -l
> 40
> [amar@kadalu glusterfs]$ git grep extra_free | wc -l
> 5
>

Yes, they call dict_unserialize() and then store the unserialized data into
those variables. That's what I'm saying: keeping it is not necessary.


>
>>
>>
>>>
>>>> I think this is not needed anymore. Probably we could remove these
>>>> fields if that's the only reason.
>>>>
>>>
>>> If keeping them is hard to maintain, we can add few allocation to remove
>>> those elements, that shouldn't matter much IMO too. We are not using
>>> dictionary itself as protocol now (which we did in 1.x series though).
>>>
>>> Regards,
>>> Amar
>>> ---
>>> https://kadalu.io
>>>
>>>
>>>
>>>> TIA,
>>>>> Y.
>>>>> ___
>>>>>
>>>>> Community Meeting Calendar:
>>>>>
>>>>> APAC Schedule -
>>>>> Every 2nd and 4th Tuesday at 11:30 AM IST
>>>>> Bridge: https://bluejeans.com/441850968
>>>>>
>>>>>
>>>>> NA/EMEA Schedule -
>>>>> Every 1st and 3rd Tuesday at 01:00 PM EDT
>>>>> Bridge: https://bluejeans.com/441850968
>>>>>
>>>>> Gluster-devel mailing list
>>>>> Gluster-devel@gluster.org
>>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>>>
>>>>> ___
>>>>
>>>> Community Meeting Calendar:
>>>>
>>>> APAC Schedule -
>>>> Every 2nd and 4th Tuesday at 11:30 AM IST
>>>> Bridge: https://bluejeans.com/441850968
>>>>
>>>>
>>>> NA/EMEA Schedule -
>>>> Every 1st and 3rd Tuesday at 01:00 PM EDT
>>>> Bridge: https://bluejeans.com/441850968
>>>>
>>>> Gluster-devel mailing list
>>>> Gluster-devel@gluster.org
>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>>
>>>>
___

Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/441850968


NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/441850968

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] What do extra_free and extrastd_free params do in the dictionary object?

2020-01-09 Thread Xavi Hernandez
On Thu, Jan 9, 2020 at 9:44 AM Amar Tumballi  wrote:

>
>
> On Thu, Jan 9, 2020 at 1:38 PM Xavi Hernandez  wrote:
>
>> On Sun, Dec 22, 2019 at 4:56 PM Yaniv Kaul  wrote:
>>
>>> I could not find a relevant use for them. Can anyone enlighten me?
>>>
>>
>> I'm not sure why they are needed. They seem to be used to keep the
>> unserialized version of a dict around until the dict is destroyed. I
>> thought this could be because we were using pointers to the unserialized
>> data inside dict, but that's not the case currently. However, checking very
>> old versions (pre 3.2), I see that dict values were not allocated, but a
>> pointer to the unserialized data was used.
>>
>
> Xavi,
>
> While you are right about the intent, it is used still, at least when I
> grepped latest repo to keep a reference in protocol layer.
>
> This is done to reduce a copy after the dictionary's binary content is
> received from RPC. The 'extra_free' flag is used when we use a
> GF_*ALLOC()'d buffer in protocol to receive dictionary, and extra_stdfree
> is used when RPC itself allocates the buffer and hence uses 'free()' to
> free the buffer.
>

I don't see it. When dict_unserialize() is called, the key and value are
allocated and copied, so why do we need to keep the raw data after that?

In 3.1 the value was simply a pointer to the unserialized data, but
starting with 3.2 the value is memdup'ed, and the key is always copied. I
don't see any other reference to the unserialized data right now. I think
that instead of assigning the raw data to extra_(std)free, we should simply
release that memory and remove those fields.

Am I missing something else ?


>
>> I think this is not needed anymore. Probably we could remove these fields
>> if that's the only reason.
>>
>
> If keeping them is hard to maintain, we can add few allocation to remove
> those elements, that shouldn't matter much IMO too. We are not using
> dictionary itself as protocol now (which we did in 1.x series though).
>
> Regards,
> Amar
> ---
> https://kadalu.io
>
>
>
>> TIA,
>>> Y.
>>> ___
>>>
>>> Community Meeting Calendar:
>>>
>>> APAC Schedule -
>>> Every 2nd and 4th Tuesday at 11:30 AM IST
>>> Bridge: https://bluejeans.com/441850968
>>>
>>>
>>> NA/EMEA Schedule -
>>> Every 1st and 3rd Tuesday at 01:00 PM EDT
>>> Bridge: https://bluejeans.com/441850968
>>>
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>> ___
>>
>> Community Meeting Calendar:
>>
>> APAC Schedule -
>> Every 2nd and 4th Tuesday at 11:30 AM IST
>> Bridge: https://bluejeans.com/441850968
>>
>>
>> NA/EMEA Schedule -
>> Every 1st and 3rd Tuesday at 01:00 PM EDT
>> Bridge: https://bluejeans.com/441850968
>>
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>>
___

Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/441850968


NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/441850968

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] What do extra_free and extrastd_free params do in the dictionary object?

2020-01-09 Thread Xavi Hernandez
On Sun, Dec 22, 2019 at 4:56 PM Yaniv Kaul  wrote:

> I could not find a relevant use for them. Can anyone enlighten me?
>

I'm not sure why they are needed. They seem to be used to keep the
unserialized version of a dict around until the dict is destroyed. I
thought this could be because we were using pointers to the unserialized
data inside dict, but that's not the case currently. However, checking very
old versions (pre 3.2), I see that dict values were not allocated, but a
pointer to the unserialized data was used.

I think this is not needed anymore. Probably we could remove these fields
if that's the only reason.

> TIA,
> Y.
> ___
>
> Community Meeting Calendar:
>
> APAC Schedule -
> Every 2nd and 4th Tuesday at 11:30 AM IST
> Bridge: https://bluejeans.com/441850968
>
>
> NA/EMEA Schedule -
> Every 1st and 3rd Tuesday at 01:00 PM EDT
> Bridge: https://bluejeans.com/441850968
>
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
___

Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/441850968


NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/441850968

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] [RFC] inode table locking contention reduction experiment

2019-10-30 Thread Xavi Hernandez
Hi Changwei,

On Tue, Oct 29, 2019 at 7:56 AM Changwei Ge  wrote:

> Hi,
>
> I am recently working on reducing inode_[un]ref() locking contention by
> getting rid of inode table lock. Just use inode lock to protect inode
> REF. I have already discussed a couple rounds with several Glusterfs
> developers via emails and Gerrit and basically get understood on major
> logic around.
>
> Currently, inode REF can be ZERO and be reused by increasing it to ONE.
> This is IMO why we have to burden so much work for inode table when
> REF/UNREF. It makes inode [un]ref() and inode table and dentries(alias)
> searching hard to run concurrently.
>
> So my question is in what cases, how can we find a inode whose REF is ZERO?
>
> As Glusterfs store its inode memory address into kernel/fuse, can we
> conclude that only fuse_ino_to_inode() can bring back a REF=0 inode?
>

Yes, when an inode gets refs = 0, it means that gluster code is not using
it anywhere, so it cannot be referenced again unless the kernel sends new
requests on the same inode. Once refs = 0 and nlookup = 0, the inode can be
destroyed.

The inode code is quite complex right now and I haven't had time to
investigate this further, but I think we could simplify inode management
significantly (especially unref) if we add a reference when nlookup becomes
> 0 and remove that reference when nlookup becomes 0 again. Maybe with this
approach we could avoid the inode table lock in many cases. However, we
need to make sure we correctly handle the invalidation logic to keep the
inode table size under control.

Regards,

Xavi


>
> Thanks,
> Changwei
> ___
>
> Community Meeting Calendar:
>
> APAC Schedule -
> Every 2nd and 4th Tuesday at 11:30 AM IST
> Bridge: https://bluejeans.com/118564314
>
> NA/EMEA Schedule -
> Every 1st and 3rd Tuesday at 01:00 PM EDT
> Bridge: https://bluejeans.com/118564314
>
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
___

Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/118564314

NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/118564314

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] Regards to taking lock in dictionary

2019-10-24 Thread Xavi Hernandez
Hi Mohit,

On Thu, Oct 24, 2019 at 5:19 AM Mohit Agrawal  wrote:

>
> I have a query why do we take a lock at the time of doing an operation in
> a dictionary.I have observed in testing it seems there is no codepath where
>   we are using the dictionary parallel. In theory, the dictionary flow is
> like one xlator put some data in a dictionary and pass to the next xlator
> and xlator  and originator xlator won't touch dictionary until the called
> xlator returns.
>
>   To prove the same I have executed below test case
>   1) I have changed all LOCK/UNLOCK macro with dictlock/dictunlock
> function in the dictionary code and call the LOCK/UNLOCK macros
>   2) Create a 1x3 volume and mount the volume
>   3) Run stap script on one of the node to measure dict lock contention to
> log the entry if more than one thread access the dictionary at the same time
>   4) Run smallfile tool like below with multiple operations like
> create/append/cleanup
>  /root/sync.sh; python /root/smallfile/smallfile_cli.py --operation
>  --threads 8 --file-size 64 --files 5000 --top /mnt/test
>  --host-set "hp-m300-2.gsslab.pnq.redhat.com";
>
>I have not found any single thread that is trying to access the
> dictionary while dictlock is already held by some other thread.
>
>   I have uploaded a patch(
> https://review.gluster.org/#/c/glusterfs/+/23603/) after converting the
> if condition to false in dictlock/unlock and run the
>   regression test suite.I am not getting major failures after removing the
> lock from a dictionary.
>
>   Please share your view on the same if the dictionary is not consumed by
> multiple threads at the same time still we do need to take lock
>   in the dictionary.
>   Please share if I need to test something more to validate the same.
>

That's very interesting. From a logical point of view I think that we
shouldn't have two threads accessing the same dict at the same time. The
only thing the lock guarantees is the internal integrity of the dict
structure; if we had concurrent access to the dict, one of the threads
could see keys and/or values appearing/disappearing/changing spuriously,
which surely could make things not work well. Apparently we don't have
these issues.

If that's true, I think we should be able to completely get rid of the lock
in the dict structure.

This change is dangerous, however, and it would need extensive testing. If
any issue is found, we would probably need to fix that issue instead of
adding the locks back.

Another important thing to do is to run some performance tests. If we
really don't have contention on the dict lock, the real cost of the lock is
an atomic operation, which is not free but is much cheaper than the kernel
context switch that happens in case of contention. We would need to measure
the improvement to weigh the benefits of this change.

Xavi



> Regards,
> Mohit Agrawal
> ___
>
> Community Meeting Calendar:
>
> APAC Schedule -
> Every 2nd and 4th Tuesday at 11:30 AM IST
> Bridge: https://bluejeans.com/118564314
>
> NA/EMEA Schedule -
> Every 1st and 3rd Tuesday at 01:00 PM EDT
> Bridge: https://bluejeans.com/118564314
>
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
___

Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/118564314

NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/118564314

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] Solving Ctime Issue with legacy files [BUG 1593542]

2019-06-18 Thread Xavi Hernandez
Hi Kotresh,

On Tue, Jun 18, 2019 at 8:33 AM Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

> Hi Xavi,
>
> Reply inline.
>
> On Mon, Jun 17, 2019 at 5:38 PM Xavi Hernandez 
> wrote:
>
>> Hi Kotresh,
>>
>> On Mon, Jun 17, 2019 at 1:50 PM Kotresh Hiremath Ravishankar <
>> khire...@redhat.com> wrote:
>>
>>> Hi All,
>>>
>>> The ctime feature is enabled by default from release gluster-6. But as
>>> explained in bug [1]  there is a known issue with legacy files i.e., the
>>> files which are created before ctime feature is enabled. These files would
>>> not have "trusted.glusterfs.mdata" xattr which maintain time attributes. So
>>> on, accessing those files, it gets created with latest time attributes.
>>> This is not correct because all the time attributes (atime, mtime, ctime)
>>> get updated instead of required time attributes.
>>>
>>> There are couple of approaches to solve this.
>>>
>>> 1. On accessing the files, let the posix update the time attributes
>>> from  the back end file on respective replicas. This obviously results in
>>> inconsistent "trusted.glusterfs.mdata" xattr values with in replica set.
>>> AFR/EC should heal this xattr as part of metadata heal upon accessing this
>>> file. It can chose to replicate from any subvolume. Ideally we should
>>> consider the highest time from the replica and treat it as source but I
>>> think that should be fine as replica time attributes are mostly in sync
>>> with max difference in order of few seconds if am not wrong.
>>>
>>>But client side self heal is disabled by default because of
>>> performance reasons [2]. If we chose to go by this approach, we need to
>>> consider enabling at least client side metadata self heal by default.
>>> Please share your thoughts on enabling the same by default.
>>>
>>> 2. Don't let posix update the legacy files from the backend. On lookup
>>> cbk, let the utime xlator update the time attributes from statbuf received
>>> synchronously.
>>>
>>> Both approaches are similar as both results in updating the xattr during
>>> lookup. Please share your inputs on which approach is better.
>>>
>>
>> I prefer second approach. First approach is not feasible for EC volumes
>> because self-heal requires that k bricks (on a k+r configuration) agree on
>> the value of this xattr, otherwise it considers the metadata damaged and
>> needs manual intervention to fix it. During upgrade, first r bricks with be
>> upgraded without problems, but trusted.glusterfs.mdata won't be healed
>> because r < k. In fact this xattr will be removed from new bricks because
>> the majority of bricks agree on xattr not being present. Once the r+1 brick
>> is upgraded, it's possible that posix sets different values for
>> trusted.glusterfs.mdata, which will cause self-heal to fail.
>>
>> Second approach seems better to me if guarded by a new option that
>> enables this behavior. utime xlator should only update the mdata xattr if
>> that option is set, and that option should only be settable once all nodes
>> have been upgraded (controlled by op-version). In this situation the first
>> lookup on a file where utime detects that mdata is not set, will require a
>> synchronous update. I think this is good enough because it will only happen
>> once per file. We'll need to consider cases where different clients do
>> lookups at the same time, but I think this can be easily solved by ignoring
>> the request if mdata is already present.
>>
>
> Initially there were two issues.
> 1. Upgrade Issue with EC Volume as described by you.
>  This is solved with the patch [1]. There was a bug in ctime posix
> where it was creating xattr even when ctime is not set on client (during
> utimes system call). With patch [1], the behavior
> is that utimes system call will only update the
> "trusted.glusterfs.mdata" xattr if present else it won't create. The new
> xattr creation should only happen during entry operations (i.e create,
> mknod and others).
>So there won't be any problems with upgrade. I think we don't need new
> option dependent on op version if I am not wrong.
>

If I'm not missing something, we cannot allow creation of mdata xattr even
for create/mknod/setattr fops. Doing so could cause the same problem if
some of the bricks are not upgraded and do not support mdata yet (or they
have ctime disabled by default).


> 2. After upgrade, how do we update "trusted.glusterfs.mdata"

Re: [Gluster-devel] Solving Ctime Issue with legacy files [BUG 1593542]

2019-06-17 Thread Xavi Hernandez
Hi Kotresh,

On Mon, Jun 17, 2019 at 1:50 PM Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

> Hi All,
>
> The ctime feature is enabled by default from release gluster-6. But as
> explained in bug [1]  there is a known issue with legacy files i.e., the
> files which are created before ctime feature is enabled. These files would
> not have "trusted.glusterfs.mdata" xattr which maintain time attributes. So
> on, accessing those files, it gets created with latest time attributes.
> This is not correct because all the time attributes (atime, mtime, ctime)
> get updated instead of required time attributes.
>
> There are couple of approaches to solve this.
>
> 1. On accessing the files, let the posix update the time attributes from
> the back end file on respective replicas. This obviously results in
> inconsistent "trusted.glusterfs.mdata" xattr values with in replica set.
> AFR/EC should heal this xattr as part of metadata heal upon accessing this
> file. It can chose to replicate from any subvolume. Ideally we should
> consider the highest time from the replica and treat it as source but I
> think that should be fine as replica time attributes are mostly in sync
> with max difference in order of few seconds if am not wrong.
>
>But client side self heal is disabled by default because of performance
> reasons [2]. If we chose to go by this approach, we need to consider
> enabling at least client side metadata self heal by default. Please share
> your thoughts on enabling the same by default.
>
> 2. Don't let posix update the legacy files from the backend. On lookup
> cbk, let the utime xlator update the time attributes from statbuf received
> synchronously.
>
> Both approaches are similar as both results in updating the xattr during
> lookup. Please share your inputs on which approach is better.
>

I prefer the second approach. The first approach is not feasible for EC
volumes because self-heal requires that k bricks (in a k+r configuration)
agree on the value of this xattr; otherwise it considers the metadata
damaged and manual intervention is needed to fix it. During an upgrade, the
first r bricks will be upgraded without problems, but
trusted.glusterfs.mdata won't be healed because r < k. In fact, this xattr
will be removed from the upgraded bricks because the majority of bricks
agree on the xattr not being present. Once the (r+1)-th brick is upgraded,
it's possible that posix sets different values for trusted.glusterfs.mdata,
which will cause self-heal to fail.

The second approach seems better to me if it's guarded by a new option that
enables this behavior. The utime xlator should only update the mdata xattr
if that option is set, and the option should only be settable once all
nodes have been upgraded (controlled by op-version). In this situation, the
first lookup on a file where utime detects that mdata is not set will
require a synchronous update. I think this is good enough because it will
only happen once per file. We'll need to consider cases where different
clients do lookups at the same time, but I think this can easily be solved
by ignoring the request if mdata is already present.
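
As a side note, checking whether a given file already has the xattr is easy
when looking directly at a brick (the path is just an example):

# getfattr -e hex -n trusted.glusterfs.mdata /bricks/brick1/path/to/file

Legacy files created before enabling ctime simply won't have the
trusted.glusterfs.mdata key.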

Xavi


>
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1593542
> [2] https://github.com/gluster/glusterfs/issues/473
>
> --
> Thanks and Regards,
> Kotresh H R
>
___

Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/836554017

NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/486278655

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] Should we enable contention notification by default ?

2019-06-06 Thread Xavi Hernandez
Missed the patch link: https://review.gluster.org/c/glusterfs/+/22828

On Thu, Jun 6, 2019 at 8:32 AM Xavi Hernandez  wrote:

> On Thu, May 2, 2019 at 5:45 PM Atin Mukherjee 
> wrote:
>
>>
>>
>> On Thu, 2 May 2019 at 20:38, Xavi Hernandez 
>> wrote:
>>
>>> On Thu, May 2, 2019 at 4:06 PM Atin Mukherjee <
>>> atin.mukherje...@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On Thu, 2 May 2019 at 19:14, Xavi Hernandez 
>>>> wrote:
>>>>
>>>>> On Thu, 2 May 2019, 15:37 Milind Changire, 
>>>>> wrote:
>>>>>
>>>>>> On Thu, May 2, 2019 at 6:44 PM Xavi Hernandez 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ashish,
>>>>>>>
>>>>>>> On Thu, May 2, 2019 at 2:17 PM Ashish Pandey 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Xavi,
>>>>>>>>
>>>>>>>> I would like to keep this option (features.lock-notify-contention)
>>>>>>>> enabled by default.
>>>>>>>> However, I can see that there is one more option which will impact
>>>>>>>> the working of this option which is "notify-contention-delay"
>>>>>>>>
>>>>>>>
>>>>>> Just a nit. I wish the option was called "notify-contention-interval"
>>>>>> The "delay" part doesn't really emphasize where the delay would be
>>>>>> put in.
>>>>>>
>>>>>
>>>>> It makes sense. Maybe we can also rename it or add a second name
>>>>> (alias). If there are no objections, I will send a patch with the change.
>>>>>
>>>>> Xavi
>>>>>
>>>>>
>>>>>>
>>>>>>>  .description = "This value determines the minimum amount of
>>>>>>>> time "
>>>>>>>> "(in seconds) between upcall contention
>>>>>>>> notifications "
>>>>>>>> "on the same inode. If multiple lock requests
>>>>>>>> are "
>>>>>>>> "received during this period, only one upcall
>>>>>>>> will "
>>>>>>>> "be sent."},
>>>>>>>>
>>>>>>>> I am not sure what should be the best value for this option if we
>>>>>>>> want to keep features.lock-notify-contention ON by default?
>>>>>>>> It looks like if we keep the value of notify-contention-delay more,
>>>>>>>> say 5 sec, it will wait for this much time to send up call
>>>>>>>> notification which does not look good.
>>>>>>>>
>>>>>>>
>>>>>>> No, the first notification is sent immediately. What this option
>>>>>>> does is to define the minimum interval between notifications. This 
>>>>>>> interval
>>>>>>> is per lock. This is done to avoid storms of notifications if many 
>>>>>>> requests
>>>>>>> come referencing the same lock.
>>>>>>>
>>>>>>> Is my understanding correct?
>>>>>>>> What will be impact of this value and what should be the default
>>>>>>>> value of this option?
>>>>>>>>
>>>>>>>
>>>>>>> I think the current default value of 5 seconds seems good enough. If
>>>>>>> there are many bricks, each brick could send a notification per lock. 
>>>>>>> 1000
>>>>>>> bricks would mean a client would receive 1000 notifications every 5
>>>>>>> seconds. It doesn't seem too much, but in those cases 10, and 
>>>>>>> considering
>>>>>>> we could have other locks, maybe a higher value could be better.
>>>>>>>
>>>>>>> Xavi
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> ---
>>>>>>>> Ashish
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>&

Re: [Gluster-devel] Should we enable contention notification by default ?

2019-06-06 Thread Xavi Hernandez
On Thu, May 2, 2019 at 5:45 PM Atin Mukherjee 
wrote:

>
>
> On Thu, 2 May 2019 at 20:38, Xavi Hernandez  wrote:
>
>> On Thu, May 2, 2019 at 4:06 PM Atin Mukherjee 
>> wrote:
>>
>>>
>>>
>>> On Thu, 2 May 2019 at 19:14, Xavi Hernandez 
>>> wrote:
>>>
>>>> On Thu, 2 May 2019, 15:37 Milind Changire,  wrote:
>>>>
>>>>> On Thu, May 2, 2019 at 6:44 PM Xavi Hernandez 
>>>>> wrote:
>>>>>
>>>>>> Hi Ashish,
>>>>>>
>>>>>> On Thu, May 2, 2019 at 2:17 PM Ashish Pandey 
>>>>>> wrote:
>>>>>>
>>>>>>> Xavi,
>>>>>>>
>>>>>>> I would like to keep this option (features.lock-notify-contention)
>>>>>>> enabled by default.
>>>>>>> However, I can see that there is one more option which will impact
>>>>>>> the working of this option which is "notify-contention-delay"
>>>>>>>
>>>>>>
>>>>> Just a nit. I wish the option was called "notify-contention-interval"
>>>>> The "delay" part doesn't really emphasize where the delay would be put
>>>>> in.
>>>>>
>>>>
>>>> It makes sense. Maybe we can also rename it or add a second name
>>>> (alias). If there are no objections, I will send a patch with the change.
>>>>
>>>> Xavi
>>>>
>>>>
>>>>>
>>>>>>  .description = "This value determines the minimum amount of time
>>>>>>> "
>>>>>>> "(in seconds) between upcall contention
>>>>>>> notifications "
>>>>>>> "on the same inode. If multiple lock requests
>>>>>>> are "
>>>>>>> "received during this period, only one upcall
>>>>>>> will "
>>>>>>> "be sent."},
>>>>>>>
>>>>>>> I am not sure what should be the best value for this option if we
>>>>>>> want to keep features.lock-notify-contention ON by default?
>>>>>>> It looks like if we keep the value of notify-contention-delay more,
>>>>>>> say 5 sec, it will wait for this much time to send up call
>>>>>>> notification which does not look good.
>>>>>>>
>>>>>>
>>>>>> No, the first notification is sent immediately. What this option does
>>>>>> is to define the minimum interval between notifications. This interval is
>>>>>> per lock. This is done to avoid storms of notifications if many requests
>>>>>> come referencing the same lock.
>>>>>>
>>>>>> Is my understanding correct?
>>>>>>> What will be impact of this value and what should be the default
>>>>>>> value of this option?
>>>>>>>
>>>>>>
>>>>>> I think the current default value of 5 seconds seems good enough. If
>>>>>> there are many bricks, each brick could send a notification per lock. 
>>>>>> 1000
>>>>>> bricks would mean a client would receive 1000 notifications every 5
>>>>>> seconds. It doesn't seem too much, but in those cases 10, and considering
>>>>>> we could have other locks, maybe a higher value could be better.
>>>>>>
>>>>>> Xavi
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> ---
>>>>>>> Ashish
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *From: *"Xavi Hernandez" 
>>>>>>> *To: *"gluster-devel" 
>>>>>>> *Cc: *"Pranith Kumar Karampuri" , "Ashish
>>>>>>> Pandey" , "Amar Tumballi" 
>>>>>>> *Sent: *Thursday, May 2, 2019 4:15:38 PM
>>>>>>> *Subject: *Should we enable contention notification by default ?
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>&g

Re: [Gluster-devel] Should we enable features.locks-notify.contention by default ?

2019-05-30 Thread Xavi Hernandez
On Thu, May 30, 2019 at 9:03 AM Ashish Pandey  wrote:

>
>
> I am only concerned about in-service upgrade.
> If a feature/option is not present in V1, then I would prefer not to
> enable it by default on V2.
>

The problem is that without enabling it, (other-)eager-lock will cause
performance issues in some cases. It doesn't seem good to keep an option
disabled if enabling it solves these problems.



> We have seen some problem in other-eager-lock when we changed it to enable
> by default.
>

Which problems ? I think the only issue with other-eager-lock has been
precisely that locks-notify-contention was disabled and a bug that needed
to be solved anyway.

The difference will be that upgraded bricks will start sending upcall
notifications. If clients are too old, these will simply be ignored. So I
don't see any problem right now.

Am I missing something ?


> ---
> Ashish
>
> --
> *From: *"Amar Tumballi Suryanarayan" 
> *To: *"Xavi Hernandez" 
> *Cc: *"gluster-devel" 
> *Sent: *Thursday, May 30, 2019 12:04:43 PM
> *Subject: *Re: [Gluster-devel] Should we enable
> features.locks-notify.contention by default ?
>
>
>
> On Thu, May 30, 2019 at 11:34 AM Xavi Hernandez 
> wrote:
>
>> Hi all,
>>
>> a patch [1] was added some time ago to send upcall notifications from the
>> locks xlator to the current owner of a granted lock when another client
>> tries to acquire the same lock (inodelk or entrylk). This makes it possible
>> to use eager-locking on the client side, which improves performance
>> significantly, while also keeping good performance when multiple clients
>> are accessing the same files (the current owner of the lock receives the
>> notification and releases it as soon as possible, allowing the other client
>> to acquire it and proceed very soon).
>>
>> Currently both AFR and EC are ready to handle these contention
>> notifications and both use eager-locking. However the upcall contention
>> notification is disabled by default.
>>
>> I think we should enabled it by default. Does anyone see any possible
>> issue if we do that ?
>>
>>
> If it helps performance, we should ideally do it.
>
> But, considering we are days away from glusterfs-7.0 branching, should we
> do it now, or wait for branch out, and make it default for next version?
> (so that it gets time for testing). Considering it is about consistency I
> would like to hear everyone's opinion here.
>
> Regards,
> Amar
>
>
>
>
>>
>> Regards,
>>
>> Xavi
>>
>> [1] https://review.gluster.org/c/glusterfs/+/14736
>> ___
>>
>>
> --
> Amar Tumballi (amarts)
>
> ___
>
> Community Meeting Calendar:
>
> APAC Schedule -
> Every 2nd and 4th Tuesday at 11:30 AM IST
> Bridge: https://bluejeans.com/836554017
>
> NA/EMEA Schedule -
> Every 1st and 3rd Tuesday at 01:00 PM EDT
> Bridge: https://bluejeans.com/486278655
>
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
>
___

Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/836554017

NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/486278655

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



[Gluster-devel] Should we enable features.locks-notify.contention by default ?

2019-05-30 Thread Xavi Hernandez
Hi all,

a patch [1] was added some time ago to send upcall notifications from the
locks xlator to the current owner of a granted lock when another client
tries to acquire the same lock (inodelk or entrylk). This makes it possible
to use eager-locking on the client side, which improves performance
significantly, while also keeping good performance when multiple clients
are accessing the same files (the current owner of the lock receives the
notification and releases it as soon as possible, allowing the other client
to acquire it and proceed very soon).

Currently both AFR and EC are ready to handle these contention
notifications and both use eager-locking. However the upcall contention
notification is disabled by default.

I think we should enable it by default. Does anyone see any possible issue
if we do that ?
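
For reference, until a decision is taken it can already be enabled per
volume with something like this (the volume name is just a placeholder):

# gluster volume set <volname> features.lock-notify-contention on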

Regards,

Xavi

[1] https://review.gluster.org/c/glusterfs/+/14736
___

Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/836554017

NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/486278655

Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel



Re: [Gluster-devel] Coverity scan - how does it ignore dismissed defects & annotations?

2019-05-03 Thread Xavi Hernandez
Hi Atin,

On Fri, May 3, 2019 at 10:57 AM Atin Mukherjee  wrote:

> I'm bit puzzled on the way coverity is reporting the open defects on GD1
> component. As you can see from [1], technically we have 6 open defects and
> all of the rest are being marked as dismissed. We tried to put some
> additional annotations in the code through [2] to see if coverity starts
> feeling happy but the result doesn't change. I still see in the report it
> complaints about open defect of GD1 as 25 (7 as High, 18 as medium and 1 as
> Low). More interestingly yesterday's report claimed we fixed 8 defects,
> introduced 1, but the overall count remained as 102. I'm not able to
> connect the dots of this puzzle, can anyone?
>

Maybe we need to modify all the dismissed CIDs so that Coverity considers
them again and, hopefully, marks them as solved with the newer updates.
They have been manually marked to be ignored, so they are still there...

Just a thought, I'm not sure how this really works.

Xavi


>
> [1] https://scan.coverity.com/projects/gluster-glusterfs/view_defects
> [2] https://review.gluster.org/#/c/22619/
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Weird performance behavior

2019-05-02 Thread Xavi Hernandez
Hi,

while doing some tests to compare performance I've found some weird
results. I've seen this in different tests, but probably the clearest and
easiest one to reproduce is using the smallfile tool to create files.

The test command is:

# python smallfile_cli.py --operation create --files-per-dir 100
--file-size 32768 --threads 16 --files 256 --top  --stonewall no


I've run this test 5 times sequentially using the same initial conditions
(at least this is what I think): bricks cleared, all gluster processes
stopped, volume destroyed and recreated, caches emptied.
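
In case anyone wants to reproduce the same initial conditions, something
like this between runs should be enough (brick paths are just an example):

# gluster volume stop test
# gluster volume delete test
# pkill gluster                       (make sure no gluster process survives)
# rm -rf /bricks/test*/brick
# echo 3 > /proc/sys/vm/drop_caches
# systemctl start glusterd            (then recreate the volume and mount it)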

This is the data I've obtained for each execution:

Time   us    sy    ni    id     wa     hi    si    st    read     write        use
 435   1.80  3.70  0.00  81.62  11.06  0.00  0.00  0.00  32.931   608715.575   97.632
 450   1.67  3.62  0.00  80.67  12.19  0.00  0.00  0.00  30.989   589078.308   97.714
 425   1.74  3.75  0.00  81.85  10.76  0.00  0.00  0.00  37.588   622034.812   97.706
 320   2.47  5.06  0.00  82.84   7.75  0.00  0.00  0.00  46.406   828637.359   96.891
 365   2.19  4.44  0.00  84.45   7.12  0.00  0.00  0.00  45.822   734566.685   97.466


Time is in seconds. us, sy, ni, id, wa, hi, si and st are the CPU times, as
reported by top. read and write are the disk throughput in KiB/s. use is
the disk usage percentage.

Based on this we can see that there's a big difference between the best and
the worst cases. More interestingly, in the runs that performed better,
disk utilization and CPU wait time were actually a bit lower.

The disk is an NVMe and I used a recent commit from master (2b86da69). The
volume type is a replica 3 with 3 bricks.

I'm not sure what could be causing this. Any idea? Can anyone try to
reproduce it to see whether it's a problem in my environment or a common
problem?

Thanks,

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Should we enable contention notification by default ?

2019-05-02 Thread Xavi Hernandez
On Thu, May 2, 2019 at 4:06 PM Atin Mukherjee 
wrote:

>
>
> On Thu, 2 May 2019 at 19:14, Xavi Hernandez  wrote:
>
>> On Thu, 2 May 2019, 15:37 Milind Changire,  wrote:
>>
>>> On Thu, May 2, 2019 at 6:44 PM Xavi Hernandez 
>>> wrote:
>>>
>>>> Hi Ashish,
>>>>
>>>> On Thu, May 2, 2019 at 2:17 PM Ashish Pandey 
>>>> wrote:
>>>>
>>>>> Xavi,
>>>>>
>>>>> I would like to keep this option (features.lock-notify-contention)
>>>>> enabled by default.
>>>>> However, I can see that there is one more option which will impact the
>>>>> working of this option which is "notify-contention-delay"
>>>>>
>>>>
>>> Just a nit. I wish the option was called "notify-contention-interval"
>>> The "delay" part doesn't really emphasize where the delay would be put
>>> in.
>>>
>>
>> It makes sense. Maybe we can also rename it or add a second name (alias).
>> If there are no objections, I will send a patch with the change.
>>
>> Xavi
>>
>>
>>>
>>>>  .description = "This value determines the minimum amount of time "
>>>>> "(in seconds) between upcall contention
>>>>> notifications "
>>>>> "on the same inode. If multiple lock requests are "
>>>>> "received during this period, only one upcall will
>>>>> "
>>>>> "be sent."},
>>>>>
>>>>> I am not sure what should be the best value for this option if we want
>>>>> to keep features.lock-notify-contention ON by default?
>>>>> It looks like if we keep the value of notify-contention-delay more,
>>>>> say 5 sec, it will wait for this much time to send up call
>>>>> notification which does not look good.
>>>>>
>>>>
>>>> No, the first notification is sent immediately. What this option does
>>>> is to define the minimum interval between notifications. This interval is
>>>> per lock. This is done to avoid storms of notifications if many requests
>>>> come referencing the same lock.
>>>>
>>>> Is my understanding correct?
>>>>> What will be impact of this value and what should be the default value
>>>>> of this option?
>>>>>
>>>>
>>>> I think the current default value of 5 seconds seems good enough. If
>>>> there are many bricks, each brick could send a notification per lock. 1000
>>>> bricks would mean a client would receive 1000 notifications every 5
>>>> seconds. It doesn't seem too much, but in those cases 10, and considering
>>>> we could have other locks, maybe a higher value could be better.
>>>>
>>>> Xavi
>>>>
>>>>
>>>>>
>>>>> ---
>>>>> Ashish
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *From: *"Xavi Hernandez" 
>>>>> *To: *"gluster-devel" 
>>>>> *Cc: *"Pranith Kumar Karampuri" , "Ashish
>>>>> Pandey" , "Amar Tumballi" 
>>>>> *Sent: *Thursday, May 2, 2019 4:15:38 PM
>>>>> *Subject: *Should we enable contention notification by default ?
>>>>>
>>>>> Hi all,
>>>>>
>>>>> there's a feature in the locks xlator that sends a notification to
>>>>> current owner of a lock when another client tries to acquire the same 
>>>>> lock.
>>>>> This way the current owner is made aware of the contention and can release
>>>>> the lock as soon as possible to allow the other client to proceed.
>>>>>
>>>>> This is specially useful when eager-locking is used and multiple
>>>>> clients access the same files and directories. Currently both replicated
>>>>> and dispersed volumes use eager-locking and can use contention 
>>>>> notification
>>>>> to force an early release of the lock.
>>>>>
>>>>> Eager-locking reduces the number of network requests required for each
>>>>> operation, improving performance, but could add delays to other cli

Re: [Gluster-devel] Should we enable contention notification by default ?

2019-05-02 Thread Xavi Hernandez
On Thu, 2 May 2019, 15:37 Milind Changire,  wrote:

> On Thu, May 2, 2019 at 6:44 PM Xavi Hernandez 
> wrote:
>
>> Hi Ashish,
>>
>> On Thu, May 2, 2019 at 2:17 PM Ashish Pandey  wrote:
>>
>>> Xavi,
>>>
>>> I would like to keep this option (features.lock-notify-contention)
>>> enabled by default.
>>> However, I can see that there is one more option which will impact the
>>> working of this option which is "notify-contention-delay"
>>>
>>
> Just a nit. I wish the option was called "notify-contention-interval".
> The "delay" part doesn't really emphasize where the delay would be put in.
>

It makes sense. Maybe we can also rename it or add a second name (alias).
If there are no objections, I will send a patch with the change.

Xavi


>
>>>  .description = "This value determines the minimum amount of time "
>>>                 "(in seconds) between upcall contention notifications "
>>>                 "on the same inode. If multiple lock requests are "
>>>                 "received during this period, only one upcall will "
>>>                 "be sent."},
>>>
>>> I am not sure what should be the best value for this option if we want
>>> to keep features.lock-notify-contention ON by default?
>>> It looks like if we keep the value of notify-contention-delay more, say
>>> 5 sec, it will wait for this much time to send up call
>>> notification which does not look good.
>>>
>>
>> No, the first notification is sent immediately. What this option does is
>> to define the minimum interval between notifications. This interval is per
>> lock. This is done to avoid storms of notifications if many requests come
>> referencing the same lock.
>>
>> Is my understanding correct?
>>> What will be impact of this value and what should be the default value
>>> of this option?
>>>
>>
>> I think the current default value of 5 seconds seems good enough. If
>> there are many bricks, each brick could send a notification per lock. 1000
>> bricks would mean a client would receive 1000 notifications every 5
>> seconds. It doesn't seem too much, but in those cases 10, and considering
>> we could have other locks, maybe a higher value could be better.
>>
>> Xavi
>>
>>
>>>
>>> ---
>>> Ashish
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> *From: *"Xavi Hernandez" 
>>> *To: *"gluster-devel" 
>>> *Cc: *"Pranith Kumar Karampuri" , "Ashish Pandey" <
>>> aspan...@redhat.com>, "Amar Tumballi" 
>>> *Sent: *Thursday, May 2, 2019 4:15:38 PM
>>> *Subject: *Should we enable contention notification by default ?
>>>
>>> Hi all,
>>>
>>> there's a feature in the locks xlator that sends a notification to
>>> current owner of a lock when another client tries to acquire the same lock.
>>> This way the current owner is made aware of the contention and can release
>>> the lock as soon as possible to allow the other client to proceed.
>>>
>>> This is specially useful when eager-locking is used and multiple clients
>>> access the same files and directories. Currently both replicated and
>>> dispersed volumes use eager-locking and can use contention notification to
>>> force an early release of the lock.
>>>
>>> Eager-locking reduces the number of network requests required for each
>>> operation, improving performance, but could add delays to other clients
>>> while it keeps the inode or entry locked. With the contention notification
>>> feature we avoid this delay, so we get the best performance with minimal
>>> issues in multiclient environments.
>>>
>>> Currently the contention notification feature is controlled by the
>>> 'features.lock-notify-contention' option and it's disabled by default.
>>> Should we enable it by default ?
>>>
>>> I don't see any reason to keep it disabled by default. Does anyone
>>> foresee any problem ?
>>>
>>> Regards,
>>>
>>> Xavi
>>>
>>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
>
> --
> Milind
>
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Should we enable contention notification by default ?

2019-05-02 Thread Xavi Hernandez
Hi Ashish,

On Thu, May 2, 2019 at 2:17 PM Ashish Pandey  wrote:

> Xavi,
>
> I would like to keep this option (features.lock-notify-contention) enabled
> by default.
> However, I can see that there is one more option which will impact the
> working of this option, which is "notify-contention-delay":
>  .description = "This value determines the minimum amount of time "
> "(in seconds) between upcall contention notifications "
> "on the same inode. If multiple lock requests are "
> "received during this period, only one upcall will "
> "be sent."},
>
> I am not sure what the best value for this option should be if we want to
> keep features.lock-notify-contention ON by default.
> It looks like if we set notify-contention-delay higher, say 5
> seconds, it will wait for that much time to send the upcall
> notification, which does not look good.
>

No, the first notification is sent immediately. What this option does is to
define the minimum interval between notifications. This interval is per
lock. This is done to avoid storms of notifications if many requests come
referencing the same lock.
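
In pseudo-C, the idea is roughly the following (just an illustration of the
behaviour, not the actual locks xlator code; all names are made up):

    #include <time.h>

    typedef struct {
        time_t last_notification;  /* 0 means no upcall sent yet for this lock */
    } contention_state_t;

    /* Returns 1 if an upcall should be sent now, 0 if it is suppressed
     * because one was already sent less than 'interval' seconds ago. */
    int
    should_notify(contention_state_t *state, time_t interval)
    {
        time_t now = time(NULL);

        if (state->last_notification == 0 ||
            (now - state->last_notification) >= interval) {
            state->last_notification = now;  /* first or sufficiently spaced */
            return 1;                        /* send the upcall immediately */
        }

        return 0;                            /* storm protection: skip this one */
    }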

> Is my understanding correct?
> What will be the impact of this value and what should be the default value of
> this option?
>

I think the current default value of 5 seconds seems good enough. If there
are many bricks, each brick could send a notification per lock. 1000 bricks
would mean a client would receive 1000 notifications every 5 seconds. It
doesn't seem too much, but in those cases, and considering we could have
other locks, maybe a higher value (e.g. 10 seconds) could be better.

Xavi


>
> ---
> Ashish
>
>
>
>
>
>
> --
> *From: *"Xavi Hernandez" 
> *To: *"gluster-devel" 
> *Cc: *"Pranith Kumar Karampuri" , "Ashish Pandey" <
> aspan...@redhat.com>, "Amar Tumballi" 
> *Sent: *Thursday, May 2, 2019 4:15:38 PM
> *Subject: *Should we enable contention notification by default ?
>
> Hi all,
>
> there's a feature in the locks xlator that sends a notification to current
> owner of a lock when another client tries to acquire the same lock. This
> way the current owner is made aware of the contention and can release the
> lock as soon as possible to allow the other client to proceed.
>
> This is specially useful when eager-locking is used and multiple clients
> access the same files and directories. Currently both replicated and
> dispersed volumes use eager-locking and can use contention notification to
> force an early release of the lock.
>
> Eager-locking reduces the number of network requests required for each
> operation, improving performance, but could add delays to other clients
> while it keeps the inode or entry locked. With the contention notification
> feature we avoid this delay, so we get the best performance with minimal
> issues in multiclient environments.
>
> Currently the contention notification feature is controlled by the
> 'features.lock-notify-contention' option and it's disabled by default.
> Should we enable it by default ?
>
> I don't see any reason to keep it disabled by default. Does anyone foresee
> any problem ?
>
> Regards,
>
> Xavi
>
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Should we enable contention notification by default ?

2019-05-02 Thread Xavi Hernandez
Hi all,

there's a feature in the locks xlator that sends a notification to the current
owner of a lock when another client tries to acquire the same lock. This
way the current owner is made aware of the contention and can release the
lock as soon as possible to allow the other client to proceed.

This is especially useful when eager-locking is used and multiple clients
access the same files and directories. Currently both replicated and
dispersed volumes use eager-locking and can use contention notification to
force an early release of the lock.

Eager-locking reduces the number of network requests required for each
operation, improving performance, but could add delays to other clients
while it keeps the inode or entry locked. With the contention notification
feature we avoid this delay, so we get the best performance with minimal
issues in multiclient environments.

Currently the contention notification feature is controlled by the
'features.lock-notify-contention' option and it's disabled by default.
Should we enable it by default ?
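
For anyone who wants to try it before any default changes, it can be enabled
per volume with the usual set command (volume name is a placeholder):

    gluster volume set <volname> features.lock-notify-contention on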

I don't see any reason to keep it disabled by default. Does anyone foresee
any problem ?

Regards,

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] test failure reports for last 15 days

2019-04-15 Thread Xavi Hernandez
On Mon, Apr 15, 2019 at 11:08 AM Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

>
>
> On Thu, Apr 11, 2019 at 2:59 PM Xavi Hernandez 
> wrote:
>
>> On Wed, Apr 10, 2019 at 7:25 PM Xavi Hernandez 
>> wrote:
>>
>>> On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee 
>>> wrote:
>>>
>>>> And now for last 15 days:
>>>>
>>>>
>>>> https://fstat.gluster.org/summary?start_date=2019-03-25_date=2019-04-10
>>>>
>>>> ./tests/bitrot/bug-1373520.t 18  ==> Fixed through
>>>> https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this
>>>> failing in brick mux post 5th April
>>>> ./tests/bugs/ec/bug-1236065.t 17  ==> happens only in brick mux,
>>>> needs analysis.
>>>>
>>>
>>> I've identified the problem here, but not the cause yet. There's a stale
>>> inodelk acquired by a process that is already dead, which causes inodelk
>>> requests from self-heal and other processes to block.
>>>
>>> The reason why it seemed to block in random places is that all commands
>>> are executed with the working directory pointing to a gluster directory
>>> which needs healing after the initial tests. Because of the stale inodelk,
>>> when any application tries to open a file in the working directory, it's
>>> blocked.
>>>
>>> I'll investigate what causes this.
>>>
>>
>> I think I've found the problem. This is a fragment of the brick log that
>> includes script steps, connections and disconnections of brick 0, and lock
>> requests to the problematic lock:
>>
>> [2019-04-11 08:22:20.381398]:++
>> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 66 kill_brick patchy jahernan
>> /d/backends/patchy2 ++
>> [2019-04-11 08:22:22.532646]:++
>> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 67 kill_brick patchy jahernan
>> /d/backends/patchy3 ++
>> [2019-04-11 08:22:23.709655] I [MSGID: 115029]
>> [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client
>> from
>> CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2
>> (version: 7dev) with subvol /d/backends/patchy1
>> [2019-04-11 08:22:23.792204] I [common.c:234:pl_trace_in] 8-patchy-locks:
>> [REQUEST] Locker = {Pid=29710, lk-owner=68580998b47f,
>> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2,
>> Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
>> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain:
>> patchy-disperse-0, start=0, len=0, pid=0}
>> [2019-04-11 08:22:23.792299] I [common.c:285:pl_trace_out]
>> 8-patchy-locks: [GRANTED] Locker = {Pid=29710, lk-owner=68580998b47f,
>> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2,
>> Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
>> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain:
>> patchy-disperse-0, start=0, len=0, pid=0}
>> [2019-04-11 08:22:24.628478]:++
>> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 68 5 online_brick_count ++
>> [2019-04-11 08:22:26.097092]:++
>> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o
>> 14.o 15.o 16.o 17.o 18.o 19.o 1.o 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++
>> [2019-04-11 08:22:26.333740]:++
>> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 71 ec_test_make ++
>> [2019-04-11 08:22:27.718963] I [MSGID: 115029]
>> [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client
>> from
>> CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3
>> (version: 7dev) with subvol /d/backends/patchy1
>> [2019-04-11 08:22:27.801416] I [common.c:234:pl_trace_in] 8-patchy-locks:
>> [REQUEST] Locker = {Pid=29885, lk-owner=68580998b47f,
>> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3,
>> Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
>> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain:
>> patchy-disperse-0, start=0, len=0, pid=0}
>> [2019-04-11 08:22:27.801434] E [inodelk.c:513:__inode_unlock_lock]
>> 8-patchy-locks:  Matching lock not found for unlock 0-9223372036854775807,
>> by 68580998b47f on 0x7f0ed0029190
>> 

[Gluster-devel] Possible issues with shared threads

2019-04-12 Thread Xavi Hernandez
Hi,

I've found some issues with memory accounting and I've written a patch [1]
to fix them. However during the tests I've found another problem:

In a brick-multiplexed environment, posix tries to start a single janitor
thread shared by all posix xlator instances; however, there are two issues:

1. The creation is not atomic and it could happen that more than one
janitor thread is started (unless xlator init is serialized in some way)
2. Even though the thread is global, it's using information from a single
instance (through 'this'). This means that once the first instance of posix
xlator is stopped, 'this' can be destroyed, but the janitor will continue
using it. From the memory accounting point of view, it means that whatever
this thread does is not tracked anymore.

Note that merely writing a log message already accesses 'this' and uses
dynamic memory.
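
As an illustration only (this is not the actual posix xlator code), the race
in point 1 could be closed with something like pthread_once(), so that the
shared janitor thread is created exactly once no matter how many instances
initialize concurrently. It doesn't address point 2, which requires the
thread to stop borrowing 'this' from a single instance:

    #include <pthread.h>

    static pthread_once_t janitor_once = PTHREAD_ONCE_INIT;
    static pthread_t      janitor_thread;

    static void *
    janitor_loop(void *arg)
    {
        (void)arg;
        /* ... periodic cleanup work, using only global/shared state ... */
        return NULL;
    }

    static void
    janitor_start(void)
    {
        pthread_create(&janitor_thread, NULL, janitor_loop, NULL);
    }

    /* Called from every instance's init(); only the first call creates
     * the thread. */
    void
    janitor_init_once(void)
    {
        pthread_once(&janitor_once, janitor_start);
    }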

I detected this problem in the posix xlator, but since there are other
threads that have been made global, maybe something similar could happen. I
think this needs to be checked and fixed.

Xavi

[1] https://review.gluster.org/c/glusterfs/+/22554
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] test failure reports for last 15 days

2019-04-11 Thread Xavi Hernandez
On Thu, Apr 11, 2019 at 11:28 AM Xavi Hernandez  wrote:

> On Wed, Apr 10, 2019 at 7:25 PM Xavi Hernandez 
> wrote:
>
>> On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee 
>> wrote:
>>
>>> And now for last 15 days:
>>>
>>>
>>> https://fstat.gluster.org/summary?start_date=2019-03-25_date=2019-04-10
>>>
>>> ./tests/bitrot/bug-1373520.t 18  ==> Fixed through
>>> https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this
>>> failing in brick mux post 5th April
>>> ./tests/bugs/ec/bug-1236065.t 17  ==> happens only in brick mux,
>>> needs analysis.
>>>
>>
>> I've identified the problem here, but not the cause yet. There's a stale
>> inodelk acquired by a process that is already dead, which causes inodelk
>> requests from self-heal and other processes to block.
>>
>> The reason why it seemed to block in random places is that all commands
>> are executed with the working directory pointing to a gluster directory
>> which needs healing after the initial tests. Because of the stale inodelk,
>> when any application tries to open a file in the working directory, it's
>> blocked.
>>
>> I'll investigate what causes this.
>>
>
> I think I've found the problem. This is a fragment of the brick log that
> includes script steps, connections and disconnections of brick 0, and lock
> requests to the problematic lock:
>
> [2019-04-11 08:22:20.381398]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++
> [2019-04-11 08:22:22.532646]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 67 kill_brick patchy jahernan /d/backends/patchy3 ++
> [2019-04-11 08:22:23.709655] I [MSGID: 115029]
> [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client
> from
> CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2
> (version: 7dev) with subvol /d/backends/patchy1
> [2019-04-11 08:22:23.792204] I [common.c:234:pl_trace_in] 8-patchy-locks:
> [REQUEST] Locker = {Pid=29710, lk-owner=68580998b47f,
> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2,
> Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain:
> patchy-disperse-0, start=0, len=0, pid=0}
> [2019-04-11 08:22:23.792299] I [common.c:285:pl_trace_out] 8-patchy-locks:
> [GRANTED] Locker = {Pid=29710, lk-owner=68580998b47f,
> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2,
> Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain:
> patchy-disperse-0, start=0, len=0, pid=0}
> [2019-04-11 08:22:24.628478]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 68 5 online_brick_count ++
> [2019-04-11 08:22:26.097092]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o 14.o 15.o 16.o 17.o 18.o 19.o 1.o
> 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++
> [2019-04-11 08:22:26.333740]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 71 ec_test_make ++
> [2019-04-11 08:22:27.718963] I [MSGID: 115029]
> [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client
> from
> CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3
> (version: 7dev) with subvol /d/backends/patchy1
> [2019-04-11 08:22:27.801416] I [common.c:234:pl_trace_in] 8-patchy-locks:
> [REQUEST] Locker = {Pid=29885, lk-owner=68580998b47f,
> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3,
> Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain:
> patchy-disperse-0, start=0, len=0, pid=0}
> [2019-04-11 08:22:27.801434] E [inodelk.c:513:__inode_unlock_lock]
> 8-patchy-locks:  Matching lock not found for unlock 0-9223372036854775807,
> by 68580998b47f on 0x7f0ed0029190
> [2019-04-11 08:22:27.801446] I [common.c:285:pl_trace_out] 8-patchy-locks:
> [Invalid argument] Locker = {Pid=29885, lk-owner=68580998b47f,
> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3,
> Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain:
> patchy-dis

Re: [Gluster-devel] test failure reports for last 15 days

2019-04-11 Thread Xavi Hernandez
On Wed, Apr 10, 2019 at 7:25 PM Xavi Hernandez  wrote:

> On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee 
> wrote:
>
>> And now for last 15 days:
>>
>>
>> https://fstat.gluster.org/summary?start_date=2019-03-25_date=2019-04-10
>>
>> ./tests/bitrot/bug-1373520.t 18  ==> Fixed through
>> https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this
>> failing in brick mux post 5th April
>> ./tests/bugs/ec/bug-1236065.t 17  ==> happens only in brick mux,
>> needs analysis.
>>
>
> I've identified the problem here, but not the cause yet. There's a stale
> inodelk acquired by a process that is already dead, which causes inodelk
> requests from self-heal and other processes to block.
>
> The reason why it seemed to block in random places is that all commands
> are executed with the working directory pointing to a gluster directory
> which needs healing after the initial tests. Because of the stale inodelk,
> when any application tries to open a file in the working directory, it's
> blocked.
>
> I'll investigate what causes this.
>

I think I've found the problem. This is a fragment of the brick log that
includes script steps, connections and disconnections of brick 0, and lock
requests to the problematic lock:

[2019-04-11 08:22:20.381398]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++
[2019-04-11 08:22:22.532646]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 67 kill_brick patchy jahernan /d/backends/patchy3 ++
[2019-04-11 08:22:23.709655] I [MSGID: 115029]
[server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client
from
CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2
(version: 7dev) with subvol /d/backends/patchy1
[2019-04-11 08:22:23.792204] I [common.c:234:pl_trace_in] 8-patchy-locks:
[REQUEST] Locker = {Pid=29710, lk-owner=68580998b47f,
Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2,
Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain:
patchy-disperse-0, start=0, len=0, pid=0}
[2019-04-11 08:22:23.792299] I [common.c:285:pl_trace_out] 8-patchy-locks:
[GRANTED] Locker = {Pid=29710, lk-owner=68580998b47f,
Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2,
Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain:
patchy-disperse-0, start=0, len=0, pid=0}
[2019-04-11 08:22:24.628478]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 68 5 online_brick_count ++
[2019-04-11 08:22:26.097092]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o 14.o 15.o 16.o 17.o 18.o 19.o 1.o
2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++
[2019-04-11 08:22:26.333740]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 71 ec_test_make ++
[2019-04-11 08:22:27.718963] I [MSGID: 115029]
[server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client
from
CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3
(version: 7dev) with subvol /d/backends/patchy1
[2019-04-11 08:22:27.801416] I [common.c:234:pl_trace_in] 8-patchy-locks:
[REQUEST] Locker = {Pid=29885, lk-owner=68580998b47f,
Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3,
Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain:
patchy-disperse-0, start=0, len=0, pid=0}
[2019-04-11 08:22:27.801434] E [inodelk.c:513:__inode_unlock_lock]
8-patchy-locks:  Matching lock not found for unlock 0-9223372036854775807,
by 68580998b47f on 0x7f0ed0029190
[2019-04-11 08:22:27.801446] I [common.c:285:pl_trace_out] 8-patchy-locks:
[Invalid argument] Locker = {Pid=29885, lk-owner=68580998b47f,
Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3,
Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain:
patchy-disperse-0, start=0, len=0, pid=0}

This is a fragment of the client log:

[2019-04-11 08:22:20.381398]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++
[2019-04-11 08:22:20.675938] I [MSGID: 114018]
[client.c:2333:client_rpc_notify] 0-patchy-client-1: disconnected from
patchy-client-1. Client process will keep trying to connect to glusterd
until brick's port is available
[2019-04-11

Re: [Gluster-devel] test failure reports for last 15 days

2019-04-10 Thread Xavi Hernandez
On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee  wrote:

> And now for last 15 days:
>
> https://fstat.gluster.org/summary?start_date=2019-03-25_date=2019-04-10
>
> ./tests/bitrot/bug-1373520.t 18  ==> Fixed through
> https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this
> failing in brick mux post 5th April
> ./tests/bugs/ec/bug-1236065.t 17  ==> happens only in brick mux, needs
> analysis.
>

I've identified the problem here, but not the cause yet. There's a stale
inodelk acquired by a process that is already dead, which causes inodelk
requests from self-heal and other processes to block.

The reason why it seemed to block in random places is that all commands are
executed with the working directory pointing to a gluster directory which
needs healing after the initial tests. Because of the stale inodelk, when
any application tries to open a file in the working directory, it's blocked.

I'll investigate what causes this.
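
In case someone else wants to look at it, the stale lock shows up in a brick
statedump, which lists every granted inodelk together with its owner and
client connection (assuming the default statedump location; the path may
differ on other setups):

    gluster volume statedump patchy
    grep ACTIVE /var/run/gluster/*.dump.*

A granted inodelk whose client connection no longer exists is the stale one.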

Xavi

> ./tests/basic/uss.t 15  ==> happens in both brick mux and non
> brick mux runs, test just simply times out. Needs urgent analysis.
> ./tests/basic/ec/ec-fix-openfd.t 13  ==> Fixed through
> https://review.gluster.org/#/c/22508/ , patch merged today.
> ./tests/basic/volfile-sanity.t  8  ==> Some race, though this succeeds
> in second attempt every time.
>
> There're plenty more with 5 instances of failure from many tests. We need
> all maintainers/owners to look through these failures and fix them, we
> certainly don't want to get into a stage where master is unstable and we
> have to lock down the merges till all these failures are resolved. So
> please help.
>
> (Please note fstat stats show up the retries as failures too which in a
> way is right)
>
>
> On Tue, Feb 26, 2019 at 5:27 PM Atin Mukherjee 
> wrote:
>
>> [1] captures the test failures report since last 30 days and we'd need
>> volunteers/component owners to see why the number of failures are so high
>> against few tests.
>>
>> [1]
>> https://fstat.gluster.org/summary?start_date=2019-01-26_date=2019-02-25=all
>>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Hello, I have a question about the erasure code translator, hope someone give me some advice, thank you!

2019-04-08 Thread Xavi Hernandez
Hi,

On Mon, Apr 8, 2019 at 8:50 AM PSC <1173701...@qq.com> wrote:

> Hi, I am a storage software coder who is interested in Gluster. I am
> trying to improve the read/write performance of it.
>
>
> I noticed that gluster is using Vandermonde matrix in erasure code
> encoding and decoding process. However, it is quite complicated to generate the
> inverse matrix of a Vandermonde matrix, which is necessary for decoding. The
> cost is O(n³).
>

That's not true, actually. A Vandermonde matrix can be inverted in O(n^2),
as the code currently does (look at ec_method_matrix_inverse() in
ec-method.c). Additionally, current code does caching of inverted matrices,
so in normal circumstances there shouldn't be many inverse computations.
Only when something changes (a brick dies or comes online), a new inverted
matrix could be needed.


>
> Using a Cauchy matrix can greatly cut down the cost of the process to find
> an inverse matrix, which is O(n²).
>
>
> I use intel storage accelerate library to replace the original ec
> encode/decode part of gluster. And it reduce the encode and decode time to
> about 50% of the original one.
>

How do you test that ? I also did some tests long ago and I didn't observe
that difference.

Doing a raw test of encoding/decoding performance of the current code using
Intel AVX2 extensions, it's able to process 7.6 GiB/s on a single core of
an Intel Xeon Silver 4114 when L1 cache is used. Without relying on
internal cache, it performs at 3.9 GiB/s. Does ISA-L provide better
performance for a matrix of the same size (4+2 non-systematic matrix) ?

>
> However, when I test the whole system. The read/write performance is
> almost the same as the original gluster.
>

Yes, there are many more things involved in the read and write operations
in gluster. For the particular case of EC, having to deal with many bricks
simultaneously (6 in this case) means that it's very sensitive to network
latency and communications delays, and this is probably one of the biggest
contributors. There are some other small latencies added by other xlators.

>
> I test it on three machines as servers. Each one had two bricks, both of
> them are SSD. So the total amount of bricks is 6. Use two of them as coding
> bricks. That is a 4+2 disperse volume configure.
>
>
> The capability of network card is 1Mbps. Theoretically it can support
> read and write with the speed faster than 1000MB/s.
>
>
> The actually performance of read is about 492MB/s.
>
> The actually performance of write is about 336MB/s.
>
>
> While the original one read at 461MB/s, write at 322MB/s
>
>
> Is there someone who can give me some advice about how to improve its
> performance? Which part is the critical defect on its performance if it’s
> not the ec translator?
>
>
> I did a time count on translators. It shows me the EC translator just takes 7%
> of the whole read/write process. Even though I know that some translators
> run asynchronously, the real percentage can be somewhat larger than
> that.
>
>
> Sincerely, thank you for your patience in reading my question!
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Issue with posix locks

2019-04-01 Thread Xavi Hernandez
On Mon, Apr 1, 2019 at 10:15 AM Soumya Koduri  wrote:

>
>
> On 4/1/19 10:02 AM, Pranith Kumar Karampuri wrote:
> >
> >
> > On Sun, Mar 31, 2019 at 11:29 PM Soumya Koduri  > <mailto:skod...@redhat.com>> wrote:
> >
> >
> >
> > On 3/29/19 11:55 PM, Xavi Hernandez wrote:
> >  > Hi all,
> >  >
> >  > there is one potential problem with posix locks when used in a
> >  > replicated or dispersed volume.
> >  >
> >  > Some background:
> >  >
> >  > Posix locks allow any process to lock a region of a file multiple
> > times,
> >  > but a single unlock on a given region will release all previous
> > locks.
> >  > Locked regions can be different for each lock request and they can
> >  > overlap. The resulting lock will cover the union of all locked
> > regions.
> >  > A single unlock (the region doesn't necessarily need to match any
> > of the
> >  > ranges used for locking) will create a "hole" in the currently
> > locked
> >  > region, independently of how many times a lock request covered
> > that region.
> >  >
> >  > For this reason, the locks xlator simply combines the locked
> regions
> >  > that are requested, but it doesn't track each individual lock
> range.
> >  >
> >  > Under normal circumstances this works fine. But there are some
> cases
> >  > where this behavior is not sufficient. For example, suppose we
> > have a
> >  > replica 3 volume with quorum = 2. Given the special nature of
> posix
> >  > locks, AFR sends the lock request sequentially to each one of the
> >  > bricks, to avoid that conflicting lock requests from other
> > clients could
> >  > require to unlock an already locked region on the client that has
> > not
> >  > got enough successful locks (i.e. quorum). An unlock here not
> > only would
> >  > cancel the current lock request. It would also cancel any
> previously
> >  > acquired lock.
> >  >
> >
> > I may not have fully understood, please correct me. AFAIU, lk xlator
> > merges locks only if both the lk-owner and the client opaque matches.
> >
> > In the case which you have mentioned above, considering clientA
> > acquired
> > locks on majority of quorum (say nodeA and nodeB) and clientB on
> nodeC
> > alone- clientB now has to unlock/cancel the lock it acquired on
> nodeC.
> >
> > You are saying the it could pose a problem if there were already
> > successful locks taken by clientB for the same region which would get
> > unlocked by this particular unlock request..right?
> >
> > Assuming the previous locks acquired by clientB are shared (otherwise
> > clientA wouldn't have got granted lock for the same region on nodeA &
> > nodeB), they would still hold true on nodeA & nodeB  as the unlock
> > request was sent to only nodeC. Since the majority of quorum nodes
> > still
> > hold the locks by clientB, this isn't serious issue IMO.
> >
> > I haven't looked into heal part but would like to understand if this
> is
> > really an issue in normal scenarios as well.
> >
> >
> > This is how I understood the code. Consider the following case:
> > Nodes A, B, C have locks with start and end offsets: 5-15 from mount-1
> > and lock-range 2-3 from mount-2.
> > If mount-1 requests nonblocking lock with lock-range 1-7 and in parallel
> > lets say mount-2 issued unlock of 2-3 as well.
> >
> > nodeA got unlock from mount-2 with range 2-3 then lock from mount-1 with
> > range 1-7, so the lock is granted and merged to give 1-15
> > nodeB got lock from mount-1 with range 1-7 before unlock of 2-3 which
> > leads to EAGAIN which will trigger unlocks on granted lock in mount-1
> > which will end up doing unlock of 1-7 on nodeA leading to lock-range
> > 8-15 instead of the original 5-15 on nodeA. Whereas nodeB and nodeC will
> > have range 5-15.
> >
> > Let me know if my understanding is wrong.
>
> Both of us mentioned the same points. So in the example you gave ,
> mount-1 lost its previous lock on nodeA but majority of the quorum
> (nodeB and nodeC) still have the previous lock  (range: 5-15) intact. So
> this shouldn't ideally lead to any issues as other conflicting locks are
> blocked or failed by majority of the nodes (pr

Re: [Gluster-devel] Issue with posix locks

2019-04-01 Thread Xavi Hernandez
On Sun, Mar 31, 2019 at 7:59 PM Soumya Koduri  wrote:

>
>
> On 3/29/19 11:55 PM, Xavi Hernandez wrote:
> > Hi all,
> >
> > there is one potential problem with posix locks when used in a
> > replicated or dispersed volume.
> >
> > Some background:
> >
> > Posix locks allow any process to lock a region of a file multiple times,
> > but a single unlock on a given region will release all previous locks.
> > Locked regions can be different for each lock request and they can
> > overlap. The resulting lock will cover the union of all locked regions.
> > A single unlock (the region doesn't necessarily need to match any of the
> > ranges used for locking) will create a "hole" in the currently locked
> > region, independently of how many times a lock request covered that
> region.
> >
> > For this reason, the locks xlator simply combines the locked regions
> > that are requested, but it doesn't track each individual lock range.
> >
> > Under normal circumstances this works fine. But there are some cases
> > where this behavior is not sufficient. For example, suppose we have a
> > replica 3 volume with quorum = 2. Given the special nature of posix
> > locks, AFR sends the lock request sequentially to each one of the
> > bricks, to avoid that conflicting lock requests from other clients could
> > require to unlock an already locked region on the client that has not
> > got enough successful locks (i.e. quorum). An unlock here not only would
> > cancel the current lock request. It would also cancel any previously
> > acquired lock.
> >
>
> I may not have fully understood, please correct me. AFAIU, lk xlator
> merges locks only if both the lk-owner and the client opaque matches.
>
> In the case which you have mentioned above, considering clientA acquired
> locks on majority of quorum (say nodeA and nodeB) and clientB on nodeC
> alone- clientB now has to unlock/cancel the lock it acquired on nodeC.
>
> You are saying the it could pose a problem if there were already
> successful locks taken by clientB for the same region which would get
> unlocked by this particular unlock request..right?
>

Yes


>
> Assuming the previous locks acquired by clientB are shared (otherwise
> clientA wouldn't have got granted lock for the same region on nodeA &
> nodeB), they would still hold true on nodeA & nodeB  as the unlock
> request was sent to only nodeC. Since the majority of quorum nodes still
> hold the locks by clientB, this isn't serious issue IMO.
>

Partially true. But if one of nodeA or nodeB dies or gets disconnected,
there won't be any majority of bricks with correct locks, even though there
are still 2 alive bricks. At this point, another client could successfully
acquire a lock that, in theory, is already acquired by another client.


> I haven't looked into heal part but would like to understand if this is
> really an issue in normal scenarios as well.
>

If we consider that a brick disconnection is a normal scenario (which I
think it should be on a large scale distributed file system), then this
issue exists. But even without brick disconnections we can get incorrect
results, as Pranith has just explained.

Xavi


>
> Thanks,
> Soumya
>
> > However, when something goes wrong (a brick dies during a lock request,
> > or there's a network partition or some other weird situation), it could
> > happen that even using sequential locking, only one brick succeeds the
> > lock request. In this case, AFR should cancel the previous lock (and it
> > does), but this also cancels any previously acquired lock on that
> > region, which is not good.
> >
> > A similar thing can happen if we try to recover (heal) posix locks that
> > were active after a brick has been disconnected (for any reason) and
> > then reconnected.
> >
> > To fix all these situations we need to change the way posix locks are
> > managed by locks xlator. One possibility would be to embed the lock
> > request inside an inode transaction using inodelk. Since inodelks do not
> > suffer this problem, the follwing posix lock could be sent safely.
> > However this implies an additional network request, which could cause
> > some performance impact. Eager-locking could minimize the impact in some
> > cases. However this approach won't work for lock recovery after a
> > disconnect.
> >
> > Another possibility is to send a special partial posix lock request
> > which won't be immediately merged with already existing locks once
> > granted. An additional confirmation request of the partial posix lock
> > will be required to fully grant the current lock a

[Gluster-devel] Issue with posix locks

2019-03-29 Thread Xavi Hernandez
Hi all,

there is one potential problem with posix locks when used in a replicated
or dispersed volume.

Some background:

Posix locks allow any process to lock a region of a file multiple times,
but a single unlock on a given region will release all previous locks.
Locked regions can be different for each lock request and they can overlap.
The resulting lock will cover the union of all locked regions. A single
unlock (the region doesn't necessarily need to match any of the ranges used
for locking) will create a "hole" in the currently locked region,
independently of how many times a lock request covered that region.

For this reason, the locks xlator simply combines the locked regions that
are requested, but it doesn't track each individual lock range.

Under normal circumstances this works fine. But there are some cases where
this behavior is not sufficient. For example, suppose we have a replica 3
volume with quorum = 2. Given the special nature of posix locks, AFR sends
the lock request sequentially to each one of the bricks, to avoid a situation
where conflicting lock requests from other clients force an unlock of an
already locked region on a client that has not yet got enough successful
locks (i.e. quorum). An unlock here would not only cancel the current lock
request; it would also cancel any previously acquired lock.

However, when something goes wrong (a brick dies during a lock request, or
there's a network partition or some other weird situation), it could happen
that even using sequential locking, only one brick succeeds the lock
request. In this case, AFR should cancel the previous lock (and it does),
but this also cancels any previously acquired lock on that region, which is
not good.

A similar thing can happen if we try to recover (heal) posix locks that
were active after a brick has been disconnected (for any reason) and then
reconnected.

To fix all these situations we need to change the way posix locks are
managed by locks xlator. One possibility would be to embed the lock request
inside an inode transaction using inodelk. Since inodelks do not suffer
this problem, the following posix lock could be sent safely. However this
implies an additional network request, which could cause some performance
impact. Eager-locking could minimize the impact in some cases. However this
approach won't work for lock recovery after a disconnect.

Another possibility is to send a special partial posix lock request which
won't be immediately merged with already existing locks once granted. An
additional confirmation request of the partial posix lock will be required
to fully grant the current lock and merge it with the existing ones. This
requires a new network request, which will add latency, and makes
everything more complex since there would be more combinations of states in
which something could fail.

So I think one possible solution would be the following:

1. Keep each posix lock as an independent object in locks xlator. This will
make it possible to "invalidate" any already granted lock without affecting
already established locks.

2. Additionally, we'll keep a sorted list of non-overlapping segments of
locked regions. And we'll count, for each region, how many locks are
referencing it. One lock can reference multiple segments, and each segment
can be referenced by multiple locks.

3. An additional lock request that overlaps with an existing segment, can
cause this segment to be split to satisfy the non-overlapping property.

4. When an unlock request is received, all segments intersecting with the
region are eliminated (it may require some segment splits on the edges),
and the unlocked region is subtracted from each lock associated to the
segment. If a lock gets an empty region, it's removed.

5. We'll create a special "remove lock" request that doesn't unlock a
region but removes an already granted lock. This will decrease the number
of references to each of the segments this lock was covering. If some
segment reaches 0, it's removed. Otherwise it remains there. This special
request will only be used internally to cancel already acquired locks that
cannot be fully granted due to quorum issues or any other problem.

In some weird cases, the list of segments can be huge (many locks
overlapping only on a single byte, so each segment represents only one
byte). We can try to find some smarter structure that minimizes this
problem or limit the number of segments (for example returning ENOLCK when
there are too many).
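
To make it a bit more concrete, this is roughly what the structures from
points 1 and 2 could look like (just a sketch, all names and fields are
invented):

    #include <stdint.h>
    #include <sys/types.h>

    /* One object per granted posix lock (point 1). */
    typedef struct posix_lock {
        pid_t              pid;
        uint64_t           owner;
        off_t              start;       /* region requested by this lock */
        off_t              end;
        struct posix_lock *next;
    } posix_lock_t;

    /* Sorted, non-overlapping segments of the locked region (point 2). */
    typedef struct lock_segment {
        off_t                start;     /* inclusive */
        off_t                end;       /* exclusive */
        uint32_t             refs;      /* posix_lock objects covering it */
        struct lock_segment *next;      /* kept sorted by 'start' */
    } lock_segment_t;

    /* A new overlapping lock splits segments as needed (point 3) and bumps
     * 'refs' on each segment it covers. An unlock removes the intersecting
     * segments and subtracts the region from every lock that referenced them
     * (point 4). A "remove lock" request only decrements 'refs', freeing
     * segments that reach zero (point 5). */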

What do you think ?

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-28 Thread Xavi Hernandez
On Thu, Mar 28, 2019 at 3:05 AM Raghavendra Gowdappa 
wrote:

>
>
> On Wed, Mar 27, 2019 at 8:38 PM Xavi Hernandez 
> wrote:
>
>> On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri <
>> pkara...@redhat.com> wrote:
>>
>>>
>>>
>>> On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez 
>>> wrote:
>>>
>>>> On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri <
>>>> pkara...@redhat.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez 
>>>>> wrote:
>>>>>
>>>>>> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <
>>>>>> rgowd...@redhat.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Raghavendra,
>>>>>>>>
>>>>>>>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
>>>>>>>> rgowd...@redhat.com> wrote:
>>>>>>>>
>>>>>>>>> All,
>>>>>>>>>
>>>>>>>>> Glusterfs cleans up POSIX locks held on an fd when the
>>>>>>>>> client/mount through which those locks are held disconnects from
>>>>>>>>> bricks/server. This helps Glusterfs to not run into a stale lock 
>>>>>>>>> problem
>>>>>>>>> later (For eg., if application unlocks while the connection was still
>>>>>>>>> down). However, this means the lock is no longer exclusive as other
>>>>>>>>> applications/clients can acquire the same lock. To communicate that 
>>>>>>>>> locks
>>>>>>>>> are no longer valid, we are planning to mark the fd (which has POSIX 
>>>>>>>>> locks)
>>>>>>>>> bad on a disconnect so that any future operations on that fd will 
>>>>>>>>> fail,
>>>>>>>>> forcing the application to re-open the fd and re-acquire locks it 
>>>>>>>>> needs [1].
>>>>>>>>>
>>>>>>>>
>>>>>>>> Wouldn't it be better to retake the locks when the brick is
>>>>>>>> reconnected if the lock is still in use ?
>>>>>>>>
>>>>>>>
>>>>>>> There is also  a possibility that clients may never reconnect.
>>>>>>> That's the primary reason why bricks assume the worst (client will not
>>>>>>> reconnect) and cleanup the locks.
>>>>>>>
>>>>>>
>>>>>> True, so it's fine to cleanup the locks. I'm not saying that locks
>>>>>> shouldn't be released on disconnect. The assumption is that if the client
>>>>>> has really died, it will also disconnect from other bricks, who will
>>>>>> release the locks. So, eventually, another client will have enough quorum
>>>>>> to attempt a lock that will succeed. In other words, if a client gets
>>>>>> disconnected from too many bricks simultaneously (loses Quorum), then 
>>>>>> that
>>>>>> client can be considered as bad and can return errors to the application.
>>>>>> This should also cause to release the locks on the remaining connected
>>>>>> bricks.
>>>>>>
>>>>>> On the other hand, if the disconnection is very short and the client
>>>>>> has not died, it will keep enough locked files (it has quorum) to avoid
>>>>>> other clients to successfully acquire a lock. In this case, if the brick 
>>>>>> is
>>>>>> reconnected, all existing locks should be reacquired to recover the
>>>>>> original state before the disconnection.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> BTW, the referenced bug is not public. Should we open another bug
>>>>>>>> to track this ?
>>>>>>>>
>>>>>>>
>>>>>>> I've just opened up the comment to give enough context. I'll open a
>>>>>>> bug upstream too.
>>>>>>>
>>>>>>>
>>>

Re: [Gluster-devel] [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Xavi Hernandez
On Wed, 27 Mar 2019, 18:26 Pranith Kumar Karampuri, 
wrote:

>
>
> On Wed, Mar 27, 2019 at 8:38 PM Xavi Hernandez 
> wrote:
>
>> On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri <
>> pkara...@redhat.com> wrote:
>>
>>>
>>>
>>> On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez 
>>> wrote:
>>>
>>>> On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri <
>>>> pkara...@redhat.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez 
>>>>> wrote:
>>>>>
>>>>>> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <
>>>>>> rgowd...@redhat.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Raghavendra,
>>>>>>>>
>>>>>>>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
>>>>>>>> rgowd...@redhat.com> wrote:
>>>>>>>>
>>>>>>>>> All,
>>>>>>>>>
>>>>>>>>> Glusterfs cleans up POSIX locks held on an fd when the
>>>>>>>>> client/mount through which those locks are held disconnects from
>>>>>>>>> bricks/server. This helps Glusterfs to not run into a stale lock 
>>>>>>>>> problem
>>>>>>>>> later (For eg., if application unlocks while the connection was still
>>>>>>>>> down). However, this means the lock is no longer exclusive as other
>>>>>>>>> applications/clients can acquire the same lock. To communicate that 
>>>>>>>>> locks
>>>>>>>>> are no longer valid, we are planning to mark the fd (which has POSIX 
>>>>>>>>> locks)
>>>>>>>>> bad on a disconnect so that any future operations on that fd will 
>>>>>>>>> fail,
>>>>>>>>> forcing the application to re-open the fd and re-acquire locks it 
>>>>>>>>> needs [1].
>>>>>>>>>
>>>>>>>>
>>>>>>>> Wouldn't it be better to retake the locks when the brick is
>>>>>>>> reconnected if the lock is still in use ?
>>>>>>>>
>>>>>>>
>>>>>>> There is also  a possibility that clients may never reconnect.
>>>>>>> That's the primary reason why bricks assume the worst (client will not
>>>>>>> reconnect) and cleanup the locks.
>>>>>>>
>>>>>>
>>>>>> True, so it's fine to cleanup the locks. I'm not saying that locks
>>>>>> shouldn't be released on disconnect. The assumption is that if the client
>>>>>> has really died, it will also disconnect from other bricks, who will
>>>>>> release the locks. So, eventually, another client will have enough quorum
>>>>>> to attempt a lock that will succeed. In other words, if a client gets
>>>>>> disconnected from too many bricks simultaneously (loses Quorum), then 
>>>>>> that
>>>>>> client can be considered as bad and can return errors to the application.
>>>>>> This should also cause to release the locks on the remaining connected
>>>>>> bricks.
>>>>>>
>>>>>> On the other hand, if the disconnection is very short and the client
>>>>>> has not died, it will keep enough locked files (it has quorum) to avoid
>>>>>> other clients to successfully acquire a lock. In this case, if the brick 
>>>>>> is
>>>>>> reconnected, all existing locks should be reacquired to recover the
>>>>>> original state before the disconnection.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> BTW, the referenced bug is not public. Should we open another bug
>>>>>>>> to track this ?
>>>>>>>>
>>>>>>>
>>>>>>> I've just opened up the comment to give enough context. I'll open a
>>>>>>> bug upstream too.
>>>>>>>
>>>>>>>
>>>

Re: [Gluster-devel] [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Xavi Hernandez
On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri 
wrote:

>
>
> On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez 
> wrote:
>
>> On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri <
>> pkara...@redhat.com> wrote:
>>
>>>
>>>
>>> On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez 
>>> wrote:
>>>
>>>> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <
>>>> rgowd...@redhat.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
>>>>> wrote:
>>>>>
>>>>>> Hi Raghavendra,
>>>>>>
>>>>>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
>>>>>> rgowd...@redhat.com> wrote:
>>>>>>
>>>>>>> All,
>>>>>>>
>>>>>>> Glusterfs cleans up POSIX locks held on an fd when the client/mount
>>>>>>> through which those locks are held disconnects from bricks/server. This
>>>>>>> helps Glusterfs to not run into a stale lock problem later (For eg., if
>>>>>>> application unlocks while the connection was still down). However, this
>>>>>>> means the lock is no longer exclusive as other applications/clients can
>>>>>>> acquire the same lock. To communicate that locks are no longer valid, we
>>>>>>> are planning to mark the fd (which has POSIX locks) bad on a disconnect 
>>>>>>> so
>>>>>>> that any future operations on that fd will fail, forcing the 
>>>>>>> application to
>>>>>>> re-open the fd and re-acquire locks it needs [1].
>>>>>>>
>>>>>>
>>>>>> Wouldn't it be better to retake the locks when the brick is
>>>>>> reconnected if the lock is still in use ?
>>>>>>
>>>>>
>>>>> There is also  a possibility that clients may never reconnect. That's
>>>>> the primary reason why bricks assume the worst (client will not reconnect)
>>>>> and cleanup the locks.
>>>>>
>>>>
>>>> True, so it's fine to cleanup the locks. I'm not saying that locks
>>>> shouldn't be released on disconnect. The assumption is that if the client
>>>> has really died, it will also disconnect from other bricks, who will
>>>> release the locks. So, eventually, another client will have enough quorum
>>>> to attempt a lock that will succeed. In other words, if a client gets
>>>> disconnected from too many bricks simultaneously (loses Quorum), then that
>>>> client can be considered as bad and can return errors to the application.
>>>> This should also cause to release the locks on the remaining connected
>>>> bricks.
>>>>
>>>> On the other hand, if the disconnection is very short and the client
>>>> has not died, it will keep enough locked files (it has quorum) to avoid
>>>> other clients to successfully acquire a lock. In this case, if the brick is
>>>> reconnected, all existing locks should be reacquired to recover the
>>>> original state before the disconnection.
>>>>
>>>>
>>>>>
>>>>>> BTW, the referenced bug is not public. Should we open another bug to
>>>>>> track this ?
>>>>>>
>>>>>
>>>>> I've just opened up the comment to give enough context. I'll open a
>>>>> bug upstream too.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Note that with AFR/replicate in picture we can prevent errors to
>>>>>>> application as long as Quorum number of children "never ever" lost
>>>>>>> connection with bricks after locks have been acquired. I am using the 
>>>>>>> term
>>>>>>> "never ever" as locks are not healed back after re-connection and hence
>>>>>>> first disconnect would've marked the fd bad and the fd remains so even
>>>>>>> after re-connection happens. So, its not just Quorum number of children
>>>>>>> "currently online", but Quorum number of children "never having
>>>>>>> disconnected with bricks after locks are acquired".
>>>>>>>
>>>>>>

Re: [Gluster-devel] [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Xavi Hernandez
On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri 
wrote:

>
>
> On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez 
> wrote:
>
>> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <
>> rgowd...@redhat.com> wrote:
>>
>>>
>>>
>>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
>>> wrote:
>>>
>>>> Hi Raghavendra,
>>>>
>>>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
>>>> rgowd...@redhat.com> wrote:
>>>>
>>>>> All,
>>>>>
>>>>> Glusterfs cleans up POSIX locks held on an fd when the client/mount
>>>>> through which those locks are held disconnects from bricks/server. This
>>>>> helps Glusterfs to not run into a stale lock problem later (For eg., if
>>>>> application unlocks while the connection was still down). However, this
>>>>> means the lock is no longer exclusive as other applications/clients can
>>>>> acquire the same lock. To communicate that locks are no longer valid, we
>>>>> are planning to mark the fd (which has POSIX locks) bad on a disconnect so
>>>>> that any future operations on that fd will fail, forcing the application 
>>>>> to
>>>>> re-open the fd and re-acquire locks it needs [1].
>>>>>
>>>>
>>>> Wouldn't it be better to retake the locks when the brick is reconnected
>>>> if the lock is still in use ?
>>>>
>>>
>>> There is also  a possibility that clients may never reconnect. That's
>>> the primary reason why bricks assume the worst (client will not reconnect)
>>> and cleanup the locks.
>>>
>>
>> True, so it's fine to cleanup the locks. I'm not saying that locks
>> shouldn't be released on disconnect. The assumption is that if the client
>> has really died, it will also disconnect from other bricks, who will
>> release the locks. So, eventually, another client will have enough quorum
>> to attempt a lock that will succeed. In other words, if a client gets
>> disconnected from too many bricks simultaneously (loses Quorum), then that
>> client can be considered as bad and can return errors to the application.
>> This should also cause to release the locks on the remaining connected
>> bricks.
>>
>> On the other hand, if the disconnection is very short and the client has
>> not died, it will keep enough locked files (it has quorum) to avoid other
>> clients to successfully acquire a lock. In this case, if the brick is
>> reconnected, all existing locks should be reacquired to recover the
>> original state before the disconnection.
>>
>>
>>>
>>>> BTW, the referenced bug is not public. Should we open another bug to
>>>> track this ?
>>>>
>>>
>>> I've just opened up the comment to give enough context. I'll open a bug
>>> upstream too.
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> Note that with AFR/replicate in picture we can prevent errors to
>>>>> application as long as Quorum number of children "never ever" lost
>>>>> connection with bricks after locks have been acquired. I am using the term
>>>>> "never ever" as locks are not healed back after re-connection and hence
>>>>> first disconnect would've marked the fd bad and the fd remains so even
>>>>> after re-connection happens. So, its not just Quorum number of children
>>>>> "currently online", but Quorum number of children "never having
>>>>> disconnected with bricks after locks are acquired".
>>>>>
>>>>
>>>> I think this requisite is not feasible. In a distributed file system,
>>>> sooner or later all bricks will be disconnected. It could be because of
>>>> failures or because an upgrade is done, but it will happen.
>>>>
>>>> The difference here is how long are fd's kept open. If applications
>>>> open and close files frequently enough (i.e. the fd is not kept open more
>>>> time than it takes to have more than Quorum bricks disconnected) then
>>>> there's no problem. The problem can only appear on applications that open
>>>> files for a long time and also use posix locks. In this case, the only good
>>>> solution I see is to retake the locks on brick reconnection.
>>>>
>>>
>>> Agree. But lock-healing should be done only by HA layers like AFR

Re: [Gluster-devel] [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Xavi Hernandez
On Wed, Mar 27, 2019 at 11:54 AM Raghavendra Gowdappa 
wrote:

>
>
> On Wed, Mar 27, 2019 at 4:22 PM Raghavendra Gowdappa 
> wrote:
>
>>
>>
>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
>> wrote:
>>
>>> Hi Raghavendra,
>>>
>>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
>>> rgowd...@redhat.com> wrote:
>>>
>>>> All,
>>>>
>>>> Glusterfs cleans up POSIX locks held on an fd when the client/mount
>>>> through which those locks are held disconnects from bricks/server. This
>>>> helps Glusterfs to not run into a stale lock problem later (For eg., if
>>>> application unlocks while the connection was still down). However, this
>>>> means the lock is no longer exclusive as other applications/clients can
>>>> acquire the same lock. To communicate that locks are no longer valid, we
>>>> are planning to mark the fd (which has POSIX locks) bad on a disconnect so
>>>> that any future operations on that fd will fail, forcing the application to
>>>> re-open the fd and re-acquire locks it needs [1].
>>>>
>>>
>>> Wouldn't it be better to retake the locks when the brick is reconnected
>>> if the lock is still in use ?
>>>
>>
>> There is also  a possibility that clients may never reconnect. That's the
>> primary reason why bricks assume the worst (client will not reconnect) and
>> cleanup the locks.
>>
>>
>>> BTW, the referenced bug is not public. Should we open another bug to
>>> track this ?
>>>
>>
>> I've just opened up the comment to give enough context. I'll open a bug
>> upstream too.
>>
>>
>>>
>>>
>>>>
>>>> Note that with AFR/replicate in picture we can prevent errors to
>>>> application as long as Quorum number of children "never ever" lost
>>>> connection with bricks after locks have been acquired. I am using the term
>>>> "never ever" as locks are not healed back after re-connection and hence
>>>> first disconnect would've marked the fd bad and the fd remains so even
>>>> after re-connection happens. So, its not just Quorum number of children
>>>> "currently online", but Quorum number of children "never having
>>>> disconnected with bricks after locks are acquired".
>>>>
>>>
>>> I think this requisite is not feasible. In a distributed file system,
>>> sooner or later all bricks will be disconnected. It could be because of
>>> failures or because an upgrade is done, but it will happen.
>>>
>>> The difference here is how long are fd's kept open. If applications open
>>> and close files frequently enough (i.e. the fd is not kept open more time
>>> than it takes to have more than Quorum bricks disconnected) then there's no
>>> problem. The problem can only appear on applications that open files for a
>>> long time and also use posix locks. In this case, the only good solution I
>>> see is to retake the locks on brick reconnection.
>>>
>>
>> Agree. But lock-healing should be done only by HA layers like AFR/EC as
>> only they know whether there are enough online bricks to have prevented any
>> conflicting lock. Protocol/client itself doesn't have enough information to
>> do that. If its a plain distribute, I don't see a way to heal locks without
>> losing the property of exclusivity of locks.
>>
>> What I proposed is a short term solution. mid to long term solution
>> should be lock healing feature implemented in AFR/EC. In fact I had this
>> conversation with +Karampuri, Pranith  before
>> posting this msg to ML.
>>
>>
>>>
>>>> However, this use case is not affected if the application don't acquire
>>>> any POSIX locks. So, I am interested in knowing
>>>> * whether your use cases use POSIX locks?
>>>> * Is it feasible for your application to re-open fds and re-acquire
>>>> locks on seeing EBADFD errors?
>>>>
>>>
>>> I think that many applications are not prepared to handle that.
>>>
>>
>> I too suspected that and in fact not too happy with the solution. But
>> went ahead with this mail as I heard implementing lock-heal  in AFR will
>> take time and hence there are no alternative short term solutions.
>>
>
> Also failing loudly is preferred to silently dropping locks.
>

Yes. Silently dropping locks can cause corruption, which is worse. However,
causing application failures doesn't improve the user experience either.

Unfortunately I'm not aware of any other short term solution right now.


>
>>
>>
>>> Xavi
>>>
>>>
>>>>
>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7
>>>>
>>>> regards,
>>>> Raghavendra
>>>>
>>>> ___
>>>> Gluster-users mailing list
>>>> gluster-us...@gluster.org
>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Xavi Hernandez
On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa 
wrote:

>
>
> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
> wrote:
>
>> Hi Raghavendra,
>>
>> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa 
>> wrote:
>>
>>> All,
>>>
>>> Glusterfs cleans up POSIX locks held on an fd when the client/mount
>>> through which those locks are held disconnects from bricks/server. This
>>> helps Glusterfs to not run into a stale lock problem later (For eg., if
>>> application unlocks while the connection was still down). However, this
>>> means the lock is no longer exclusive as other applications/clients can
>>> acquire the same lock. To communicate that locks are no longer valid, we
>>> are planning to mark the fd (which has POSIX locks) bad on a disconnect so
>>> that any future operations on that fd will fail, forcing the application to
>>> re-open the fd and re-acquire locks it needs [1].
>>>
>>
>> Wouldn't it be better to retake the locks when the brick is reconnected
>> if the lock is still in use ?
>>
>
> There is also  a possibility that clients may never reconnect. That's the
> primary reason why bricks assume the worst (client will not reconnect) and
> cleanup the locks.
>

True, so it's fine to clean up the locks. I'm not saying that locks
shouldn't be released on disconnect. The assumption is that if the client
has really died, it will also disconnect from the other bricks, which will
release the locks. So, eventually, another client will have enough quorum
to attempt a lock that will succeed. In other words, if a client gets
disconnected from too many bricks simultaneously (loses quorum), that
client can be considered bad and can return errors to the application.
This should also cause the locks on the remaining connected bricks to be
released.

On the other hand, if the disconnection is very short and the client has
not died, it will keep enough locked files (it has quorum) to prevent other
clients from successfully acquiring a lock. In this case, if the brick is
reconnected, all existing locks should be reacquired to recover the
original state from before the disconnection.
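
To make the quorum reasoning above concrete, here is a minimal sketch (the
structure and the simple-majority rule are assumptions for illustration,
not GlusterFS code):

/* Hypothetical helper: decide whether locks held through a client can
 * still be trusted, based on how many children kept their connection
 * since the locks were taken. */
#include <stdbool.h>

struct lock_quorum_state {
    int children_total;      /* e.g. 3 for a replica 3 subvolume */
    int children_kept;       /* children with no disconnect since locking */
};

static bool locks_still_valid(const struct lock_quorum_state *st)
{
    /* Simple majority is assumed here; the real quorum rule is
     * configurable in AFR/EC. */
    int quorum = st->children_total / 2 + 1;

    return st->children_kept >= quorum;
}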


>
>> BTW, the referenced bug is not public. Should we open another bug to
>> track this ?
>>
>
> I've just opened up the comment to give enough context. I'll open a bug
> upstream too.
>
>
>>
>>
>>>
>>> Note that with AFR/replicate in picture we can prevent errors to
>>> application as long as Quorum number of children "never ever" lost
>>> connection with bricks after locks have been acquired. I am using the term
>>> "never ever" as locks are not healed back after re-connection and hence
>>> first disconnect would've marked the fd bad and the fd remains so even
>>> after re-connection happens. So, its not just Quorum number of children
>>> "currently online", but Quorum number of children "never having
>>> disconnected with bricks after locks are acquired".
>>>
>>
>> I think this requisite is not feasible. In a distributed file system,
>> sooner or later all bricks will be disconnected. It could be because of
>> failures or because an upgrade is done, but it will happen.
>>
>> The difference here is how long are fd's kept open. If applications open
>> and close files frequently enough (i.e. the fd is not kept open more time
>> than it takes to have more than Quorum bricks disconnected) then there's no
>> problem. The problem can only appear on applications that open files for a
>> long time and also use posix locks. In this case, the only good solution I
>> see is to retake the locks on brick reconnection.
>>
>
> Agree. But lock-healing should be done only by HA layers like AFR/EC as
> only they know whether there are enough online bricks to have prevented any
> conflicting lock. Protocol/client itself doesn't have enough information to
> do that. If its a plain distribute, I don't see a way to heal locks without
> losing the property of exclusivity of locks.
>

Lock-healing of locks acquired while a brick was disconnected needs to be
handled by AFR/EC. However, locks already present at the moment of the
disconnection could be recovered by the client xlator itself as long as the
file has not been closed (which the client xlator already knows).
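
As a purely hypothetical sketch of that recovery path (all types and
helpers below are invented for illustration and do not reflect the client
xlator's real data structures):

/* On reconnection, replay the locks still remembered for every fd that
 * has not been closed. */
#include <stdbool.h>
#include <stddef.h>

struct saved_lock {
    struct saved_lock *next;
    long long          start;
    long long          len;
    int                type;          /* read or write lock */
};

struct tracked_fd {
    struct tracked_fd *next;
    bool               open;          /* closed fds lost their locks anyway */
    struct saved_lock *locks;
};

/* Stand-in for re-sending one lock request to the brick. */
static int resend_lock(struct tracked_fd *fd, struct saved_lock *lk)
{
    (void)fd;
    (void)lk;
    return 0;
}

static void reacquire_locks_on_reconnect(struct tracked_fd *fds)
{
    for (struct tracked_fd *fd = fds; fd != NULL; fd = fd->next) {
        if (!fd->open)
            continue;
        for (struct saved_lock *lk = fd->locks; lk != NULL; lk = lk->next) {
            if (resend_lock(fd, lk) != 0) {
                /* If a lock cannot be recovered, the fd would have to be
                 * marked bad, as discussed above. */
                break;
            }
        }
    }
}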

Xavi


> What I proposed is a short term solution. mid to long term solution should
> be lock healing feature implemented in AFR/EC. In fact I had this
> conversation with +Karampuri, Pranith  before
> posting this msg to ML.
>
>
>>
>>> However, this use case is not affected if the application don't acquire

Re: [Gluster-devel] [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-27 Thread Xavi Hernandez
Hi Raghavendra,

On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa 
wrote:

> All,
>
> Glusterfs cleans up POSIX locks held on an fd when the client/mount
> through which those locks are held disconnects from bricks/server. This
> helps Glusterfs to not run into a stale lock problem later (For eg., if
> application unlocks while the connection was still down). However, this
> means the lock is no longer exclusive as other applications/clients can
> acquire the same lock. To communicate that locks are no longer valid, we
> are planning to mark the fd (which has POSIX locks) bad on a disconnect so
> that any future operations on that fd will fail, forcing the application to
> re-open the fd and re-acquire locks it needs [1].
>

Wouldn't it be better to retake the locks when the brick is reconnected if
the lock is still in use ?

BTW, the referenced bug is not public. Should we open another bug to track
this ?


>
> Note that with AFR/replicate in picture we can prevent errors to
> application as long as Quorum number of children "never ever" lost
> connection with bricks after locks have been acquired. I am using the term
> "never ever" as locks are not healed back after re-connection and hence
> first disconnect would've marked the fd bad and the fd remains so even
> after re-connection happens. So, its not just Quorum number of children
> "currently online", but Quorum number of children "never having
> disconnected with bricks after locks are acquired".
>

I think this requisite is not feasible. In a distributed file system,
sooner or later all bricks will be disconnected. It could be because of
failures or because an upgrade is performed, but it will happen.

The difference here is how long fds are kept open. If applications open
and close files frequently enough (i.e. the fd is not kept open longer
than it takes for more than a quorum of bricks to disconnect), then there's
no problem. The problem can only appear with applications that keep files
open for a long time and also use POSIX locks. In this case, the only good
solution I see is to retake the locks on brick reconnection.


> However, this use case is not affected if the application don't acquire
> any POSIX locks. So, I am interested in knowing
> * whether your use cases use POSIX locks?
> * Is it feasible for your application to re-open fds and re-acquire locks
> on seeing EBADFD errors?
>

I think that many applications are not prepared to handle that.
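
For reference, this is roughly the recovery an application would need to
implement when an I/O call fails because the fd was marked bad (a minimal
sketch; error handling and the exact errno reported by the mount are
simplified):

/* Close the bad fd, re-open the file and re-acquire the POSIX lock before
 * retrying the operation. Data protected by the lock may have been
 * changed by another client in the meantime, so the application may also
 * need to re-validate its own state. */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

static int reopen_and_relock(const char *path, int *fd, off_t start, off_t len)
{
    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = start,
        .l_len    = len,
    };

    if (*fd >= 0)
        close(*fd);

    *fd = open(path, O_RDWR);
    if (*fd < 0)
        return -1;

    return fcntl(*fd, F_SETLKW, &fl);
}

/* A caller would check for the bad-fd error on each I/O, for example:
 *
 *     if (write(fd, buf, size) < 0 && errno == EBADF)   // or EBADFD
 *         if (reopen_and_relock(path, &fd, 0, 0) == 0)
 *             ... retry the write ...
 */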

Xavi


>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7
>
> regards,
> Raghavendra
>
> ___
> Gluster-users mailing list
> gluster-us...@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] I/O performance

2019-02-13 Thread Xavi Hernandez
Here are the results of the last run:
https://docs.google.com/spreadsheets/d/19JqvuFKZxKifgrhLF-5-bgemYj8XKldUox1QwsmGj2k/edit?usp=sharing

Each test has been run with a rough approximation of the best configuration
I've found (in number of client and brick threads), but I haven't done an
exhaustive search of the best configuration in each case.

The "fio rand write" test seems to have a big regression. An initial check
of the data shows that 2 of the 5 runs have taken > 50% more time. I'll try
to check why.

Many of the tests show a very high disk utilization, so comparisons may not
be accurate. In any case it's clear that we need a method to automatically
adjust the number of worker threads to the given load to make this useful.
Without that it's virtually impossible to find a fixed number of threads
that will work fine in all cases. I'm currently working on this.

Xavi

On Wed, Feb 13, 2019 at 11:34 AM Xavi Hernandez 
wrote:

> On Tue, Feb 12, 2019 at 1:30 AM Vijay Bellur  wrote:
>
>>
>>
>> On Tue, Feb 5, 2019 at 10:57 PM Xavi Hernandez 
>> wrote:
>>
>>> On Wed, Feb 6, 2019 at 7:00 AM Poornima Gurusiddaiah <
>>> pguru...@redhat.com> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Feb 5, 2019, 10:53 PM Xavi Hernandez wrote:
>>>>
>>>>> On Fri, Feb 1, 2019 at 1:51 PM Xavi Hernandez 
>>>>> wrote:
>>>>>
>>>>>> On Fri, Feb 1, 2019 at 1:25 PM Poornima Gurusiddaiah <
>>>>>> pguru...@redhat.com> wrote:
>>>>>>
>>>>>>> Can the threads be categorised to do certain kinds of fops?
>>>>>>>
>>>>>>
>>>>>> Could be, but creating multiple thread groups for different tasks is
>>>>>> generally bad because many times you end up with lots of idle threads 
>>>>>> which
>>>>>> waste resources and could increase contention. I think we should only
>>>>>> differentiate threads if it's absolutely necessary.
>>>>>>
>>>>>>
>>>>>>> Read/write affinitise to certain set of threads, the other metadata
>>>>>>> fops to other set of threads. So we limit the read/write threads and not
>>>>>>> the metadata threads? Also if aio is enabled in the backend the threads
>>>>>>> will not be blocked on disk IO right?
>>>>>>>
>>>>>>
>>>>>> If we don't block the thread but we don't prevent more requests to go
>>>>>> to the disk, then we'll probably have the same problem. Anyway, I'll try 
>>>>>> to
>>>>>> run some tests with AIO to see if anything changes.
>>>>>>
>>>>>
>>>>> I've run some simple tests with AIO enabled and results are not good.
>>>>> A simple dd takes >25% more time. Multiple parallel dd take 35% more time
>>>>> to complete.
>>>>>
>>>>
>>>>
>>>> Thank you. That is strange! Had few questions, what tests are you
>>>> running for measuring the io-threads performance (not particularly AIO)? is
>>>> it dd from multiple clients?
>>>>
>>>
>>> Yes, it's a bit strange. What I see is that many threads from the thread
>>> pool are active but using very little CPU. I also see an AIO thread for
>>> each brick, but its CPU usage is not big either. Wait time is always 0 (I
>>> think this is a side effect of AIO activity). However system load grows
>>> very high. I've seen around 50, while on the normal test without AIO it
>>> stays around 20-25.
>>>
>>> Right now I'm running the tests on a single machine (no real network
>>> communication) using an NVMe disk as storage. I use a single mount point.
>>> The tests I'm running are these:
>>>
>>>- Single dd, 128 GiB, blocks of 1MiB
>>>- 16 parallel dd, 8 GiB per dd, blocks of 1MiB
>>>- fio in sequential write mode, direct I/O, blocks of 128k, 16
>>>threads, 8GiB per file
>>>- fio in sequential read mode, direct I/O, blocks of 128k, 16
>>>threads, 8GiB per file
>>>- fio in random write mode, direct I/O, blocks of 128k, 16 threads,
>>>8GiB per file
>>>- fio in random read mode, direct I/O, blocks of 128k, 16 threads,
>>>8GiB per file
>>>- smallfile create, 16 threads, 256 files per thread, 32 MiB per
>>>file (with one brick down, for the following test)
>>>- self-heal of an entire brick (from the previous smallfile test)
>>>- pgbench init phase with scale 100
>>>
>>> I run all these tests for a replica 3 volume and a disperse 4+2 volume.
>>>
>>
>>
>> Are these performance results available somewhere? I am quite curious to
>> understand the performance gains on NVMe!
>>
>
> I'm updating test results with the latest build. I'll report it here once
> it's complete.
>
> Xavi
>
>>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] I/O performance

2019-02-05 Thread Xavi Hernandez
On Wed, Feb 6, 2019 at 7:00 AM Poornima Gurusiddaiah 
wrote:

>
>
> On Tue, Feb 5, 2019, 10:53 PM Xavi Hernandez wrote:
>> On Fri, Feb 1, 2019 at 1:51 PM Xavi Hernandez 
>> wrote:
>>
>>> On Fri, Feb 1, 2019 at 1:25 PM Poornima Gurusiddaiah <
>>> pguru...@redhat.com> wrote:
>>>
>>>> Can the threads be categorised to do certain kinds of fops?
>>>>
>>>
>>> Could be, but creating multiple thread groups for different tasks is
>>> generally bad because many times you end up with lots of idle threads which
>>> waste resources and could increase contention. I think we should only
>>> differentiate threads if it's absolutely necessary.
>>>
>>>
>>>> Read/write affinitise to certain set of threads, the other metadata
>>>> fops to other set of threads. So we limit the read/write threads and not
>>>> the metadata threads? Also if aio is enabled in the backend the threads
>>>> will not be blocked on disk IO right?
>>>>
>>>
>>> If we don't block the thread but we don't prevent more requests to go to
>>> the disk, then we'll probably have the same problem. Anyway, I'll try to
>>> run some tests with AIO to see if anything changes.
>>>
>>
>> I've run some simple tests with AIO enabled and results are not good. A
>> simple dd takes >25% more time. Multiple parallel dd take 35% more time to
>> complete.
>>
>
>
> Thank you. That is strange! Had few questions, what tests are you running
> for measuring the io-threads performance (not particularly AIO)? is it dd
> from multiple clients?
>

Yes, it's a bit strange. What I see is that many threads from the thread
pool are active but using very little CPU. I also see an AIO thread for
each brick, but its CPU usage is not big either. Wait time is always 0 (I
think this is a side effect of AIO activity). However system load grows
very high. I've seen around 50, while on the normal test without AIO it
stays around 20-25.

Right now I'm running the tests on a single machine (no real network
communication) using an NVMe disk as storage. I use a single mount point.
The tests I'm running are these:

   - Single dd, 128 GiB, blocks of 1MiB
   - 16 parallel dd, 8 GiB per dd, blocks of 1MiB
   - fio in sequential write mode, direct I/O, blocks of 128k, 16 threads,
   8GiB per file
   - fio in sequential read mode, direct I/O, blocks of 128k, 16 threads,
   8GiB per file
   - fio in random write mode, direct I/O, blocks of 128k, 16 threads, 8GiB
   per file
   - fio in random read mode, direct I/O, blocks of 128k, 16 threads, 8GiB
   per file
   - smallfile create, 16 threads, 256 files per thread, 32 MiB per file
   (with one brick down, for the following test)
   - self-heal of an entire brick (from the previous smallfile test)
   - pgbench init phase with scale 100

I run all these tests for a replica 3 volume and a disperse 4+2 volume.

Xavi


> Regards,
> Poornima
>
>
>> Xavi
>>
>>
>>>> All this is based on the assumption that large number of parallel read
>>>> writes make the disk perf bad but not the large number of dentry and
>>>> metadata ops. Is that true?
>>>>
>>>
>>> It depends. If metadata is not cached, it's as bad as a read or write
>>> since it requires a disk access (a clear example of this is the bad
>>> performance of 'ls' in cold cache, which is basically metadata reads). In
>>> fact, cached data reads are also very fast, and data writes could go to the
>>> cache and be updated later in background, so I think the important point is
>>> if things are cached or not, instead of if they are data or metadata. Since
>>> we don't have this information from the user side, it's hard to tell what's
>>> better. My opinion is that we shouldn't differentiate requests of
>>> data/metadata. If metadata requests happen to be faster, then that thread
>>> will be able to handle other requests immediately, which seems good enough.
>>>
>>> However there's one thing that I would do. I would differentiate reads
>>> (data or metadata) from writes. Normally writes come from cached
>>> information that is flushed to disk at some point, so this normally happens
>>> in the background. But reads tend to be in foreground, meaning that someone
>>> (user or application) is waiting for it. So I would give preference to
>>> reads over writes. To do so effectively, we need to not saturate the
>>> backend, otherwise when we need to send a read, it will still need to wait
>>> for all pending requests to complete. If disks are not saturated, we can
>>> hav

Re: [Gluster-devel] I/O performance

2019-02-05 Thread Xavi Hernandez
On Fri, Feb 1, 2019 at 1:51 PM Xavi Hernandez  wrote:

> On Fri, Feb 1, 2019 at 1:25 PM Poornima Gurusiddaiah 
> wrote:
>
>> Can the threads be categorised to do certain kinds of fops?
>>
>
> Could be, but creating multiple thread groups for different tasks is
> generally bad because many times you end up with lots of idle threads which
> waste resources and could increase contention. I think we should only
> differentiate threads if it's absolutely necessary.
>
>
>> Read/write affinitise to certain set of threads, the other metadata fops
>> to other set of threads. So we limit the read/write threads and not the
>> metadata threads? Also if aio is enabled in the backend the threads will
>> not be blocked on disk IO right?
>>
>
> If we don't block the thread but we don't prevent more requests to go to
> the disk, then we'll probably have the same problem. Anyway, I'll try to
> run some tests with AIO to see if anything changes.
>

I've run some simple tests with AIO enabled and results are not good. A
simple dd takes >25% more time. Multiple parallel dd take 35% more time to
complete.

Xavi


>> All this is based on the assumption that large number of parallel read
>> writes make the disk perf bad but not the large number of dentry and
>> metadata ops. Is that true?
>>
>
> It depends. If metadata is not cached, it's as bad as a read or write
> since it requires a disk access (a clear example of this is the bad
> performance of 'ls' in cold cache, which is basically metadata reads). In
> fact, cached data reads are also very fast, and data writes could go to the
> cache and be updated later in background, so I think the important point is
> if things are cached or not, instead of if they are data or metadata. Since
> we don't have this information from the user side, it's hard to tell what's
> better. My opinion is that we shouldn't differentiate requests of
> data/metadata. If metadata requests happen to be faster, then that thread
> will be able to handle other requests immediately, which seems good enough.
>
> However there's one thing that I would do. I would differentiate reads
> (data or metadata) from writes. Normally writes come from cached
> information that is flushed to disk at some point, so this normally happens
> in the background. But reads tend to be in foreground, meaning that someone
> (user or application) is waiting for it. So I would give preference to
> reads over writes. To do so effectively, we need to not saturate the
> backend, otherwise when we need to send a read, it will still need to wait
> for all pending requests to complete. If disks are not saturated, we can
> have the answer to the read quite fast, and then continue processing the
> remaining writes.
>
> Anyway, I may be wrong, since all these things depend on too many factors.
> I haven't done any specific tests about this. It's more like a
> brainstorming. As soon as I can I would like to experiment with this and
> get some empirical data.
>
> Xavi
>
>
>> Thanks,
>> Poornima
>>
>>
>> On Fri, Feb 1, 2019, 5:34 PM Emmanuel Dreyfus wrote:
>>> On Thu, Jan 31, 2019 at 10:53:48PM -0800, Vijay Bellur wrote:
>>> > Perhaps we could throttle both aspects - number of I/O requests per
>>> disk
>>>
>>> While there it would be nice to detect and report  a disk with lower than
>>> peer performance: that happen sometimes when a disk is dying, and last
>>> time I was hit by that performance problem, I had a hard time finding
>>> the culprit.
>>>
>>> --
>>> Emmanuel Dreyfus
>>> m...@netbsd.org
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] I/O performance

2019-02-01 Thread Xavi Hernandez
On Fri, Feb 1, 2019 at 1:25 PM Poornima Gurusiddaiah 
wrote:

> Can the threads be categorised to do certain kinds of fops?
>

Could be, but creating multiple thread groups for different tasks is
generally bad because many times you end up with lots of idle threads which
waste resources and could increase contention. I think we should only
differentiate threads if it's absolutely necessary.


> Read/write affinitise to certain set of threads, the other metadata fops
> to other set of threads. So we limit the read/write threads and not the
> metadata threads? Also if aio is enabled in the backend the threads will
> not be blocked on disk IO right?
>

If we don't block the thread but also don't prevent more requests from
going to the disk, we'll probably have the same problem. Anyway, I'll try
to run some tests with AIO to see if anything changes.

> All this is based on the assumption that large number of parallel read
> writes make the disk perf bad but not the large number of dentry and
> metadata ops. Is that true?
>

It depends. If metadata is not cached, it's as bad as a read or write,
since it requires a disk access (a clear example of this is the bad
performance of 'ls' with a cold cache, which is basically metadata reads).
In fact, cached data reads are also very fast, and data writes can go to
the cache and be flushed later in the background, so I think the important
point is whether things are cached or not, rather than whether they are
data or metadata. Since we don't have this information from the user side,
it's hard to tell what's better. My opinion is that we shouldn't
differentiate data requests from metadata requests. If metadata requests
happen to be faster, then that thread will be able to handle other requests
immediately, which seems good enough.

However, there's one thing that I would do: differentiate reads (data or
metadata) from writes. Writes normally come from cached information that is
flushed to disk at some point, so they usually happen in the background.
Reads, on the other hand, tend to be in the foreground, meaning that
someone (a user or an application) is waiting for them. So I would give
preference to reads over writes. To do so effectively, we must not saturate
the backend; otherwise, when we need to send a read, it will still have to
wait for all pending requests to complete. If disks are not saturated, we
can get the answer to the read quite fast, and then continue processing the
remaining writes.
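
A rough sketch of that dispatch policy (the queue and request types are
invented for illustration; this is not io-threads code):

/* Serve queued reads before queued writes, assuming writes are background
 * flushes and reads are foreground requests. */
#include <pthread.h>
#include <stddef.h>

struct io_request {
    struct io_request *next;
    /* fop details omitted */
};

struct io_queues {
    pthread_mutex_t    lock;
    pthread_cond_t     cond;
    struct io_request *reads;
    struct io_request *writes;
};

static struct io_request *pop(struct io_request **list)
{
    struct io_request *req = *list;

    *list = req->next;
    return req;
}

static struct io_request *next_request(struct io_queues *q)
{
    struct io_request *req;

    pthread_mutex_lock(&q->lock);
    while (q->reads == NULL && q->writes == NULL)
        pthread_cond_wait(&q->cond, &q->lock);

    /* Reads first: someone is waiting for them. Writes are picked only
     * when no read is pending. */
    req = (q->reads != NULL) ? pop(&q->reads) : pop(&q->writes);
    pthread_mutex_unlock(&q->lock);

    return req;
}

A real implementation would also need some bound so that a steady stream of
reads cannot starve writes forever.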

Anyway, I may be wrong, since all these things depend on too many factors.
I haven't done any specific tests about this; it's more like brainstorming.
As soon as I can, I would like to experiment with this and get some
empirical data.

Xavi


> Thanks,
> Poornima
>
>
> On Fri, Feb 1, 2019, 5:34 PM Emmanuel Dreyfus wrote:
>> On Thu, Jan 31, 2019 at 10:53:48PM -0800, Vijay Bellur wrote:
>> > Perhaps we could throttle both aspects - number of I/O requests per disk
>>
>> While there it would be nice to detect and report  a disk with lower than
>> peer performance: that happen sometimes when a disk is dying, and last
>> time I was hit by that performance problem, I had a hard time finding
>> the culprit.
>>
>> --
>> Emmanuel Dreyfus
>> m...@netbsd.org
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] I/O performance

2019-01-31 Thread Xavi Hernandez
On Fri, Feb 1, 2019 at 7:54 AM Vijay Bellur  wrote:

>
>
> On Thu, Jan 31, 2019 at 10:01 AM Xavi Hernandez 
> wrote:
>
>> Hi,
>>
>> I've been doing some tests with the global thread pool [1], and I've
>> observed one important thing:
>>
>> Since this new thread pool has very low contention (apparently), it
>> exposes other problems when the number of threads grows. What I've seen is
>> that some workloads use all available threads on bricks to do I/O, causing
>> avgload to grow rapidly and saturating the machine (or it seems so), which
>> really makes everything slower. Reducing the maximum number of threads
>> improves performance actually. Other workloads, though, do little I/O
>> (probably most is locking or smallfile operations). In this case limiting
>> the number of threads to a small value causes a performance reduction. To
>> increase performance we need more threads.
>>
>> So this is making me think that maybe we should implement some sort of
>> I/O queue with a maximum I/O depth for each brick (or disk if bricks share
>> same disk). This way we can limit the amount of requests physically
>> accessing the underlying FS concurrently, without actually limiting the
>> number of threads that can be doing other things on each brick. I think
>> this could improve performance.
>>
>
> Perhaps we could throttle both aspects - number of I/O requests per disk
> and the number of threads too?  That way we will have the ability to behave
> well when there is bursty I/O to the same disk and when there are multiple
> concurrent requests to different disks. Do you have a reason to not limit
> the number of threads?
>

No, in fact the global thread pool does have a limit on the number of
threads. I'm not proposing to replace the thread limit with I/O depth
control; I think we need both. We need to clearly identify which threads
are doing I/O and limit them, even if there are more threads available. The
reason is simple: suppose we have a fixed number of threads. If heavy load
is sent in parallel, it's quite possible that all threads get blocked doing
some I/O. This has two consequences:

   1. There are no more threads to execute other things, like sending
   answers to the client or starting to process new incoming requests, so
   the CPU is underutilized.
   2. Massively parallel access to an FS actually decreases performance.

This means that we can do less work and this work takes more time, which is
bad.

If we limit the number of threads that can actually be doing FS I/O, it's
easy to keep FS responsive and we'll still have more threads to do other
work.


>
>> Maybe this approach could also be useful in client side, but I think it's
>> not so critical there.
>>
>
> Agree, rate limiting on the server side would be more appropriate.
>

The only thing to consider here is that if we limit the rate on servers but
clients can generate requests without limit, we may need lots of memory to
track all ongoing requests. Anyway, I think this is not the most important
thing now, so if we solve the server-side problem, then we can check
whether this is really needed or not (it could happen that client
applications limit themselves automatically because they will be waiting
for answers from the server before sending more requests, unless the number
of applications running concurrently is really huge).

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] I/O performance

2019-01-31 Thread Xavi Hernandez
Hi,

I've been doing some tests with the global thread pool [1], and I've
observed one important thing:

Since this new thread pool has very low contention (apparently), it exposes
other problems when the number of threads grows. What I've seen is that
some workloads use all available threads on bricks to do I/O, causing
avgload to grow rapidly and saturating the machine (or it seems so), which
really makes everything slower. Reducing the maximum number of threads
actually improves performance. Other workloads, though, do little I/O
(probably most of it is locking or smallfile operations). In this case,
limiting the number of threads to a small value causes a performance
reduction. To increase performance we need more threads.

So this is making me think that maybe we should implement some sort of I/O
queue with a maximum I/O depth for each brick (or disk, if bricks share the
same disk). This way we can limit the number of requests physically
accessing the underlying FS concurrently, without actually limiting the
number of threads that can be doing other things on each brick. I think
this could improve performance.
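
A minimal sketch of the idea (hypothetical, not GlusterFS code; the depth
value and function names are made up):

/* Cap the number of requests that can touch the backend FS at the same
 * time, while leaving the rest of the threads free for other work. */
#include <semaphore.h>
#include <unistd.h>

#define MAX_IO_DEPTH 8                /* assumed per-brick/per-disk limit */

static sem_t io_slots;

static void io_depth_init(void)
{
    sem_init(&io_slots, 0, MAX_IO_DEPTH);
}

static ssize_t brick_pread(int fd, void *buf, size_t size, off_t off)
{
    ssize_t ret;

    sem_wait(&io_slots);              /* blocks only when the disk is busy */
    ret = pread(fd, buf, size, off);
    sem_post(&io_slots);

    return ret;
}

The same wrapper would apply to writes and to any metadata calls that
actually hit the disk.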

Maybe this approach could also be useful on the client side, but I think
it's not so critical there.

What do you think ?

Xavi

[1] https://review.gluster.org/c/glusterfs/+/20636
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Performance improvements

2019-01-31 Thread Xavi Hernandez
On Sun, Jan 27, 2019 at 8:03 AM Xavi Hernandez 
wrote:

> On Fri, 25 Jan 2019, 08:53 Vijay Bellur wrote:
>> Thank you for the detailed update, Xavi! This looks very interesting.
>>
>> On Thu, Jan 24, 2019 at 7:50 AM Xavi Hernandez 
>> wrote:
>>
>>> Hi all,
>>>
>>> I've just updated a patch [1] that implements a new thread pool based on
>>> a wait-free queue provided by userspace-rcu library. The patch also
>>> includes an auto scaling mechanism that only keeps running the needed
>>> amount of threads for the current workload.
>>>
>>> This new approach has some advantages:
>>>
>>>- It's provided globally inside libglusterfs instead of inside an
>>>xlator
>>>
>>> This makes it possible that fuse thread and epoll threads transfer the
>> received request to another thread sooner, wasting less CPU and reacting
>>> sooner to other incoming requests.
>>>
>>>
>>>- Adding jobs to the queue used by the thread pool only requires an
>>>atomic operation
>>>
>>> This makes the producer side of the queue really fast, almost with no
>>> delay.
>>>
>>>
>>>- Contention is reduced
>>>
>>> The producer side has negligible contention thanks to the wait-free
>>> enqueue operation based on an atomic access. The consumer side requires a
>>> mutex, but the duration is very small and the scaling mechanism makes sure
>>> that there are no more threads than needed contending for the mutex.
>>>
>>>
>>> This change disables io-threads, since it replaces part of its
>>> functionality. However there are two things that could be needed from
>>> io-threads:
>>>
>>>- Prioritization of fops
>>>
>>> Currently, io-threads assigns priorities to each fop, so that some fops
>>> are handled before than others.
>>>
>>>
>>>- Fair distribution of execution slots between clients
>>>
>>> Currently, io-threads processes requests from each client in round-robin.
>>>
>>>
>>> These features are not implemented right now. If they are needed,
>>> probably the best thing to do would be to keep them inside io-threads, but
>>> change its implementation so that it uses the global threads from the
>>> thread pool instead of its own threads.
>>>
>>
>>
>> These features are indeed useful to have and hence modifying the
>> implementation of io-threads to provide this behavior would be welcome.
>>
>>
>>
>>>
>>>
>>> These tests have shown that the limiting factor has been the disk in
>>> most cases, so it's hard to tell if the change has really improved things.
>>> There is only one clear exception: self-heal on a dispersed volume
>>> completes 12.7% faster. The utilization of CPU has also dropped drastically:
>>>
>>> Old implementation: 12.30 user, 41.78 sys, 43.16 idle,  0.73 wait
>>>
>>> New implementation: 4.91 user,  5.52 sys, 81.60 idle,  5.91 wait
>>>
>>>
>>> Now I'm running some more tests on NVMe to try to see the effects of the
>>> change when disk is not limiting performance. I'll update once I've more
>>> data.
>>>
>>>
>> Will look forward to these numbers.
>>
>
> I have identified an issue that limits the number of active threads when
> load is high, causing some regressions. I'll fix it and rerun the tests on
> Monday.
>

Once the issue was solved, the change caused high load averages for some
workloads, which actually resulted in a regression (too much I/O, I guess)
instead of improving performance. So I added a configurable maximum number
of threads and made the whole implementation optional, so that it can be
safely used when required.

I did some tests and I was able to get, at least, the same performance we
had before this patch in all cases. In some cases it was even better. But
each test needed a manual configuration of the number of threads.

I need to work on a way to automatically compute the maximum so that it can
be used easily in any workload (or even combined workloads).

I uploaded the latest version of the patch.

Xavi


> Xavi
>
>
>>
>> Regards,
>> Vijay
>>
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Performance improvements

2019-01-26 Thread Xavi Hernandez
On Fri, 25 Jan 2019, 08:53 Vijay Bellur wrote:

> Thank you for the detailed update, Xavi! This looks very interesting.
>
> On Thu, Jan 24, 2019 at 7:50 AM Xavi Hernandez 
> wrote:
>
>> Hi all,
>>
>> I've just updated a patch [1] that implements a new thread pool based on
>> a wait-free queue provided by userspace-rcu library. The patch also
>> includes an auto scaling mechanism that only keeps running the needed
>> amount of threads for the current workload.
>>
>> This new approach has some advantages:
>>
>>- It's provided globally inside libglusterfs instead of inside an
>>xlator
>>
>> This makes it possible that fuse thread and epoll threads transfer the
>> received request to another thread sooner, wasting less CPU and reacting
>> sooner to other incoming requests.
>>
>>
>>- Adding jobs to the queue used by the thread pool only requires an
>>atomic operation
>>
>> This makes the producer side of the queue really fast, almost with no
>> delay.
>>
>>
>>- Contention is reduced
>>
>> The producer side has negligible contention thanks to the wait-free
>> enqueue operation based on an atomic access. The consumer side requires a
>> mutex, but the duration is very small and the scaling mechanism makes sure
>> that there are no more threads than needed contending for the mutex.
>>
>>
>> This change disables io-threads, since it replaces part of its
>> functionality. However there are two things that could be needed from
>> io-threads:
>>
>>- Prioritization of fops
>>
>> Currently, io-threads assigns priorities to each fop, so that some fops
>> are handled before than others.
>>
>>
>>- Fair distribution of execution slots between clients
>>
>> Currently, io-threads processes requests from each client in round-robin.
>>
>>
>> These features are not implemented right now. If they are needed,
>> probably the best thing to do would be to keep them inside io-threads, but
>> change its implementation so that it uses the global threads from the
>> thread pool instead of its own threads.
>>
>
>
> These features are indeed useful to have and hence modifying the
> implementation of io-threads to provide this behavior would be welcome.
>
>
>
>>
>>
>> These tests have shown that the limiting factor has been the disk in most
>> cases, so it's hard to tell if the change has really improved things. There
>> is only one clear exception: self-heal on a dispersed volume completes
>> 12.7% faster. The utilization of CPU has also dropped drastically:
>>
>> Old implementation: 12.30 user, 41.78 sys, 43.16 idle,  0.73 wait
>>
>> New implementation: 4.91 user,  5.52 sys, 81.60 idle,  5.91 wait
>>
>>
>> Now I'm running some more tests on NVMe to try to see the effects of the
>> change when disk is not limiting performance. I'll update once I've more
>> data.
>>
>>
> Will look forward to these numbers.
>

I have identified an issue that limits the number of active threads when
load is high, causing some regressions. I'll fix it and rerun the tests on
Monday.

Xavi


>
> Regards,
> Vijay
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Performance improvements

2019-01-24 Thread Xavi Hernandez
Hi all,

I've just updated a patch [1] that implements a new thread pool based on a
wait-free queue provided by userspace-rcu library. The patch also includes
an auto scaling mechanism that only keeps running the needed amount of
threads for the current workload.

This new approach has some advantages:

   - It's provided globally inside libglusterfs instead of inside an xlator

This makes it possible for the fuse thread and the epoll threads to
transfer the received request to another thread sooner, wasting less CPU
and reacting sooner to other incoming requests.


   - Adding jobs to the queue used by the thread pool only requires an
   atomic operation

This makes the producer side of the queue really fast, almost with no delay.


   - Contention is reduced

The producer side has negligible contention thanks to the wait-free enqueue
operation based on an atomic access. The consumer side requires a mutex,
but the duration is very small and the scaling mechanism makes sure that
there are no more threads than needed contending for the mutex.
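
As a toy illustration of the enqueue/dequeue model described above (not the
actual patch): the producer side is a wait-free enqueue from the
userspace-rcu library (urcu/wfcqueue.h) plus a semaphore post, while
consumers sleep on the semaphore and dequeue under the queue's small
internal lock. The job structure and the use of a counting semaphore are
assumptions made for this sketch.

#include <semaphore.h>
#include <stdlib.h>
#include <urcu/wfcqueue.h>

struct job {
    struct cds_wfcq_node node;        /* must stay the first member */
    void (*fn)(void *data);
    void  *data;
};

static struct cds_wfcq_head job_head;
static struct cds_wfcq_tail job_tail;
static sem_t                job_count;

static void pool_init(void)
{
    cds_wfcq_init(&job_head, &job_tail);
    sem_init(&job_count, 0, 0);
}

/* Producer side: one wait-free enqueue plus a semaphore post, so adding a
 * job is cheap and never blocks on other producers. */
static void pool_add(struct job *job)
{
    cds_wfcq_node_init(&job->node);
    cds_wfcq_enqueue(&job_head, &job_tail, &job->node);
    sem_post(&job_count);
}

/* Consumer side (worker thread loop): the dequeue takes the queue's small
 * internal lock, but contention stays low because idle workers sleep on
 * the semaphore instead of spinning. */
static void *pool_worker(void *arg)
{
    (void)arg;

    for (;;) {
        struct cds_wfcq_node *node;

        sem_wait(&job_count);
        node = cds_wfcq_dequeue_blocking(&job_head, &job_tail);
        if (node == NULL)
            continue;

        struct job *job = (struct job *)node; /* node is the first member */
        job->fn(job->data);
        free(job);
    }

    return NULL;
}

The auto-scaling part (starting and stopping workers depending on the
amount of pending jobs) is left out of the sketch.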


This change disables io-threads, since it replaces part of its
functionality. However there are two things that could be needed from
io-threads:

   - Prioritization of fops

Currently, io-threads assigns a priority to each fop, so that some fops are
handled before others.


   - Fair distribution of execution slots between clients

Currently, io-threads processes requests from each client in round-robin.


These features are not implemented right now. If they are needed, probably
the best thing to do would be to keep them inside io-threads, but change
its implementation so that it uses the global threads from the thread pool
instead of its own threads.

If this change proves it's performing better and is merged, I have some
more ideas to improve other areas of gluster:

   - Integrate synctask threads into the new thread pool

I think there is some contention in these threads because during some tests
I've seen them consuming most of the CPU. They probably suffer from the
same problem as io-threads, so replacing them could improve things.


   - Integrate timers into the new thread pool

My idea is to create a per-thread timer where code executed in one thread
will create timer events in the same thread. This makes it possible to use
structures that don't require any mutex to be modified.

Since the thread pool is basically executing computing tasks, which are
fast, I think it's feasible to implement a timer in the main loop of each
worker thread with a resolution of a few milliseconds, which I think is
good enough for gluster's needs.


   - Integrate with userspace-rcu library in QSBR mode

This will make it possible to use some RCU-based structures for anything
gluster uses (inodes, fd's, ...). These structures have very fast read
operations, which should reduce contention and improve performance in many
places.


   - Integrate I/O threads into the thread pool and reduce context switches

The idea here is a bit more complex. Basically I would like to have a
function that does an I/O on some device (for example reading fuse requests
or waiting for epoll events). We could send a request to the thread pool to
execute that function, so it would be executed inside one of the working
threads. When the I/O terminates (i.e. it has received a request), the idea
is that a call to the same function is added to the thread pool, so that
another thread could continue waiting for requests, but the current thread
will start processing the received request without a context switch.

Note that with all these changes, all dedicated threads that we currently
have in gluster could be replaced by the features provided by this new
thread pool, so these would be the only threads present in gluster. This is
specially important when brick-multiplex is used.

I've done some simple tests using a replica 3 volume and a disperse 4+2
volume. These tests are executed on a single machine using an HDD for each
brick (not the best scenario, but it should be fine for comparison). The
machine is quite powerful (dual Intel Xeon Silver 4114 @2.2 GHz, with 128
GiB RAM).

These tests have shown that the limiting factor has been the disk in most
cases, so it's hard to tell if the change has really improved things. There
is only one clear exception: self-heal on a dispersed volume completes
12.7% faster. The utilization of CPU has also dropped drastically:

Old implementation: 12.30 user, 41.78 sys, 43.16 idle,  0.73 wait

New implementation: 4.91 user,  5.52 sys, 81.60 idle,  5.91 wait


Now I'm running some more tests on NVMe to try to see the effects of the
change when disk is not limiting performance. I'll update once I've more
data.

Xavi

[1] https://review.gluster.org/c/glusterfs/+/20636
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Regression health for release-5.next and release-6

2019-01-15 Thread Xavi Hernandez
On Mon, Jan 14, 2019 at 11:08 AM Ashish Pandey  wrote:

>
> I downloaded logs of regression runs 1077 and 1073 and tried to
> investigate it.
> In both regression ec/bug-1236065.t is hanging on TEST 70  which is trying
> to get the online brick count
>
> I can see that in mount/bricks and glusterd logs it has not move forward
> after this test.
> glusterd.log  -
>
> [2019-01-06 16:27:51.346408]:++
> G_LOG:./tests/bugs/ec/bug-1236065.t: TEST: 70 5 online_brick_count
> ++
> [2019-01-06 16:27:51.645014] I [MSGID: 106499]
> [glusterd-handler.c:4404:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume patchy
> [2019-01-06 16:27:51.646664] I [dict.c:2745:dict_get_str_boolean]
> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x4a6c3)
> [0x7f4c37fe06c3]
> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x43b3a)
> [0x7f4c37fd9b3a]
> -->/build/install/lib/libglusterfs.so.0(dict_get_str_boolean+0x170)
> [0x7f4c433d83fb] ) 0-dict: key nfs.disable, integer type asked, has string
> type [Invalid argument]
> [2019-01-06 16:27:51.647177] I [dict.c:2361:dict_get_strn]
> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32)
> [0x7f4c38095a32]
> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac)
> [0x7f4c37fdd4ac]
> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179)
> [0x7f4c433d7673] ) 0-dict: key brick0.rdma_port, string type asked, has
> integer type [Invalid argument]
> [2019-01-06 16:27:51.647227] I [dict.c:2361:dict_get_strn]
> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32)
> [0x7f4c38095a32]
> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac)
> [0x7f4c37fdd4ac]
> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179)
> [0x7f4c433d7673] ) 0-dict: key brick1.rdma_port, string type asked, has
> integer type [Invalid argument]
> [2019-01-06 16:27:51.647292] I [dict.c:2361:dict_get_strn]
> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32)
> [0x7f4c38095a32]
> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac)
> [0x7f4c37fdd4ac]
> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179)
> [0x7f4c433d7673] ) 0-dict: key brick2.rdma_port, string type asked, has
> integer type [Invalid argument]
> [2019-01-06 16:27:51.647333] I [dict.c:2361:dict_get_strn]
> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32)
> [0x7f4c38095a32]
> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac)
> [0x7f4c37fdd4ac]
> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179)
> [0x7f4c433d7673] ) 0-dict: key brick3.rdma_port, string type asked, has
> integer type [Invalid argument]
> [2019-01-06 16:27:51.647371] I [dict.c:2361:dict_get_strn]
> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32)
> [0x7f4c38095a32]
> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac)
> [0x7f4c37fdd4ac]
> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179)
> [0x7f4c433d7673] ) 0-dict: key brick4.rdma_port, string type asked, has
> integer type [Invalid argument]
> [2019-01-06 16:27:51.647409] I [dict.c:2361:dict_get_strn]
> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32)
> [0x7f4c38095a32]
> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac)
> [0x7f4c37fdd4ac]
> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179)
> [0x7f4c433d7673] ) 0-dict: key brick5.rdma_port, string type asked, has
> integer type [Invalid argument]
> [2019-01-06 16:27:51.647447] I [dict.c:2361:dict_get_strn]
> (-->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0xffa32)
> [0x7f4c38095a32]
> -->/build/install/lib/glusterfs/6dev/xlator/mgmt/glusterd.so(+0x474ac)
> [0x7f4c37fdd4ac]
> -->/build/install/lib/libglusterfs.so.0(dict_get_strn+0x179)
> [0x7f4c433d7673] ) 0-dict: key brick6.rdma_port, string type asked, has
> integer type [Invalid argument]
> [2019-01-06 16:27:51.649335] E [MSGID: 101191]
> [event-epoll.c:759:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch
> handler
> [2019-01-06 16:27:51.932871] I [MSGID: 106499]
> [glusterd-handler.c:4404:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume patchy
>
> It is just taking lot of time to get the status at this point.
> It looks like there could be some issue with connection or the handing of
> volume status when some bricks are down.
>

The 'online_brick_count' check uses 'gluster volume status' to get some
information, and it does that several times (currently 7). Looking at
cmd_history.log, I see that after the 'online_brick_count' at line 70, only
one 'gluster volume status' has completed. Apparently the second 'gluster
volume status' is hung.

In cli.log I see that the second 'gluster volume status' seems to have
started, but not finished:

Normal run:

[2019-01-08 16:36:43.628821] I [cli.c:834:main] 0-cli: Started 

Re: [Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench

2019-01-01 Thread Xavi Hernandez
On Mon, Dec 24, 2018 at 11:30 AM Sankarshan Mukhopadhyay <
sankarshan.mukhopadh...@gmail.com> wrote:

> [pulling the conclusions up to enable better in-line]
>
> > Conclusions:
> >
> > We should never have a volume with caching-related xlators disabled. The
> price we pay for it is too high. We need to make them work consistently and
> aggressively to avoid as many requests as we can.
>
> Are there current issues in terms of behavior which are known/observed
> when these are enabled?
>
> > We need to analyze client/server xlators deeper to see if we can avoid
> some delays. However optimizing something that is already at the
> microsecond level can be very hard.
>
> That is true - are there any significant gains which can be accrued by
> putting efforts here or, should this be a lower priority?
>

I would say that for volumes based on spinning disks this is not a high
priority, but if we want to provide good performance for NVME storage, this
is something that needs to be done. On NVME, reads and writes can be served
in a few tens of microseconds, so adding 100 us in the network layer could
easily mean a performance reduction of 70% or more.


> > We need to determine what causes the fluctuations in brick side and
> avoid them.
> > This scenario is very similar to a smallfile/metadata workload, so this
> is probably one important cause of its bad performance.
>
> What kind of instrumentation is required to enable the determination?
>
> On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez 
> wrote:
> >
> > Hi,
> >
> > I've done some tracing of the latency that network layer introduces in
> gluster. I've made the analysis as part of the pgbench performance issue
> (in particular the initialization and scaling phase), so I decided to look
> at READV for this particular workload, but I think the results can be
> extrapolated to other operations that also have small latency (cached data
> from FS for example).
> >
> > Note that measuring latencies introduces some latency. It consists in a
> call to clock_get_time() for each probe point, so the real latency will be
> a bit lower, but still proportional to these numbers.
> >
>
> [snip]
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench

2018-12-21 Thread Xavi Hernandez
Hi,

I've done some tracing of the latency that the network layer introduces in
gluster. I've made the analysis as part of the pgbench performance issue
(in particular the initialization and scaling phase), so I decided to look
at READV for this particular workload, but I think the results can be
extrapolated to other operations that also have a small latency (for
example, data cached by the FS).

Note that measuring latencies introduces some latency itself. It consists
of a call to clock_gettime() for each probe point, so the real latency will
be a bit lower, but still proportional to these numbers.
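
Conceptually each probe is as simple as the sketch below (names are made
up; the real instrumentation points are the ones listed in the tables that
follow):

/* Timestamp helper used at each probe point; the latency between two
 * probes is just the difference of the two values. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static inline uint64_t probe_us(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000 + (uint64_t)ts.tv_nsec / 1000;
}

/* Example for one pair of probe points:
 *
 *     uint64_t t0 = probe_us();
 *     ... code between client_readv and client_submit_request ...
 *     uint64_t t1 = probe_us();
 *     fprintf(stderr, "client_readv -> client_submit_request: %llu us\n",
 *             (unsigned long long)(t1 - t0));
 */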

Raw results (times in microseconds):

Client:
         Min       Avg        Max  From -> To
       1.205     3.554     50.371  client_readv -> client_submit_request
       0.828     2.472     44.811  client_submit_request -> rpc_clnt_submit
       0.792     5.347    267.897  rpc_clnt_submit -> socket_submit_outgoing_msg
       4.337    11.229    242.354  socket_submit_outgoing_msg -> msg sent
      25.447    99.988  49648.239  msg sent -> __socket_proto_state_machine   (time taken by kernel + network + brick)
       9.258    21.838    544.534  __socket_proto_state_machine -> rpc_clnt_handle_reply
       0.437     1.176     92.899  rpc_clnt_handle_reply -> client4_0_readv_cbk
       0.359     0.750     31.189  client4_0_readv_cbk -> UNWIND

Server:
         Min       Avg        Max  From -> To
       5.145    11.891    326.291  __socket_proto_state_machine -> rpcsvc_handle_rpc_call
       1.943     4.036    160.085  rpcsvc_handle_rpc_call -> req queued
       0.361     8.989  33658.985  req queued -> rpcsvc_request_handler
       0.125     0.591     47.348  rpcsvc_request_handler -> server4_0_readv
       2.064     5.373    643.653  server4_0_readv -> server4_readv_resume
      14.610    33.766    641.871  server4_readv_resume -> server4_readv_cbk
       0.144     0.349     23.488  server4_readv_cbk -> server_submit_reply
       1.018     2.302     39.741  server_submit_reply -> rpcsvc_submit_generic
       0.631     1.608     42.477  rpcsvc_submit_generic -> socket_submit_outgoing_msg
       6.013    22.009  48756.083  socket_submit_outgoing_msg -> msg sent

The total number of read requests is ~85.

If we look only at the averages, we can see that the latency of the READV
on posix is very low, ~33 us (the brick is a NVME, so this is expected).

The total latency on the brick side is 90.914 us (sum of all avg
latencies). This seems consistent with the latency the client is seeing
between the time the message is sent and the time the answer arrives.
There's a ~9 us gap that can probably be attributed to kernel processing
(I'm using a loopback device, so there are no real network latencies here).

This means that operations suffer from a ~70 us delay. This is not
important when an operation takes on the order of milliseconds, like reads
and writes on spinning disks, but for very fast operations (bricks on NVME
and most of the metadata operations when the information is already cached
on the brick) it means a factor of ~3x slower. So when we have huge
amounts of operations with very small latencies, the overhead is huge.

If we also add the latencies on the client side, we are talking of about
146 us per read request, while the brick is able to serve them in 33 us.
This is 4.4 times slower.

It's interesting to see the amount of latency that sending a message
introduces: on the client side we have 11 us (it's a very small request
containing only some values, like offset and size). On the brick it also
takes 11 us to read it, and 22 us to send the answer. It's probably higher
because it also sends the 8KB block read from disk. On the client side we
can see that it also takes ~22 us to read the answer. This could be
attributed to the system calls needed to send the data through the socket. In
that case we could try to send/recv all the data with a single call. Anyway,
the minimum latency is smaller, which makes me think that not all of the
latency is caused by system calls. We can also see that times on the brick
are very unstable: the time needed to process requests queued to another
thread is sometimes quite high, and so is the maximum time it can take to
send the answer.
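
As an illustration of the "single call" idea, scatter-gather I/O allows
sending a header and a payload with one system call. This is only a
self-contained sketch of the concept (it writes to stdout), not gluster's
actual socket code:

#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    char header[] = "rpc-header";     /* stand-in for the RPC record header */
    char payload[] = "payload-data";  /* stand-in for the data block */
    struct iovec iov[2];

    iov[0].iov_base = header;
    iov[0].iov_len = strlen(header);
    iov[1].iov_base = payload;
    iov[1].iov_len = strlen(payload);

    /* Both buffers are handed to the kernel in a single system call,
     * instead of one write() per buffer. */
    if (writev(STDOUT_FILENO, iov, 2) < 0)
        perror("writev");

    return 0;
}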

For the particular pgbench case, the number of requests is huge. With a
scaling factor of just 100, it sends hundreds of thousands of requests. As
an example, this run sent:

   - 520154 FSTAT (24.25 us avg latency on brick)
   - 847900 READV (31.30 us avg latency on brick)
   - 708850 WRITEV (64.39 us avg latency on brick)

Considering a total overhead of 113 us per request, we have 235 seconds of
overhead only on the network layer. Considering that the whole test took
476 seconds, this represents ~50% of the total time.
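
(Quick check of the arithmetic: 520154 + 847900 + 708850 is roughly 2.08
million requests; at ~113 us of extra latency each, that's about 235 seconds,
consistent with the numbers above.)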

Conclusions:

   - We should never have a volume with caching-related xlators disabled.
   The price we pay for it is too high. We need to 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Branched and further dates

2018-10-11 Thread Xavi Hernandez
On Wed, Oct 10, 2018 at 10:03 PM Shyam Ranganathan 
wrote:

> On 09/26/2018 10:21 AM, Shyam Ranganathan wrote:
> > 3. Upgrade testing
> >   - Need *volunteers* to do the upgrade testing as stated in the 4.1
> > upgrade guide [3] to note any differences or changes to the same
> >   - Explicit call out on *disperse* volumes, as we continue to state
> > online upgrade is not possible, is this addressed and can this be tested
> > and the documentation improved around the same?
>
> Completed upgrade testing using RC1 packages against a 4.1 cluster.
> Things hold up fine. (replicate type volumes)
>
> I have not attempted a rolling upgrade of disperse volumes, as we still
> lack instructions to do so. @Pranith/@Xavi is this feasible this release
> onward?
>

There were some problems with the optimistic-change-log option. I think that
disabling it before upgrading (and giving it some time to become effective)
makes it possible to successfully complete a rolling upgrade, but I've not
tested it personally.

Adding @Ashish Pandey  who tested it, and may know
more details about the procedure.

Xavi


> Shyam
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] POC- Distributed regression testing framework

2018-10-04 Thread Xavi Hernandez
On Thu, Oct 4, 2018 at 9:47 AM Amar Tumballi  wrote:

>
>
> On Thu, Oct 4, 2018 at 12:54 PM Xavi Hernandez 
> wrote:
>
>> On Wed, Oct 3, 2018 at 11:57 AM Deepshikha Khandelwal <
>> dkhan...@redhat.com> wrote:
>>
>>> Hello folks,
>>>
>>> Distributed-regression job[1] is now a part of Gluster's
>>> nightly-master build pipeline. The following are the issues we have
>>> resolved since we started working on this:
>>>
>>> 1) Collecting gluster logs from servers.
>>> 2) Tests failed due to infra-related issues have been fixed.
>>> 3) Time taken to run regression testing reduced to ~50-60 minutes.
>>>
>>> To get time down to 40 minutes needs your help!
>>>
>>> Currently, there is a test that is failing:
>>>
>>> tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t
>>>
>>> This needs fixing first.
>>>
>>> There's a test that takes 14 minutes to complete -
>>> `tests/bugs/index/bug-1559004-EMLINK-handling.t`. A single test taking
>>> 14 minutes is not something we can distribute. Can we look at how we
>>> can speed this up[2]? When this test fails, it is re-attempted,
>>> further increasing the time. This happens in the regular
>>> centos7-regression job as well.
>>>
>>
>> I made a change [1] to reduce the amount of time this test needs. With
>> this change the test completes in about 90 seconds. It would need some
>> reviews from maintainers though.
>>
>> Do you want me to send a patch with this change alone?
>>
>> Xavi
>>
>> [1]
>> https://review.gluster.org/#/c/glusterfs/+/19254/22/tests/bugs/index/bug-1559004-EMLINK-handling.t
>>
>>
>
> Yes please! It would be useful! We can merge it sooner that way!
>

Patch: https://review.gluster.org/21341


>
> -Amar
>
>
>>
>>> If you see any other issues, please file a bug[3].
>>>
>>> [1]: https://build.gluster.org/job/distributed-regression
>>> [2]: https://build.gluster.org/job/distributed-regression/264/console
>>> [3]:
>>> https://bugzilla.redhat.com/enter_bug.cgi?product=glusterfs=project-infrastructure
>>>
>>> Thanks,
>>> Deepshikha Khandelwal
>>> On Tue, Jun 26, 2018 at 9:02 AM Nigel Babu  wrote:
>>> >
>>> >
>>> >
>>> > On Mon, Jun 25, 2018 at 7:28 PM Amar Tumballi 
>>> wrote:
>>> >>
>>> >>
>>> >>
>>> >>> There are currently a few known issues:
>>> >>> * Not collecting the entire logs (/var/log/glusterfs) from servers.
>>> >>
>>> >>
>>> >> If I look at the activities involved with regression failures, this
>>> can wait.
>>> >
>>> >
>>> > Well, we can't debug the current failures without having the logs. So
>>> this has to be fixed first.
>>> >
>>> >>
>>> >>
>>> >>>
>>> >>> * A few tests fail due to infra-related issues like geo-rep tests.
>>> >>
>>> >>
>>> >> Please open bugs for this, so we can track them, and take it to
>>> closure.
>>> >
>>> >
>>> > These are failing due to infra reasons. Most likely subtle differences
>>> in the setup of these nodes vs our normal nodes. We'll only be able to
>>> debug them once we get the logs. I know the geo-rep ones are easy to fix.
>>> The playbook for setting up geo-rep correctly just didn't make it over to
>>> the playbook used for these images.
>>> >
>>> >>
>>> >>
>>> >>>
>>> >>> * Takes ~80 minutes with 7 distributed servers (targeting 60
>>> minutes)
>>> >>
>>> >>
>>> >> Time can change with more tests added, and also please plan to have
>>> the number of servers as 1 to n.
>>> >
>>> >
>>> > While n is configurable, it will be fixed to a single-digit number
>>> for now. We will need to place *some* limitation somewhere or
>>> else we'll end up not being able to control our cloud bills.
>>> >
>>> >>
>>> >>
>>> >>>
>>> >>> * We've only tested plain regressions. ASAN and Valgrind are
>>> currently untested.
>>> >>
>>> >>
>>> >> Great to have it running not 'per patch', but as nightly, or weekly
>>

Re: [Gluster-devel] POC- Distributed regression testing framework

2018-10-04 Thread Xavi Hernandez
On Wed, Oct 3, 2018 at 11:57 AM Deepshikha Khandelwal 
wrote:

> Hello folks,
>
> Distributed-regression job[1] is now a part of Gluster's
> nightly-master build pipeline. The following are the issues we have
> resolved since we started working on this:
>
> 1) Collecting gluster logs from servers.
> 2) Tests failed due to infra-related issues have been fixed.
> 3) Time taken to run regression testing reduced to ~50-60 minutes.
>
> To get time down to 40 minutes needs your help!
>
> Currently, there is a test that is failing:
>
> tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t
>
> This needs fixing first.
>
> There's a test that takes 14 minutes to complete -
> `tests/bugs/index/bug-1559004-EMLINK-handling.t`. A single test taking
> 14 minutes is not something we can distribute. Can we look at how we
> can speed this up[2]? When this test fails, it is re-attempted,
> further increasing the time. This happens in the regular
> centos7-regression job as well.
>

I made a change [1] to reduce the amount of time this test needs. With
this change the test completes in about 90 seconds. It would need some
reviews from maintainers though.

Do you want me to send a patch with this change alone?

Xavi

[1]
https://review.gluster.org/#/c/glusterfs/+/19254/22/tests/bugs/index/bug-1559004-EMLINK-handling.t


>
> If you see any other issues, please file a bug[3].
>
> [1]: https://build.gluster.org/job/distributed-regression
> [2]: https://build.gluster.org/job/distributed-regression/264/console
> [3]:
> https://bugzilla.redhat.com/enter_bug.cgi?product=glusterfs=project-infrastructure
>
> Thanks,
> Deepshikha Khandelwal
> On Tue, Jun 26, 2018 at 9:02 AM Nigel Babu  wrote:
> >
> >
> >
> > On Mon, Jun 25, 2018 at 7:28 PM Amar Tumballi 
> wrote:
> >>
> >>
> >>
> >>> There are currently a few known issues:
> >>> * Not collecting the entire logs (/var/log/glusterfs) from servers.
> >>
> >>
> >> If I look at the activities involved with regression failures, this can
> wait.
> >
> >
> > Well, we can't debug the current failures without having the logs. So
> this has to be fixed first.
> >
> >>
> >>
> >>>
> >>> * A few tests fail due to infra-related issues like geo-rep tests.
> >>
> >>
> >> Please open bugs for this, so we can track them, and take it to closure.
> >
> >
> > These are failing due to infra reasons. Most likely subtle differences
> in the setup of these nodes vs our normal nodes. We'll only be able to
> debug them once we get the logs. I know the geo-rep ones are easy to fix.
> The playbook for setting up geo-rep correctly just didn't make it over to
> the playbook used for these images.
> >
> >>
> >>
> >>>
> >>> * Takes ~80 minutes with 7 distributed servers (targeting 60 minutes)
> >>
> >>
> >> Time can change with more tests added, and also please plan to have
> the number of servers as 1 to n.
> >
> >
> > While n is configurable, it will be fixed to a single-digit
> number for now. We will need to place *some* limitation somewhere or else
> we'll end up not being able to control our cloud bills.
> >
> >>
> >>
> >>>
> >>> * We've only tested plain regressions. ASAN and Valgrind are currently
> untested.
> >>
> >>
> >> Great to have it running not 'per patch', but as nightly, or weekly to
> start with.
> >
> >
> > This is currently not targeted until we phase out current regressions.
> >
> >>>
> >>>
> >>> Before bringing it into production, we'll run this job nightly and
> >>> watch it for a month to debug the other failures.
> >>>
> >>
> >> I would say, bring it to production sooner, say 2 weeks, and also plan
> to have the current regression as is with a special command like 'run
> regression in-one-machine' in gerrit (or something similar) with voting
> rights, so we can fall back to this method if something is broken in
> parallel testing.
> >>
> >> I have seen that regardless of the amount of time we put some scripts in
> testing, the day we move to production, something would be broken. So, let
> that happen earlier than later, so it would help next release branching
> out. Don't want to be stuck for branching due to infra failures.
> >
> >
> > Having two regression jobs that can vote is going to cause more
> confusion than it's worth. There are a couple of intermittent memory issues
> with the test script that we need to debug and fix before I'm comfortable
> in making this job a voting job. We've worked around these problems right
> now. It still pops up now and again. The fact that things break often is
> not an excuse to prevent avoidable failures.  The one month timeline was
> taken with all these factors into consideration. The 2-week timeline is a
> no-go at this point.
> >
> > When we are ready to make the switch, we won't be switching 100% of the
> job. We'll start with a sliding scale so that we can monitor failures and
> machine creation adequately.
> >
> > --
> > nigelb
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> 

[Gluster-devel] Gluster performance updates

2018-10-01 Thread Xavi Hernandez
Hi,

this is an update containing some work done regarding performance and
consistency during latest weeks. We'll try to build a complete list of all
known issues and track them through this email thread. Please, let me know
of any performance issue not included in this email so that we can build
and track all of them.

*New improvements*

While testing performance on Red Hat products, we have identified a problem
in the way eager-locking was working on replicate volumes for some
scenarios (virtualization and database workloads were affected). It caused
an unnecessary amount of finodelk and fxattrop requests, that was
increasing latency of write operations.

This has already been fixed with patches [1] and [2].

We have also identified some additional settings that provide better
performance for database workloads. A patch [3] to update the default
database profile with the new settings has been merged.

Combining all these changes (AFR fix and settings), pgbench performance has
improved ~300% on bare metal using NVME, and a random I/O fio test running
on VM has also improved more than 300%.

*Known issues*

We have identified two issues in fuse mounts:

   - Because of selinux on the client machine, a getxattr request is sent by
   fuse before each write request. Though it adds some latency, this request
   is currently answered directly by the fuse xlator when selinux is not
   enabled in gluster (the default setting).


   - When *fopen-keep-cache* is enabled (the default setting), kernel fuse
   sends a stat request before each read. Even after disabling fopen-keep-cache,
   fuse still sends half of the stat requests. This has been tracked down to the
   atime update; however, mounting a volume with noatime doesn't solve the
   issue because kernel fuse doesn't correctly handle the noatime setting.

Some other issues are detected:

   - Bad performance of write-behind when stats and writes to the same file
   are mixed. Right now, when a stat is received, all previously cached writes
   are flushed before processing the new request. The same happens for reads
   when they overlap with a previously cached write. This makes write-behind
   useless in this scenario.

*Note*: fuse is currently sending stat requests before reads (see previous
known issue), making reads almost as problematic as stat requests.


   - Self-heal seems to be slow. It's still being investigated but there
   are some indications that we have a considerable amount of contention in
   io-threads. This contention could be the cause of some other performance
   issues, but we'll need to investigate more about this. There is already
   some work [4] trying to reduce it.


   - 'ls' performance is not good in some cases. When the volume has many
   bricks, 'ls' performance tends to degrade. We are still investigating the
   cause, but one important factor is that DHT sends readdir(p) requests to
   all its subvolumes. This means that 'ls' will run at the speed of the
   slowest of the bricks. If any brick has an issue, or a spike in load, even
   if it's transitory, it will have a bad impact on 'ls' performance. This can
   be alleviated by enabling the parallel-readdir and readdir-ahead options.

*Note*: There have been reports that enabling parallel-readdir causes some
entries to apparently disappear after some time (though they are still
present on the bricks). I'm not aware of the root cause yet.


   - The number of threads in a server is quite high when multiple bricks
   are present, even if brick-mux is used. There are some efforts [5] trying
   to reduce this number.


*New features*

We have recently started the design [6] of a new caching infrastructure
that should provide much better performance, specially for small files or
metadata intensive workloads. It should also provide a safe infrastructure
to keep cached information consistent on all clients.

This framework will make caching features available to any xlator that
could need them in an easy and safe way.

The current thinking is that the existing caching xlators (mostly
md-cache, io-cache and write-behind) will probably be reworked into a single
complete caching xlator, since this makes things easier.

Any feedback or ideas will be highly appreciated.

Xavi

[1] https://review.gluster.org/21107
[2] https://review.gluster.org/21210
[3] https://review.gluster.org/21247
[4] https://review.gluster.org/21039
[5] https://review.gluster.org/20859
[6]
https://docs.google.com/document/d/1elX-WZfPWjfTdJxXhgwq37CytRehPO4D23aaVowtiE8/edit?usp=sharing
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Gluster performance improvements

2018-09-25 Thread Xavi Hernandez
Hi,

we are starting to design the next cache implementation for gluster that
should provide much better latencies, increasing performance. The document
[1] with the high level approach will be used as a starting point to design
the final architecture. Any comments will be highly appreciated so that we
can converge to the best design and start implementing it soon.

This can also be tracked on GitHub [2].

Best regards,

Xavi

[1]
https://docs.google.com/document/d/1elX-WZfPWjfTdJxXhgwq37CytRehPO4D23aaVowtiE8/edit?usp=sharing
[2] https://github.com/gluster/glusterfs/issues/218
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Xavi Hernandez
On Thu, Aug 2, 2018 at 1:42 PM Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

>
>
> On Thu, Aug 2, 2018 at 5:05 PM, Atin Mukherjee  > wrote:
>
>>
>>
>> On Thu, Aug 2, 2018 at 4:37 PM Kotresh Hiremath Ravishankar <
>> khire...@redhat.com> wrote:
>>
>>>
>>>
>>> On Thu, Aug 2, 2018 at 3:49 PM, Xavi Hernandez 
>>> wrote:
>>>
>>>> On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee 
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee 
>>>>> wrote:
>>>>>
>>>>>> I just went through the nightly regression report of brick mux runs
>>>>>> and here's what I can summarize.
>>>>>>
>>>>>>
>>>>>> =
>>>>>> Fails only with brick-mux
>>>>>>
>>>>>> =
>>>>>> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even
>>>>>> after 400 secs. Refer
>>>>>> https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
>>>>>> specifically the latest report
>>>>>> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText
>>>>>> . Wasn't timing out as frequently as it was till 12 July. But since 27
>>>>>> July, it has timed out twice. Beginning to believe commit
>>>>>> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
>>>>>> secs isn't sufficient enough (Mohit?)
>>>>>>
>>>>>> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
>>>>>> (Ref -
>>>>>> https://build.gluster.org/job/regression-test-with-multiplex/814/console)
>>>>>> -  Test fails only in brick-mux mode, AI on Atin to look at and get back.
>>>>>>
>>>>>> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
>>>>>> https://build.gluster.org/job/regression-test-with-multiplex/813/console)
>>>>>> - Seems like failed just twice in last 30 days as per
>>>>>> https://fstat.gluster.org/failure/251?state=2_date=2018-06-30_date=2018-07-31=all.
>>>>>> Need help from AFR team.
>>>>>>
>>>>>> tests/bugs/quota/bug-1293601.t (
>>>>>> https://build.gluster.org/job/regression-test-with-multiplex/812/console)
>>>>>> - Hasn't failed after 26 July and earlier it was failing regularly. Did 
>>>>>> we
>>>>>> fix this test through any patch (Mohit?)
>>>>>>
>>>>>> tests/bitrot/bug-1373520.t - (
>>>>>> https://build.gluster.org/job/regression-test-with-multiplex/811/console)
>>>>>> - Hasn't failed after 27 July and earlier it was failing regularly. Did 
>>>>>> we
>>>>>> fix this test through any patch (Mohit?)
>>>>>>
>>>>>
>>>>> I see this has failed in day before yesterday's regression run as well
>>>>> (and I could reproduce it locally with brick mux enabled). The test fails
>>>>> in healing a file within a particular time period.
>>>>>
>>>>> 15:55:19 not ok 25 Got "0" instead of "512", LINENUM:55
>>>>> 15:55:19 FAILED COMMAND: 512 path_size /d/backends/patchy5/FILE1
>>>>>
>>>>> Need EC dev's help here.
>>>>>
>>>>
>>>> I'm not sure where the problem is exactly. I've seen that when the test
>>>> fails, self-heal is attempting to heal the file, but when the file is
>>>> accessed, an Input/Output error is returned, aborting heal. I've checked
>>>> that a heal is attempted every time the file is accessed, but it fails
>>>> always. This error seems to come from bit-rot stub xlator.
>>>>
>>>> When in this situation, if I stop and start the volume, self-heal
>>>> immediately heals the files. It seems like a stale state that is kept by
>>>> the stub xlator, preventing the file from being healed.
>>>>
>&

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Xavi Hernandez
On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee  wrote:

>
>
> On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee 
> wrote:
>
>> I just went through the nightly regression report of brick mux runs and
>> here's what I can summarize.
>>
>>
>> =
>> Fails only with brick-mux
>>
>> =
>> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
>> 400 secs. Refer
>> https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
>> specifically the latest report
>> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText .
>> Wasn't timing out as frequently as it was till 12 July. But since 27 July,
>> it has timed out twice. Beginning to believe commit
>> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
>> secs isn't sufficient enough (Mohit?)
>>
>> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
>> (Ref -
>> https://build.gluster.org/job/regression-test-with-multiplex/814/console)
>> -  Test fails only in brick-mux mode, AI on Atin to look at and get back.
>>
>> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
>> https://build.gluster.org/job/regression-test-with-multiplex/813/console)
>> - Seems like failed just twice in last 30 days as per
>> https://fstat.gluster.org/failure/251?state=2_date=2018-06-30_date=2018-07-31=all.
>> Need help from AFR team.
>>
>> tests/bugs/quota/bug-1293601.t (
>> https://build.gluster.org/job/regression-test-with-multiplex/812/console)
>> - Hasn't failed after 26 July and earlier it was failing regularly. Did we
>> fix this test through any patch (Mohit?)
>>
>> tests/bitrot/bug-1373520.t - (
>> https://build.gluster.org/job/regression-test-with-multiplex/811/console)
>> - Hasn't failed after 27 July and earlier it was failing regularly. Did we
>> fix this test through any patch (Mohit?)
>>
>
> I see this has failed in day before yesterday's regression run as well
> (and I could reproduce it locally with brick mux enabled). The test fails
> in healing a file within a particular time period.
>
> 15:55:19 not ok 25 Got "0" instead of "512", LINENUM:55
> 15:55:19 FAILED COMMAND: 512 path_size /d/backends/patchy5/FILE1
>
> Need EC dev's help here.
>

I'll investigate this.


>
>
>> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core,
>> not sure if related to brick mux or not, so not sure if brick mux is
>> culprit here or not. Ref -
>> https://build.gluster.org/job/regression-test-with-multiplex/806/console
>> . Seems to be a glustershd crash. Need help from AFR folks.
>>
>>
>> =
>> Fails for non-brick mux case too
>>
>> =
>> tests/bugs/distribute/bug-1122443.t 0 Seems to be failing at my setup
>> very often, without brick mux as well. Refer
>> https://build.gluster.org/job/regression-test-burn-in/4050/consoleText .
>> There's an email in gluster-devel and a BZ 1610240 for the same.
>>
>> tests/bugs/bug-1368312.t - Seems to be recent failures (
>> https://build.gluster.org/job/regression-test-with-multiplex/815/console)
>> - seems to be a new failure, however seen this for a non-brick-mux case too
>> - https://build.gluster.org/job/regression-test-burn-in/4039/consoleText
>> . Need some eyes from AFR folks.
>>
>> tests/00-geo-rep/georep-basic-dr-tarssh.t - this isn't specific to brick
>> mux, have seen this failing at multiple default regression runs. Refer
>> https://fstat.gluster.org/failure/392?state=2_date=2018-06-30_date=2018-07-31=all
>> . We need help from geo-rep dev to root cause this earlier than later
>>
>> tests/00-geo-rep/georep-basic-dr-rsync.t - this isn't specific to brick
>> mux, have seen this failing at multiple default regression runs. Refer
>> https://fstat.gluster.org/failure/393?state=2_date=2018-06-30_date=2018-07-31=all
>> . We need help from geo-rep dev to root cause this earlier than later
>>
>> tests/bugs/glusterd/validating-server-quorum.t (
>> https://build.gluster.org/job/regression-test-with-multiplex/810/console)
>> - Fails for non-brick-mux cases too,
>> https://fstat.gluster.org/failure/580?state=2_date=2018-06-30_date=2018-07-31=all
>> .  Atin has a patch https://review.gluster.org/20584 which resolves it
>> but patch is failing regression for a different test which is unrelated.
>>
>> 

Re: [Gluster-devel] [Gluster-infra] bug-1432542-mpx-restart-crash.t failing

2018-07-09 Thread Xavi Hernandez
On Mon, Jul 9, 2018 at 11:14 AM Karthik Subrahmanya 
wrote:

> Hi Deepshikha,
>
> Are you looking into this failure? I can still see this happening for all
> the regression runs.
>

I've executed the failing script on my laptop and all tests finish
relatively fast. What seems to take time is the final cleanup. I can see
'semanage' taking some CPU during destruction of volumes. The test required
350 seconds to finish successfully.

Not sure what caused the cleanup time to increase, but I've created a bug
[1] to track this and a patch [2] to give more time to this test. This
should allow all blocked regressions to complete successfully.

Xavi

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1599250
[2] https://review.gluster.org/20482


> Thanks & Regards,
> Karthik
>
> On Sun, Jul 8, 2018 at 7:18 AM Atin Mukherjee  wrote:
>
>>
>> https://build.gluster.org/job/regression-test-with-multiplex/794/display/redirect
>> has the same test failing. Is the reason of the failure different given
>> this is on jenkins?
>>
>> On Sat, 7 Jul 2018 at 19:12, Deepshikha Khandelwal 
>> wrote:
>>
>>> Hi folks,
>>>
>>> The issue[1] has been resolved. Now the softserve instance will be
>>> having 2GB RAM i.e. same as that of the Jenkins builder's sizing
>>> configurations.
>>>
>>> [1] https://github.com/gluster/softserve/issues/40
>>>
>>> Thanks,
>>> Deepshikha Khandelwal
>>>
>>> On Fri, Jul 6, 2018 at 6:14 PM, Karthik Subrahmanya 
>>> wrote:
>>> >
>>> >
>>> > On Fri 6 Jul, 2018, 5:18 PM Deepshikha Khandelwal, <
>>> dkhan...@redhat.com>
>>> > wrote:
>>> >>
>>> >> Hi Poornima/Karthik,
>>> >>
>>> >> We've looked into the memory error that this softserve instance have
>>> >> showed up. These machine instances have 1GB RAM which is not in the
>>> >> case with the Jenkins builder. It's 2GB RAM there.
>>> >>
>>> >> We've created the issue [1] and will solve it sooner.
>>> >
>>> > Great. Thanks for the update.
>>> >>
>>> >>
>>> >> Sorry for the inconvenience.
>>> >>
>>> >> [1] https://github.com/gluster/softserve/issues/40
>>> >>
>>> >> Thanks,
>>> >> Deepshikha Khandelwal
>>> >>
>>> >> On Fri, Jul 6, 2018 at 3:44 PM, Karthik Subrahmanya <
>>> ksubr...@redhat.com>
>>> >> wrote:
>>> >> > Thanks Poornima for the analysis.
>>> >> > Can someone work on fixing this please?
>>> >> >
>>> >> > ~Karthik
>>> >> >
>>> >> > On Fri, Jul 6, 2018 at 3:17 PM Poornima Gurusiddaiah
>>> >> > 
>>> >> > wrote:
>>> >> >>
>>> >> >> The same test case is failing for my patch as well [1]. I
>>> requested for
>>> >> >> a
>>> >> >> regression system and tried to reproduce it.
>>> >> >> From my analysis, the brick process (multiplexed) is consuming a
>>> lot of
>>> >> >> memory, and is being OOM killed. The regression has 1GB ram and the
>>> >> >> process
>>> >> >> is consuming more than 1GB. 1GB for 120 bricks is acceptable
>>> >> >> considering
>>> >> >> there is 1000 threads in that brick process.
>>> >> >> Ways to fix:
>>> >> >> - Increase the regression system RAM size OR
>>> >> >> - Decrease the number of volumes in the test case.
>>> >> >>
>>> >> >> But what is strange is why the test passes sometimes for some
>>> patches.
>>> >> >> There could be some bug/? in memory consumption.
>>> >> >>
>>> >> >> Regards,
>>> >> >> Poornima
>>> >> >>
>>> >> >>
>>> >> >> On Fri, Jul 6, 2018 at 2:11 PM, Karthik Subrahmanya
>>> >> >> 
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> $subject is failing on centos regression for most of the patches
>>> with
>>> >> >>> timeout error.
>>> >> >>>
>>> >> >>> 07:32:34
>>> >> >>>
>>> >> >>>
>>> 
>>> >> >>> 07:32:34 [07:33:05] Running tests in file
>>> >> >>> ./tests/bugs/core/bug-1432542-mpx-restart-crash.t
>>> >> >>> 07:32:34 Timeout set is 300, default 200
>>> >> >>> 07:37:34 ./tests/bugs/core/bug-1432542-mpx-restart-crash.t timed
>>> out
>>> >> >>> after 300 seconds
>>> >> >>> 07:37:34 ./tests/bugs/core/bug-1432542-mpx-restart-crash.t: bad
>>> status
>>> >> >>> 124
>>> >> >>> 07:37:34
>>> >> >>> 07:37:34*
>>> >> >>> 07:37:34*   REGRESSION FAILED   *
>>> >> >>> 07:37:34* Retrying failed tests in case *
>>> >> >>> 07:37:34* we got some spurious failures *
>>> >> >>> 07:37:34*
>>> >> >>> 07:37:34
>>> >> >>> 07:42:34 ./tests/bugs/core/bug-1432542-mpx-restart-crash.t timed
>>> out
>>> >> >>> after 300 seconds
>>> >> >>> 07:42:34 End of test
>>> ./tests/bugs/core/bug-1432542-mpx-restart-crash.t
>>> >> >>> 07:42:34
>>> >> >>>
>>> >> >>>
>>> 
>>> >> >>>
>>> >> >>> Can anyone take a look?
>>> >> >>>
>>> >> >>> Thanks,
>>> >> >>> Karthik
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>> ___
>>> >> >>> Gluster-devel mailing list
>>> >> >>> Gluster-devel@gluster.org
>>> >> >>> 

Re: [Gluster-devel] [features/locks] Fetching lock info in lookup

2018-06-20 Thread Xavi Hernandez
On Wed, Jun 20, 2018 at 4:29 PM Raghavendra Gowdappa 
wrote:

> Krutika,
>
> This patch doesn't seem to be getting counts per domain, like number of
> inodelks or entrylks acquired in a domain "xyz". Am I right? If per domain
> stats are not available, passing interested domains in xdata_req would be
> needed. Any suggestions on that?
>

We have GLUSTERFS_INODELK_DOM_COUNT. Its data should be the domain name for
which we want to know the number of inodelks (the count is returned in
GLUSTERFS_INODELK_COUNT though).

It only exists for inodelk. If you need it for entrylk, it would need to be
implemented.
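
As a rough sketch of how a caller could use these keys (illustrative fragment
only, not taken from the code base; the domain name is made up, 'xdata_rsp'
stands for the dict received in the callback, and the exact value type
returned for GLUSTERFS_INODELK_COUNT should be checked in features/locks):

/* Ask for the inodelk count of a specific domain when winding a fop. */
dict_t *xdata = dict_new();
int32_t count = 0;

if (xdata != NULL &&
    dict_set_str(xdata, GLUSTERFS_INODELK_DOM_COUNT, "my-xlator-domain") == 0) {
        /* ... wind the fop (e.g. lookup) passing 'xdata' as xdata_req ... */
}

/* In the callback, read the count from the returned xdata (assuming it is
 * stored as an int32): */
if (dict_get_int32(xdata_rsp, GLUSTERFS_INODELK_COUNT, &count) == 0) {
        /* 'count' now holds the number of inodelks for that domain. */
}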

Xavi


> regards,
> Raghavendra
>
> On Wed, Jun 20, 2018 at 12:58 PM, Raghavendra Gowdappa <
> rgowd...@redhat.com> wrote:
>
>>
>>
>> On Wed, Jun 20, 2018 at 12:06 PM, Krutika Dhananjay 
>> wrote:
>>
>>> We do already have a way to get inodelk and entrylk count from a bunch
>>> of fops, introduced in http://review.gluster.org/10880.
>>> Can you check if you can make use of this feature?
>>>
>>
>> Thanks Krutika. Yes, this feature meets DHT's requirement. We might need
>> a GLUSTERFS_PARENT_INODELK, but that can be easily added along the lines of
>> other counts. If necessary I'll send a patch to implement
>> GLUSTERFS_PARENT_INODELK.
>>
>>
>>> -Krutika
>>>
>>>
>>> On Wed, Jun 20, 2018 at 9:17 AM, Amar Tumballi 
>>> wrote:
>>>


 On Wed, Jun 20, 2018 at 9:06 AM, Raghavendra Gowdappa <
 rgowd...@redhat.com> wrote:

> All,
>
> We've a requirement in DHT [1] to query the number of locks granted on
> an inode in lookup fop. I am planning to use xdata_req in lookup to pass
> down the relevant arguments for this query. I am proposing following
> signature:
>
> In lookup request path following key value pairs will be passed in
> xdata_req:
> * "glusterfs.lock.type"
> - values can be "glusterfs.posix", "glusterfs.inodelk",
> "glusterfs.entrylk"
> * If the value of "glusterfs.lock.type" is "glusterfs.entrylk", then
> basename is passed as a value in xdata_req for key
> "glusterfs.entrylk.basename"
> * key "glusterfs.lock-on?" will differentiate whether the lock
> information is on current inode ("glusterfs.current-inode") or 
> parent-inode
> ("glusterfs.parent-inode"). For a nameless lookup "glusterfs.parent-inode"
> is invalid.
> * "glusterfs.blocked-locks" - Information should be limited to blocked
> locks.
> * "glusterfs.granted-locks" - Information should be limited to granted
> locks.
> * If necessary other information about granted locks, blocked locks
> can be added. Since, there is no requirement for now, I am not adding 
> these
> keys.
>
> Response dictionary will have information in following format:
> * "glusterfs.entrylk...granted-locks" - number of
> granted entrylks on inode "gfid" with "basename" (usually this value will
> be either 0 or 1 unless we introduce read/write lock semantics).
> * "glusterfs.inodelk..granted-locks" - number of granted
> inodelks on "basename"
>
> Thoughts?
>
>
 I personally feel it is good to get as much information as possible in
lookup, so it helps to take some high-level decisions better, in all
translators. So, the very broad answer would be to say go for it. The main
 reason the xdata is provided in all fops is to do these extra information
 fetching/overloading anyways.

 As you have clearly documented the need, it makes it even better to
 review and document it with commit. So, all for it.

 Regards,
 Amar


> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1581306#c28
>
>
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://lists.gluster.org/mailman/listinfo/gluster-devel

>>>
>>>
>>
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 4.1: LTM release targeted for end of May

2018-03-16 Thread Xavi Hernandez
On Tue, Mar 13, 2018 at 2:37 AM, Shyam Ranganathan 
wrote:

> Hi,
>
> As we wind down on 4.0 activities (waiting on docs to hit the site, and
> packages to be available in CentOS repositories before announcing the
> release), it is time to start preparing for the 4.1 release.
>
> 4.1 is where we have GD2 fully functional and shipping with migration
> tools to aid Glusterd to GlusterD2 migrations.
>
> Other than the above, this is a call out for features that are in the
> works for 4.1. Please *post* the github issues to the *devel lists* that
> you would like as a part of 4.1, and also mention the current state of
> development.
>

I would like to propose the transaction framework [1] as an experimental
feature for 4.1. The design [2] is done and I'm currently developing it.

Xavi

[1] https://github.com/gluster/glusterfs/issues/342
[2]
https://docs.google.com/document/d/1Zp99bsfLsB51tPen_zC8enANvOhX3o451pd5LyEdRSM/edit?usp=sharing


> Further, as we hit end of March, we would make it mandatory for features
> to have required spec and doc labels, before the code is merged, so
> factor in efforts for the same if not already done.
>
> Current 4.1 project release lane is empty! I cleaned it up, because I
> want to hear from all as to what content to add, than add things marked
> with the 4.1 milestone by default.
>
> Thanks,
> Shyam
> P.S: Also any volunteers to shadow/participate/run 4.1 as a release owner?
> ___
> maintainers mailing list
> maintain...@gluster.org
> http://lists.gluster.org/mailman/listinfo/maintainers
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] gNFS service management from glusterd

2018-02-21 Thread Xavi Hernandez
Hi all,

currently glusterd sends a SIGKILL to stop gNFS, while all other services
are stopped with a SIGTERM signal first (this can be seen in
glusterd_svc_stop() function of mgmt/glusterd xlator).

The question is why it cannot be stopped with SIGTERM like all other
services. Using SIGKILL blindly while write I/O is happening can cause
multiple inconsistencies at the same time. For a replicated volume this is
not a problem because it will take one of the replicas as the "good" one
and continue, but for a disperse volume, if the number of inconsistencies
is bigger than the redundancy value, a serious problem could appear.

The probability of this is very small (I've tried to reproduce this problem
on my laptop but have been unable to), but it exists.

Is there any known issue that prevents gNFS from being stopped with a
SIGTERM, or can it be changed safely ?
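
For illustration, the generic pattern would look like this (just a sketch,
not glusterd code; stop_service_gracefully() is a made-up helper):

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Send SIGTERM, wait up to 'timeout' seconds for the process to go away,
 * and only then fall back to SIGKILL.  For a child process a real
 * implementation would use waitpid() instead of probing with kill(pid, 0). */
static void stop_service_gracefully(pid_t pid, int timeout)
{
    int i;

    if (kill(pid, SIGTERM) != 0) {
        perror("kill(SIGTERM)");
        return;
    }

    for (i = 0; i < timeout * 10; i++) {
        if (kill(pid, 0) != 0)   /* process no longer exists */
            return;
        usleep(100000);          /* 100 ms */
    }

    kill(pid, SIGKILL);          /* last resort */
}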

Thanks,

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Race in protocol/client and RPC

2018-02-01 Thread Xavi Hernandez
On Thu, Feb 1, 2018 at 2:48 PM, Shyam Ranganathan <srang...@redhat.com>
wrote:

> On 02/01/2018 08:25 AM, Xavi Hernandez wrote:
> > After having tried several things, it seems that it will be complex to
> > solve these races. All attempts to fix them have caused failures in
> > other connections. Since I've other work to do and it doesn't seem to be
> > causing serious failures in production, for now I'll leave this. I'll
> > retake this when I've more time.
>
> Xavi, convert the findings into a bug, and post the details there, so
> that it may be followed up? (if not already done)
>

I've just created this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1541032


> >
> > Xavi
> >
> > On Mon, Jan 29, 2018 at 11:07 PM, Xavi Hernandez <jaher...@redhat.com
> > <mailto:jaher...@redhat.com>> wrote:
> >
> > Hi all,
> >
> > I've identified a race in RPC layer that caused some spurious
> > disconnections and CHILD_DOWN notifications.
> >
> > The problem happens when protocol/client reconfigures a connection
> > to move from glusterd to glusterfsd. This is done by calling
> > rpc_clnt_reconfig() followed by rpc_transport_disconnect().
> >
> > This seems fine because client_rpc_notify() will call
> > rpc_clnt_cleanup_and_start() when the disconnect notification is
> > received. However There's a problem.
> >
> > Suppose that the disconnection notification has been executed and we
> > are just about to call rpc_clnt_cleanup_and_start(). If at this
> > point the reconnection timer is fired, rpc_clnt_reconnect() will be
> > processed. This will cause the socket to be reconnected and a
> > connection notification will be processed. Then a handshake request
> > will be sent to the server.
> >
> > However, when rpc_clnt_cleanup_and_start() continues, all sent XID's
> > are deleted. When we receive the answer from the handshake, we are
> > unable to map the XID, making the request to fail. So the handshake
> > fails and the client is considered down, sending a CHILD_DOWN
> > notification to upper xlators.
> >
> > This causes, in some tests, to start processing things while a brick
> > is down unexpectedly, causing spurious failures on the test.
> >
> > To solve the problem I've forced the rpc_clnt_reconfig() function to
> > disable the RPC connection using similar code to rpc_clnt_disable().
> > This prevents the background rpc_clnt_reconnect() timer to be
> > executed, avoiding the problem.
> >
> > This seems to work fine for many tests, but it seems to be causing
> > some issue in gfapi based tests. I'm still investigating this.
> >
> > Xavi
> >
> >
> >
> >
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://lists.gluster.org/mailman/listinfo/gluster-devel
> >
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Race in protocol/client and RPC

2018-02-01 Thread Xavi Hernandez
After having tried several things, it seems that it will be complex to
solve these races. All attempts to fix them have caused failures in other
connections. Since I've other work to do and it doesn't seem to be causing
serious failures in production, for now I'll leave this. I'll retake this
when I've more time.

Xavi

On Mon, Jan 29, 2018 at 11:07 PM, Xavi Hernandez <jaher...@redhat.com>
wrote:

> Hi all,
>
> I've identified a race in RPC layer that caused some spurious
> disconnections and CHILD_DOWN notifications.
>
> The problem happens when protocol/client reconfigures a connection to move
> from glusterd to glusterfsd. This is done by calling rpc_clnt_reconfig()
> followed by rpc_transport_disconnect().
>
> This seems fine because client_rpc_notify() will call
> rpc_clnt_cleanup_and_start() when the disconnect notification is received.
> However, there's a problem.
>
> Suppose that the disconnection notification has been executed and we are
> just about to call rpc_clnt_cleanup_and_start(). If at this point the
> reconnection timer is fired, rpc_clnt_reconnect() will be processed. This
> will cause the socket to be reconnected and a connection notification will
> be processed. Then a handshake request will be sent to the server.
>
> However, when rpc_clnt_cleanup_and_start() continues, all sent XID's are
> deleted. When we receive the answer from the handshake, we are unable to
> map the XID, making the request to fail. So the handshake fails and the
> client is considered down, sending a CHILD_DOWN notification to upper
> xlators.
>
> This causes, in some tests, to start processing things while a brick is
> down unexpectedly, causing spurious failures on the test.
>
> To solve the problem I've forced the rpc_clnt_reconfig() function to
> disable the RPC connection using similar code to rpc_clnt_disable(). This
> prevents the background rpc_clnt_reconnect() timer to be executed, avoiding
> the problem.
>
> This seems to work fine for many tests, but it seems to be causing some
> issue in gfapi based tests. I'm still investigating this.
>
> Xavi
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Race in protocol/client and RPC

2018-01-29 Thread Xavi Hernandez
Hi all,

I've identified a race in RPC layer that caused some spurious
disconnections and CHILD_DOWN notifications.

The problem happens when protocol/client reconfigures a connection to move
from glusterd to glusterfsd. This is done by calling rpc_clnt_reconfig()
followed by rpc_transport_disconnect().

This seems fine because client_rpc_notify() will call
rpc_clnt_cleanup_and_start() when the disconnect notification is received.
However, there's a problem.

Suppose that the disconnection notification has been executed and we are
just about to call rpc_clnt_cleanup_and_start(). If at this point the
reconnection timer is fired, rpc_clnt_reconnect() will be processed. This
will cause the socket to be reconnected and a connection notification will
be processed. Then a handshake request will be sent to the server.

However, when rpc_clnt_cleanup_and_start() continues, all sent XID's are
deleted. When we receive the answer from the handshake, we are unable to
map the XID, making the request to fail. So the handshake fails and the
client is considered down, sending a CHILD_DOWN notification to upper
xlators.

This causes, in some tests, to start processing things while a brick is
down unexpectedly, causing spurious failures on the test.

To solve the problem I've forced the rpc_clnt_reconfig() function to
disable the RPC connection using similar code to rpc_clnt_disable(). This
prevents the background rpc_clnt_reconnect() timer to be executed, avoiding
the problem.

This seems to work fine for many tests, but it seems to be causing some
issue in gfapi based tests. I'm still investigating this.

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Regression tests time

2018-01-27 Thread Xavi Hernandez
Hi Amar,

On 28 Jan 2018 06:50, "Amar Tumballi" <atumb...@redhat.com> wrote:

Thanks for this experiment, Xavi!!

I see two proposals here in the thread.

1. Remove unnecessary sleep commands.
2. Try to bring explicit checks, so our tests are more consistent.

I am personally in favor of 1. Lets do this.

About 2, as already discussed, we may get into issues due to many
setups outside the glusterfs project causing much harder problems to debug. Not
sure if we should depend on our 'eventing' framework in such test cases ?
Would that help?


That would be a good way to detect when something can be done. I've not
worked along these lines yet. But this is not the only way. For example,
after the kill_brick command there was a sleep to give glusterd time to
become aware of the change. Instead of the sleep, we can directly ask
glusterd for the state of the brick. If it's down, we are done without
waiting unnecessarily. If for some reason it takes more than one second, we
won't fail spuriously because we are checking the state directly. For
extreme cases where something really fails, we can define a bigger timeout,
for example 5 seconds. This way we cover all cases, but in the most common
case it will only take some tens or hundreds of milliseconds.
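
To make the idea concrete, this is the generic pattern (just a sketch in C,
not code from the test framework; wait_for() is a made-up helper): poll the
condition with a short interval and an upper bound, so the common case
returns almost immediately and only a real failure waits for the full
timeout.

#include <stdbool.h>
#include <unistd.h>

/* Return true as soon as check(data) succeeds, false if it never succeeds
 * within timeout_ms. */
static bool wait_for(bool (*check)(void *), void *data, unsigned int timeout_ms)
{
    unsigned int waited = 0;

    while (waited < timeout_ms) {
        if (check(data))
            return true;
        usleep(50 * 1000);   /* 50 ms between polls */
        waited += 50;
    }
    return check(data);
}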

Reducing timeouts has made some races that currently exist in the code more
evident. So far I've identified a bug in AFR and a couple of races in the
RPC code that were causing spurious failures. I still have to identify
another race (probably also in RPC) that is generating unexpected
disconnections (or incorrect reconnections).

Xavi


Regards,
Amar

On Thu, Jan 25, 2018 at 8:07 PM, Xavi Hernandez <jaher...@redhat.com> wrote:

> On Thu, Jan 25, 2018 at 3:03 PM, Jeff Darcy <j...@pl.atyp.us> wrote:
>
>>
>>
>>
>> On Wed, Jan 24, 2018, at 9:37 AM, Xavi Hernandez wrote:
>>
>> That happens when we use arbitrary delays. If we use an explicit check,
>> it will work on all systems.
>>
>>
>> You're arguing against a position not taken. I'm not expressing
>> opposition to explicit checks. I'm just saying they don't come for free. If
>> you don't believe me, try adding explicit checks in some of the harder
>> cases where we're waiting for something that's subject to OS scheduling
>> delays, or for large numbers of operations to complete. Geo-replication or
>> multiplexing tests should provide some good examples. Adding explicit
>> conditions is the right thing to do in the abstract, but as a practical
>> matter the returns must justify the cost.
>>
>> BTW, some of our longest-running tests are in EC. Do we need all of
>> those, and do they all need to run as long, or could some be
>> eliminated/shortened?
>>
>
> Some tests were already removed some time ago. Anyway, with the changes
> introduced, it takes between 10 and 15 minutes to execute all ec related
> tests from basic/ec and bugs/ec (an average of 16 to 25 seconds per test).
> Before the changes, the same tests were taking between 30 and 60 minutes.
>
> AFR tests have also improved from almost 60 minutes to around 30.
>
>
>> I agree that parallelizing tests is the way to go, but if we reduce the
>> total time to 50%, the parallelized tests will also take 50% less of the
>> time.
>>
>>
>> Taking 50% less time but failing spuriously 1% of the time, or all of the
>> time in some environments, is not a good thing. If you want to add explicit
>> checks that's great, but you also mentioned shortening timeouts and that's
>> much more risky.
>>
>
> If we have a single test that takes 45 minutes (as we currently have in
> some executions: bugs/nfs/bug-1053579.t), parallelization won't help much.
> We need to make this test run faster.
>
> Some tests that were failing after the changes have revealed errors in the
> test itself or even in the code, so I think it's a good thing. Currently
> I'm investigating what seems a race in the rpc layer during connections
> that causes some tests to fail. This is a real problem that high delays or
> slow machines were hiding. It seems to cause some gluster requests to fail
> spuriously after reconnecting to a brick or glusterd. I'm not 100% sure
> about this yet, but initial analysis seems to indicate that.
>
> Xavi
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Amar Tumballi (amarts)
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Regression tests time

2018-01-25 Thread Xavi Hernandez
On Thu, Jan 25, 2018 at 3:03 PM, Jeff Darcy <j...@pl.atyp.us> wrote:

>
>
>
> On Wed, Jan 24, 2018, at 9:37 AM, Xavi Hernandez wrote:
>
> That happens when we use arbitrary delays. If we use an explicit check, it
> will work on all systems.
>
>
> You're arguing against a position not taken. I'm not expressing opposition
> to explicit checks. I'm just saying they don't come for free. If you don't
> believe me, try adding explicit checks in some of the harder cases where
> we're waiting for something that's subject to OS scheduling delays, or for
> large numbers of operations to complete. Geo-replication or multiplexing
> tests should provide some good examples. Adding explicit conditions is the
> right thing to do in the abstract, but as a practical matter the returns
> must justify the cost.
>
> BTW, some of our longest-running tests are in EC. Do we need all of those,
> and do they all need to run as long, or could some be eliminated/shortened?
>

Some tests were already removed some time ago. Anyway, with the changes
introduced, it takes between 10 and 15 minutes to execute all ec related
tests from basic/ec and bugs/ec (an average of 16 to 25 seconds per test).
Before the changes, the same tests were taking between 30 and 60 minutes.

AFR tests have also improved from almost 60 minutes to around 30.


> I agree that parallelizing tests is the way to go, but if we reduce the
> total time to 50%, the parallelized tests will also take 50% less of the
> time.
>
>
> Taking 50% less time but failing spuriously 1% of the time, or all of the
> time in some environments, is not a good thing. If you want to add explicit
> checks that's great, but you also mentioned shortening timeouts and that's
> much more risky.
>

If we have a single test that takes 45 minutes (as we currently have in
some executions: bugs/nfs/bug-1053579.t), parallelization won't help much.
We need to make this test run faster.

Some tests that were failing after the changes have revealed errors in the
test itself or even in the code, so I think it's a good thing. Currently
I'm investigating what seems a race in the rpc layer during connections
that causes some tests to fail. This is a real problem that high delays or
slow machines were hiding. It seems to cause some gluster requests to fail
spuriously after reconnecting to a brick or glusterd. I'm not 100% sure
about this yet, but initial analysis seems to indicate that.

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Regression tests time

2018-01-24 Thread Xavi Hernandez
On Wed, Jan 24, 2018 at 3:11 PM, Jeff Darcy <j...@pl.atyp.us> wrote:

>
>
>
> On Tue, Jan 23, 2018, at 12:58 PM, Xavi Hernandez wrote:
>
> I've made some experiments [1] with the time that centos regression takes
> to complete. After some changes the time taken to run a full regression has
> dropped between 2.5 and 3.5 hours (depending on the run time of 2 tests,
> see below).
>
> Basically the changes are related with delays manually introduced in some
> places (sleeps in test files or even in the code, or delays in timer
> events). I've changed some sleeps with better ways to detect some
> condition, and I've left the delays in other places but with reduced time.
> Probably the used values are not the best ones in all cases, but it
> highlights that we should seriously consider how we detect things instead
> of simply waiting for some amount of time (and hope it's enough). The total
> test time is more than 2 hours less with these changes, so this means that
> >2 hours of the whole regression time is spent waiting unnecessarily.
>
>
> We should definitely try to detect specific conditions instead of just
> sleeping for a fixed amount of time. That said, sometimes it would take
> significant additional effort to add a marker for a condition plus code to
> check for it. We need to be *really* careful about changing timeouts in
> these cases. It's easy to come up with something that works on one
> development system and then causes spurious failures for others.
>

That happens when we use arbitrary delays. If we use an explicit check, it
will work on all systems. Additionally, using specific checks makes it
possible to define bigger timeouts to handle corner cases because in the
normal case we'll continue as soon as the check is satisfied, which will be
almost always. But if it really fails, in those particular cases it will
take some time to detect it, which is fine because this way we allow enough
time for "normal" delays.

One of the biggest problems I had to deal with when I implemented
> multiplexing was these kinds of timing dependencies in tests, and I had to
> go through it all again when I came to Facebook. While I applaud the effort
> to reduce single-test times, I believe that parallelizing tests will
> long-term be a more effective (and definitely safer) route to reducing
> overall latency.
>

I agree that parallelizing tests is the way to go, but if we reduce the
total time to 50%, the parallelized tests will also take 50% less of the
time.

Additionally, reducing the time it takes to run each test is a good way to
detect corner cases. If we always sleep in some cases, we could be missing
some failures that can happen if there's no sleep (and users can do the
same requests as us but without sleeping).


> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Regression tests time

2018-01-23 Thread Xavi Hernandez
Hi,

I've made some experiments [1] with the time that centos regression takes
to complete. After some changes the time taken to run a full regression has
dropped between 2.5 and 3.5 hours (depending on the run time of 2 tests,
see below).

Basically the changes are related with delays manually introduced in some
places (sleeps in test files or even in the code, or delays in timer
events). I've changed some sleeps with better ways to detect some
condition, and I've left the delays in other places but with reduced time.
Probably the used values are not the best ones in all cases, but it
highlights that we should seriously consider how we detect things instead
of simply waiting for some amount of time (and hope it's enough). The total
test time is more than 2 hours less with these changes, so this means that
>2 hours of the whole regression time is spent waiting unnecessarily.

There are still some issues that I've been unable to solve. Probably the
most critical is the time taken by a couple of tests:

   - tests/bugs/nfs/bug-1053579.t
   - tests/bugs/fuse/many-groups-for-acl.t

These tests take around a minute if they work fine (~60 and ~45 seconds),
but sometimes they take a lot more time (~45 and ~30 minutes) without
failing. The difference is in the time that it takes to create some system
groups and users.

For example, one of the things the first test does is to create 200 groups.
This is done in ~25 seconds in fast cases and in ~15 minutes in slow cases.
This means that sometimes creating each group takes more than 4 seconds,
while other times it takes around 100 milliseconds. This is a >30x
difference.

I'm not sure what the cause of this is. If the slaves are connected to some
external Kerberos or LDAP source, maybe there are occasional network issues
(or service unavailability) that cause timeouts or delays. On my local
system (Fedora 27) I see high CPU usage by the sssd_be process during group
creation. I'm not sure why, or whether it also happens on the slaves, but it
seems a good candidate. However, on my system it always seems to take about
25 seconds to complete.

Even after the changes, the tests are still full of sleeps. There's one of
180 seconds (bugs/shard/parallel-truncate-read.t); I'm not sure it's really
necessary. There are many more with smaller delays of between 1 and 60
seconds. Assuming each sleep is executed only once, the total time spent in
sleeps is still around 15 minutes.

I still need to fix some tests that seem to be failing often after the
changes.

Xavi

[1] https://review.gluster.org/19254
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Simulating some kind of "virtual file"

2018-01-11 Thread Xavi Hernandez
Hi David,

On Wed, Jan 10, 2018 at 3:24 PM, David Spisla <david.spi...@iternity.com>
wrote:

> Hello Amar, Xavi
>
>
>
> *From:* Amar Tumballi [mailto:atumb...@redhat.com]
> *Sent:* Wednesday, January 10, 2018 14:16
> *To:* Xavi Hernandez <jaher...@redhat.com>; David Spisla <
> david.spi...@iternity.com>
> *Cc:* gluster-devel@gluster.org
> *Subject:* Re: [Gluster-devel] Simulating some kind of "virtual file"
>
>
>
> Check the files in $mountpoint/.meta/ directory. These are all virtual.
> And meta xlator gives a very good idea about how to handle virtual files
> (and directories).
>
>
>
> -Amar
>
> *[David Spisla] Sounds good. Thank you*
>
>
>
> On Wed, Jan 10, 2018 at 6:36 PM, Xavi Hernandez <jaher...@redhat.com>
> wrote:
>
> Hi David,
>
>
>
> On Wed, Jan 10, 2018 at 1:42 PM, David Spisla <david.spi...@iternity.com>
> wrote:
>
> *[David Spisla] I tried this:*
>
> *char *new_path = malloc(1+len_path-5);*
>
> *memcpy(new_path, loc->path, len_path-5);*
>
> *new_path[strlen(new_path)] = '\0';*
>
> *loc->name = new_path + (len_path - len_name);*
>
>
>
> First of all, you should always use memory allocation functions from
> gluster. This includes GF_MALLOC(), gf_strdup(), gf_asprintf() and several
> other variants. You can look at libglusterfs/src/mem-pool.h to see all
> available options.
>
>
>
> The second problem I see is that memcpy() doesn't write a terminating null
> character, so when you compute strlen() afterwards, it will return invalid
> length, or even try to access invalid memory, causing a crash.
>
>
>
> You should do something like this (assuming both loc->path and loc->name
> are not NULL and skipping many necessary checks):
>
>
>
> len_path = strlen(loc->path);
>
> len_name = strlen(loc->name);
>
> new_path = GF_MALLOC(len_path - 4, gf_common_mt_char);
>
> memcpy(new_path, loc->path, len_path - 5);
>
> new_path[len_path - 5] = 0;
>
> loc->name = new_path + len_path - len_name;
>
>
>
> This should work fine.
>
>
>
> Xavi
>
> *[David Spisla] Yes, this works fine. Thank you. By the way, is there a
> way inside a gluster xlator to get access to the xattrs or attributes of a
> file? In the lookup function there is only the struct loc, but I am missing
> the file's gfid there; it seems to be NULL always. I could use
> syncop_getxattr() with the loc parameter, but the gfid is missing. Can I
> get the gfid if I only have loc->path and loc->name? It would be like a
> conversion from the file's path to its gfid.*
>
>
>
> One of the main purposes of the 'lookup' fop is to resolve a given path to
> an existing gfid, so you won't find any gfid in the lookup request (unless
> it's a revalidate request). You need to look at the response (cbk) of the
> lookup to get the real gfid. If the request succeeds, you can find the gfid
> in buf->ia_gfid of the lookup callback.
>
>
>
> Other fops that receive a loc_t structure are normally called after a
> successful lookup, so loc->gfid and/or loc->inode->gfid should be set
> (unfortunately there isn't an homogeneous management of loc_t structures by
> xlators, so not always both fields are set).
>
>
>
> You can also request additional xattrs in the lookup request by adding
> them to the xdata dictionary. Their values will be returned in the xdata
> argument of the lookup callback.
>
>
>
> Xavi
>
> *[David Spisla] Ok, so there is no chance to get that gfid in an initial
> lookup.*
>
> In the request, no. Only revalidate lookups will include the gfid, but the
initial one will only contain a path. You need to look at the answer (in
the cbk of lookup) to determine the gfid.

> *Another problem seems to be that there is no loc parameter in the
> lookup_cbk function. I have the buf->gfid and inode, but there is no loc
> with the path and name of the file.*
>
> In this case, you need to save the path (or the entire loc if you prefer)
when you receive the lookup request and pass it to the cbk to be used
there. To do so, you need to create a data structure that is allocated and
filled when the lookup request is received. Then you can pass this
structure to the cbk in (basically) two ways, as sketched after this list:

1. Attach it to frame->local. You can access it later from cbk using
frame->local.

2. Pass it as a "cookie" in STACK_WIND_COOKIE(). You can access it from cbk
using the 'cookie' argument.
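
As a minimal sketch of the first option (my_lookup, my_local_t and the
reused memory type are illustrative only, error checking is omitted, and a
real xlator would define its own memory type and names):

/* Save the requested loc in frame->local so the cbk can relate the gfid
 * returned in buf->ia_gfid to the original path. Assumes the usual xlator
 * headers are included. */
typedef struct {
    loc_t loc;
} my_local_t;

int32_t
my_lookup_cbk(call_frame_t *frame, void *cookie, xlator_t *this,
              int32_t op_ret, int32_t op_errno, inode_t *inode,
              struct iatt *buf, dict_t *xdata, struct iatt *postparent)
{
    my_local_t *local = frame->local;

    if (op_ret >= 0) {
        /* Both the original path and the resolved gfid are available here. */
        gf_log(this->name, GF_LOG_DEBUG, "path %s resolved to gfid %s",
               local->loc.path, uuid_utoa(buf->ia_gfid));
    }

    frame->local = NULL;
    loc_wipe(&local->loc);
    GF_FREE(local);

    STACK_UNWIND_STRICT(lookup, frame, op_ret, op_errno, inode, buf, xdata,
                        postparent);
    return 0;
}

int32_t
my_lookup(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xdata)
{
    my_local_t *local = GF_CALLOC(1, sizeof(*local), gf_common_mt_char);

    loc_copy(&local->loc, loc);   /* keep a copy for the cbk */
    frame->local = local;

    STACK_WIND(frame, my_lookup_cbk, FIRST_CHILD(this),
               FIRST_CHILD(this)->fops->lookup, loc, xdata);
    return 0;
}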

> *I want to use syncop_getxattr but I have no loc as parameter. Can I get a
> loc struct with the gfid?*
>
> You can manually construct a loc using the inode, gfid and path
information you have. However it's not a good idea to do

Re: [Gluster-devel] Simulating some kind of "virtual file"

2018-01-10 Thread Xavi Hernandez
Hi David,

On Wed, Jan 10, 2018 at 1:42 PM, David Spisla 
wrote:
>
> *[David Spisla] I tried this:*
>
> *char *new_path = malloc(1+len_path-5);*
>
> *memcpy(new_path, loc->path, len_path-5);*
>
> *new_path[strlen(new_path)] = '\0';*
>
> *loc->name = new_path + (len_path - len_name);*
>
>
>
> First of all, you should always use memory allocation functions from
> gluster. This includes GF_MALLOC(), gf_strdup(), gf_asprintf() and several
> other variants. You can look at libglusterfs/src/mem-pool.h to see all
> available options.
>
>
>
> The second problem I see is that memcpy() doesn't write a terminating null
> character, so when you compute strlen() afterwards, it will return invalid
> length, or even try to access invalid memory, causing a crash.
>
>
>
> You should do something like this (assuming both loc->path and loc->name
> are not NULL and skipping many necessary checks):
>
>
>
> len_path = strlen(loc->path);
>
> len_name = strlen(loc->name);
>
> new_path = GF_MALLOC(len_path - 4, gf_common_mt_char);
>
> memcpy(new_path, loc->path, len_path - 5);
>
> new_path[len_path - 5] = 0;
>
> loc->name = new_path + len_path - len_name;
>
>
>
> This should work fine.
>
>
>
> Xavi
>
> *[David Spisla] Yes, this works fine. Thank you. By the way, is there a
> way inside a gluster xlator to get access to the xattrs or attributes of a
> file? In the lookup function there is only the struct loc, but I am missing
> the file's gfid there; it seems to be NULL always. I could use
> syncop_getxattr() with the loc parameter, but the gfid is missing. Can I
> get the gfid if I only have loc->path and loc->name? It would be like a
> conversion from the file's path to its gfid.*
>

One of the main purposes of the 'lookup' fop is to resolve a given path to
an existing gfid, so you won't find any gfid in the lookup request (unless
it's a revalidate request). You need to look at the response (cbk) of the
lookup to get the real gfid. If the request succeeds, you can find the gfid
in buf->ia_gfid of the lookup callback.

Other fops that receive a loc_t structure are normally called after a
successful lookup, so loc->gfid and/or loc->inode->gfid should be set
(unfortunately there isn't an homogeneous management of loc_t structures by
xlators, so not always both fields are set).

You can also request additional xattrs in the lookup request by adding them
to the xdata dictionary. Their values will be returned in the xdata
argument of the lookup callback.

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Simulating some kind of "virtual file"

2018-01-09 Thread Xavi Hernandez
Hi David,

adding again gluster-devel.

On Tue, Jan 9, 2018 at 4:15 PM, David Spisla <david.spi...@iternity.com>
wrote:

> Hello Xavi,
>
>
>
> *From:* Xavi Hernandez [mailto:jaher...@redhat.com]
> *Sent:* Tuesday, January 9, 2018 09:48
> *To:* David Spisla <spisl...@googlemail.com>
> *Cc:* gluster-devel@gluster.org
> *Subject:* Re: [Gluster-devel] Simulating some kind of "virtual file"
>
>
>
> Hi David,
>
>
>
> On Tue, Jan 9, 2018 at 9:09 AM, David Spisla <spisl...@googlemail.com>
> wrote:
>
> Dear Gluster Devels,
>
> at the moment I do some Xlator stuff and I want to know if there is a way
> to simulate the existing of a file to the client. It should be a kind of
> "virtual file". Here are more details.
>
> 1. Client lookup for a file e.g. "apple.test". This file does not exist in
> the backend
>
> $ ls -l apple.test
>
> ls: cannot access apple.test: No such file or directory
>
>
>
> Normally the system will not find that file
>
>
>
> 2. In the backend I have a real file called  e.g. "apple". Now there is a
> Xlator which manipulates the lookup request and is looking for the file
> "apple" instead of "apple.test". Gluster finds the file "apple" and the
> client will get a message from gluster that there is a file called
> "apple.test" with the attributes of the file "apple" (maybe we can
> manipulate that attributes too).
>
>
>
> Intercepting lookup is fine to be able to manipulate "virtual files",
> however it's not enough to completely operate on virtual files. You
> basically need to intercept all file operations that work on a loc or are
> path based and do the translation. You also need to intercept answers from
> readdir and readdirp to do the reverse transformation so that the user sees
> the virtual name and not the real name stored on the bricks.
>
> *[David Spisla] Yes, this is a very good hint. *
>
>
>
>
>
> $ ls -l apple.test
> -rw-r--r-- 1 davids davids 0 Jan  5 15:42 apple.test
>
>
>
> My first idea is, to have a special lookup and lookup_cbk in some Xlator
> in the server stack. Or it is better to have this Xlator in the Client
> Stack?
>
>
>
> That depends on what you are really trying to do. If you don't need any
> information or coordination with other bricks, you can safely create the
> xlator in the server stack.
>
> *[David Spisla] At the moment I do it in the worm xlator. It seems to be
> no problem.*
>
>
>
> The lookup function has a parameter called "loc_t *loc". In a first test I
> tried to manipulate loc->name and loc-path. If I manipulate loc->path I got
> an error and my volume crashed.
>
>
>
> You should be able to modify the loc without causing any crash. However
> there are some details you must be aware of:
>
>- loc->path and/or loc->name can be NULL in some cases.
>- If loc->path and loc->name are not NULL, loc->name always points to
>a substring of loc->path. It's not allocated independently (and so it must
>not be freed).
>- If you change loc->path, you also need to change loc->name (to point
>to the basename of loc->path)
>- You shouldn't change the contents of loc->path directly. It's better
>to allocate a new string with the modified path and assign it to loc->path
>(you need to free the old value of loc->path to avoid memory leaks).
>
> *[David Spisla] I tried this:*
>
> *char *new_path = malloc(1+len_path-5);*
>
> *memcpy(new_path, loc->path, len_path-5);*
>
> *new_path[strlen(new_path)] = '\0';*
>
> *loc->name = new_path + (len_path - len_name);*
>

First of all, you should always use memory allocation functions from
gluster. This includes GF_MALLOC(), gf_strdup(), gf_asprintf() and several
other variants. You can look at libglusterfs/src/mem-pool.h to see all
available options.

The second problem I see is that memcpy() doesn't write a terminating null
character, so when you compute strlen() afterwards, it will return invalid
length, or even try to access invalid memory, causing a crash.

You should do something like this (assuming both loc->path and loc->name
are not NULL and skipping many necessary checks):

len_path = strlen(loc->path);
len_name = strlen(loc->name);

/* Drop the 5 characters of the ".test" suffix; len_path - 4 leaves room
   for the terminating null byte. */
new_path = GF_MALLOC(len_path - 4, gf_common_mt_char);
memcpy(new_path, loc->path, len_path - 5);
new_path[len_path - 5] = 0;

/* Point loc->name at the basename inside the new string. */
loc->name = new_path + len_path - len_name;


This should work fine.

Xavi


>
>
> So, if I do this command:
>
> $ ls -l /test/dir/test1.txt.test
>
> Sometimes it is working but sometimes I got stra

Re: [Gluster-devel] Simulating some kind of "virtual file"

2018-01-09 Thread Xavi Hernandez
Hi David,

On Tue, Jan 9, 2018 at 9:09 AM, David Spisla 
wrote:

> Dear Gluster Devels,
>
> at the moment I do some Xlator stuff and I want to know if there is a way
> to simulate the existing of a file to the client. It should be a kind of
> "virtual file". Here are more details.
>
> 1. Client lookup for a file e.g. "apple.test". This file does not exist in
> the backend
> $ ls -l apple.test
> ls: cannot access apple.test: No such file or directory
>
> Normally the system will not find that file
>
> 2. In the backend I have a real file called  e.g. "apple". Now there is a
> Xlator which manipulates the lookup request and is looking for the file
> "apple" instead of "apple.test". Gluster finds the file "apple" and the
> client will get a message from gluster that there is a file called
> "apple.test" with the attributes of the file "apple" (maybe we can
> manipulate that attributes too).
>

Intercepting lookup is fine for manipulating "virtual files", but it's not
enough to operate on virtual files completely. You basically need to
intercept all file operations that work on a loc or are path-based and do
the translation. You also need to intercept the answers from readdir and
readdirp to do the reverse transformation, so that the user sees the
virtual name and not the real name stored on the bricks.


>
> $ ls -l apple.test
> -rw-r--r-- 1 davids davids 0 Jan  5 15:42 apple.test
>
> My first idea is, to have a special lookup and lookup_cbk in some Xlator
> in the server stack. Or it is better to have this Xlator in the Client
> Stack?
>

That depends on what you are really trying to do. If you don't need any
information or coordination with other bricks, you can safely create the
xlator in the server stack.

The lookup function has a parameter called "loc_t *loc". In a first test I
> tried to manipulate loc->name and loc-path. If I manipulate loc->path I got
> an error and my volume crashed.
>

You should be able to modify the loc without causing any crash. However
there are some details you must be aware of:

   - loc->path and/or loc->name can be NULL in some cases.
   - If loc->path and loc->name are not NULL, loc->name always points to a
   substring of loc->path. It's not allocated independently (and so it must
   not be freed).
   - If you change loc->path, you also need to change loc->name (to point
   to the basename of loc->path)
   - You shouldn't change the contents of loc->path directly. It's better
   to allocate a new string with the modified path and assign it to loc->path
   (you need to free the old value of loc->path to avoid memory leaks).

I think this should be enough to correctly change loc->path and loc->name.

Xavi


> Any hints?
>
> Regards
> David Spisla
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] glusterd crashes on /tests/bugs/replicate/bug-884328.t

2017-12-15 Thread Xavi Hernandez
I've uploaded a patch to fix this problem: https://review.gluster.org/19040

On Fri, Dec 15, 2017 at 11:33 AM, Xavi Hernandez <jaher...@redhat.com>
wrote:

> I've checked the size of 'gluster volume set help' on current master and
> it's 51176 bytes. Only 24 bytes below the size of the buffer.
>
> I think the reason why regression tests fail is that it enables bd xlator,
> which adds some more options that make the help output to grow beyond the
> buffer size.
>
> I'll send a patch to fix the problem.
>
> Xavi
>
> On Fri, Dec 15, 2017 at 10:05 AM, Xavi Hernandez <jaher...@redhat.com>
> wrote:
>
>> On Fri, Dec 15, 2017 at 9:57 AM, Atin Mukherjee <amukh...@redhat.com>
>> wrote:
>>
>>> But why doesn't it crash every time if this is the RCA? None of us could
>>> actually reproduce it locally.
>>>
>>
>> That's a good question. One of my patches has failed and it doesn't add
>> any new option (in fact it's a very trivial change), so I'm not sure why it
>> may or may not crash.
>>
>> I'll analyze it. Anyway, that function needs a patch because there's no
>> space limit check before writing to the buffer.
>>
>> Xavi
>>
>>
>>> On Fri, Dec 15, 2017 at 2:23 PM, Xavi Hernandez <jaher...@redhat.com>
>>> wrote:
>>>
>>>> I've seen this failure in one of my local tests and I've done a quick
>>>> analysis:
>>>>
>>>> (gdb) bt
>>>> #0  0x7ff29e1fce07 in ?? () from /lib64/libgcc_s.so.1
>>>> #1  0x7ff29e1fe9b8 in _Unwind_Backtrace () from
>>>> /lib64/libgcc_s.so.1
>>>> #2  0x7ff2aa9fb458 in backtrace () from /lib64/libc.so.6
>>>> #3  0x7ff2ac14af30 in _gf_msg_backtrace_nomem (level=GF_LOG_ALERT,
>>>> stacksize=200) at logging.c:1128
>>>> #4  0x7ff2ac151170 in gf_print_trace (signum=11, ctx=0xdec260) at
>>>> common-utils.c:762
>>>> #5  0x0040a2c6 in glusterfsd_print_trace (signum=11) at
>>>> glusterfsd.c:2274
>>>> #6  
>>>> #7  0x7ff2ac466751 in _dl_close () from /lib64/ld-linux-x86-64.so.2
>>>> #8  0x7ff2aaa304df in _dl_catch_error () from /lib64/libc.so.6
>>>> #9  0x7ff2ab35f715 in _dlerror_run () from /lib64/libdl.so.2
>>>> #10 0x7ff2ab35f08f in dlclose () from /lib64/libdl.so.2
>>>> #11 0x7ff2a06af786 in glusterd_get_volopt_content
>>>> (ctx=0x7ff298000d88, xml_out=false) at glusterd-utils.c:13150
>>>> #12 0x7ff2a06a2896 in glusterd_volset_help
>>>> (dict=0x70616e732d776f68, op_errstr=0x732e736572757461) at
>>>> glusterd-utils.c:9199
>>>> Backtrace stopped: previous frame inner to this frame (corrupt stack?)
>>>> (gdb) f 11
>>>> #11 0x7ff2a06af786 in glusterd_get_volopt_content
>>>> (ctx=0x7ff298000d88, xml_out=false) at glusterd-utils.c:13150
>>>> 13150   dlclose (dl_handle);
>>>> (gdb) print dl_handle
>>>> $1 = (void *) 0x6978656c7069746c
>>>> (gdb) x/s _handle
>>>> 0x7ff294206500: "ltiplexing feature is disabled.\n\n"
>>>> (gdb)
>>>>
>>>> So I think the problem is a buffer overflow.
>>>>
>>>> Looking at the code in glusterd-utils.c, function
>>>> glusterd_get_volopt_content(), I guess that we are writing too much data
>>>> into output_string, which is a stack defined array of 50 KB, and we have an
>>>> overflow there. Probably the number of options and its description has
>>>> grown beyond this limit.
>>>>
>>>> I'll send a patch for this shortly.
>>>>
>>>> Xavi
>>>>
>>>> On Fri, Dec 15, 2017 at 8:31 AM, Sunny Kumar <sunku...@redhat.com>
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Console log
>>>>> https://build.gluster.org/job/centos6-regression/8021/console
>>>>>
>>>>> Regard
>>>>> Sunny
>>>>>
>>>>> On Fri, Dec 15, 2017 at 12:32 PM, Ravishankar N <
>>>>> ravishan...@redhat.com> wrote:
>>>>> > ...for a lot of patches on master .The crash is in volume set; the
>>>>> .t just
>>>>> > does a volume set help. Can the glusterd devs take a look as it is
>>>>> blocking
>>>>> > merging patches? I have raised BZ 1526268 with the details.
>>>>> >
>>>>> > Thanks!
>>>>> >
>>>>> > Ravi
>>>>> >
>>>>> > ___
>>>>> > Gluster-devel mailing list
>>>>> > Gluster-devel@gluster.org
>>>>> > http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>>> ___
>>>>> Gluster-devel mailing list
>>>>> Gluster-devel@gluster.org
>>>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>>>
>>>>
>>>>
>>>> ___
>>>> Gluster-devel mailing list
>>>> Gluster-devel@gluster.org
>>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>>
>>>
>>>
>>
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] glusterd crashes on /tests/bugs/replicate/bug-884328.t

2017-12-15 Thread Xavi Hernandez
I've checked the size of 'gluster volume set help' on current master and
it's 51176 bytes. Only 24 bytes below the size of the buffer.

I think the reason why the regression tests fail is that they enable the bd
xlator, which adds some more options that make the help output grow beyond
the buffer size.

I'll send a patch to fix the problem.

Xavi

On Fri, Dec 15, 2017 at 10:05 AM, Xavi Hernandez <jaher...@redhat.com>
wrote:

> On Fri, Dec 15, 2017 at 9:57 AM, Atin Mukherjee <amukh...@redhat.com>
> wrote:
>
>> But why doesn't it crash every time if this is the RCA? None of us could
>> actually reproduce it locally.
>>
>
> That's a good question. One of my patches has failed and it doesn't add
> any new option (in fact it's a very trivial change), so I'm not sure why it
> may or may not crash.
>
> I'll analyze it. Anyway, that function needs a patch because there's no
> space limit check before writing to the buffer.
>
> Xavi
>
>
>> On Fri, Dec 15, 2017 at 2:23 PM, Xavi Hernandez <jaher...@redhat.com>
>> wrote:
>>
>>> I've seen this failure in one of my local tests and I've done a quick
>>> analysis:
>>>
>>> (gdb) bt
>>> #0  0x7ff29e1fce07 in ?? () from /lib64/libgcc_s.so.1
>>> #1  0x7ff29e1fe9b8 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
>>> #2  0x7ff2aa9fb458 in backtrace () from /lib64/libc.so.6
>>> #3  0x7ff2ac14af30 in _gf_msg_backtrace_nomem (level=GF_LOG_ALERT,
>>> stacksize=200) at logging.c:1128
>>> #4  0x7ff2ac151170 in gf_print_trace (signum=11, ctx=0xdec260) at
>>> common-utils.c:762
>>> #5  0x0040a2c6 in glusterfsd_print_trace (signum=11) at
>>> glusterfsd.c:2274
>>> #6  
>>> #7  0x7ff2ac466751 in _dl_close () from /lib64/ld-linux-x86-64.so.2
>>> #8  0x7ff2aaa304df in _dl_catch_error () from /lib64/libc.so.6
>>> #9  0x7ff2ab35f715 in _dlerror_run () from /lib64/libdl.so.2
>>> #10 0x7ff2ab35f08f in dlclose () from /lib64/libdl.so.2
>>> #11 0x7ff2a06af786 in glusterd_get_volopt_content
>>> (ctx=0x7ff298000d88, xml_out=false) at glusterd-utils.c:13150
>>> #12 0x7ff2a06a2896 in glusterd_volset_help (dict=0x70616e732d776f68,
>>> op_errstr=0x732e736572757461) at glusterd-utils.c:9199
>>> Backtrace stopped: previous frame inner to this frame (corrupt stack?)
>>> (gdb) f 11
>>> #11 0x7ff2a06af786 in glusterd_get_volopt_content
>>> (ctx=0x7ff298000d88, xml_out=false) at glusterd-utils.c:13150
>>> 13150   dlclose (dl_handle);
>>> (gdb) print dl_handle
>>> $1 = (void *) 0x6978656c7069746c
>>> (gdb) x/s _handle
>>> 0x7ff294206500: "ltiplexing feature is disabled.\n\n"
>>> (gdb)
>>>
>>> So I think the problem is a buffer overflow.
>>>
>>> Looking at the code in glusterd-utils.c, function
>>> glusterd_get_volopt_content(), I guess that we are writing too much data
>>> into output_string, which is a stack defined array of 50 KB, and we have an
>>> overflow there. Probably the number of options and its description has
>>> grown beyond this limit.
>>>
>>> I'll send a patch for this shortly.
>>>
>>> Xavi
>>>
>>> On Fri, Dec 15, 2017 at 8:31 AM, Sunny Kumar <sunku...@redhat.com>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> Console log
>>>> https://build.gluster.org/job/centos6-regression/8021/console
>>>>
>>>> Regard
>>>> Sunny
>>>>
>>>> On Fri, Dec 15, 2017 at 12:32 PM, Ravishankar N <ravishan...@redhat.com>
>>>> wrote:
>>>> > ...for a lot of patches on master .The crash is in volume set; the .t
>>>> just
>>>> > does a volume set help. Can the glusterd devs take a look as it is
>>>> blocking
>>>> > merging patches? I have raised BZ 1526268 with the details.
>>>> >
>>>> > Thanks!
>>>> >
>>>> > Ravi
>>>> >
>>>> > ___
>>>> > Gluster-devel mailing list
>>>> > Gluster-devel@gluster.org
>>>> > http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>> ___
>>>> Gluster-devel mailing list
>>>> Gluster-devel@gluster.org
>>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>>
>>>
>>>
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>
>>
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] glusterd crashes on /tests/bugs/replicate/bug-884328.t

2017-12-15 Thread Xavi Hernandez
On Fri, Dec 15, 2017 at 9:57 AM, Atin Mukherjee <amukh...@redhat.com> wrote:

> But why doesn't it crash every time if this is the RCA? None of us could
> actually reproduce it locally.
>

That's a good question. One of my patches has failed and it doesn't add any
new option (in fact it's a very trivial change), so I'm not sure why it may
or may not crash.

I'll analyze it. Anyway, that function needs a patch because there's no
space limit check before writing to the buffer.
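
Just to illustrate the kind of fix I mean (this is not the actual patch),
each append to the fixed-size buffer should track the remaining space, for
example:

/* Illustrative only: append text to a fixed-size buffer, tracking the
 * current offset and refusing to write past the end instead of silently
 * overflowing a stack array. */
#include <stdio.h>

static int
append_text(char *buf, size_t size, size_t *off, const char *text)
{
    int len;

    if (*off >= size) {
        return -1;                        /* buffer already full */
    }

    len = snprintf(buf + *off, size - *off, "%s", text);
    if (len < 0 || (size_t)len >= size - *off) {
        return -1;                        /* would overflow: stop or grow */
    }

    *off += len;
    return 0;
}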

Xavi


> On Fri, Dec 15, 2017 at 2:23 PM, Xavi Hernandez <jaher...@redhat.com>
> wrote:
>
>> I've seen this failure in one of my local tests and I've done a quick
>> analysis:
>>
>> (gdb) bt
>> #0  0x7ff29e1fce07 in ?? () from /lib64/libgcc_s.so.1
>> #1  0x7ff29e1fe9b8 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
>> #2  0x7ff2aa9fb458 in backtrace () from /lib64/libc.so.6
>> #3  0x7ff2ac14af30 in _gf_msg_backtrace_nomem (level=GF_LOG_ALERT,
>> stacksize=200) at logging.c:1128
>> #4  0x7ff2ac151170 in gf_print_trace (signum=11, ctx=0xdec260) at
>> common-utils.c:762
>> #5  0x0040a2c6 in glusterfsd_print_trace (signum=11) at
>> glusterfsd.c:2274
>> #6  
>> #7  0x7ff2ac466751 in _dl_close () from /lib64/ld-linux-x86-64.so.2
>> #8  0x7ff2aaa304df in _dl_catch_error () from /lib64/libc.so.6
>> #9  0x7ff2ab35f715 in _dlerror_run () from /lib64/libdl.so.2
>> #10 0x7ff2ab35f08f in dlclose () from /lib64/libdl.so.2
>> #11 0x7ff2a06af786 in glusterd_get_volopt_content
>> (ctx=0x7ff298000d88, xml_out=false) at glusterd-utils.c:13150
>> #12 0x7ff2a06a2896 in glusterd_volset_help (dict=0x70616e732d776f68,
>> op_errstr=0x732e736572757461) at glusterd-utils.c:9199
>> Backtrace stopped: previous frame inner to this frame (corrupt stack?)
>> (gdb) f 11
>> #11 0x7ff2a06af786 in glusterd_get_volopt_content
>> (ctx=0x7ff298000d88, xml_out=false) at glusterd-utils.c:13150
>> 13150   dlclose (dl_handle);
>> (gdb) print dl_handle
>> $1 = (void *) 0x6978656c7069746c
>> (gdb) x/s _handle
>> 0x7ff294206500: "ltiplexing feature is disabled.\n\n"
>> (gdb)
>>
>> So I think the problem is a buffer overflow.
>>
>> Looking at the code in glusterd-utils.c, function
>> glusterd_get_volopt_content(), I guess that we are writing too much data
>> into output_string, which is a stack defined array of 50 KB, and we have an
>> overflow there. Probably the number of options and its description has
>> grown beyond this limit.
>>
>> I'll send a patch for this shortly.
>>
>> Xavi
>>
>> On Fri, Dec 15, 2017 at 8:31 AM, Sunny Kumar <sunku...@redhat.com> wrote:
>>
>>> +1
>>>
>>> Console log
>>> https://build.gluster.org/job/centos6-regression/8021/console
>>>
>>> Regard
>>> Sunny
>>>
>>> On Fri, Dec 15, 2017 at 12:32 PM, Ravishankar N <ravishan...@redhat.com>
>>> wrote:
>>> > ...for a lot of patches on master .The crash is in volume set; the .t
>>> just
>>> > does a volume set help. Can the glusterd devs take a look as it is
>>> blocking
>>> > merging patches? I have raised BZ 1526268 with the details.
>>> >
>>> > Thanks!
>>> >
>>> > Ravi
>>> >
>>> > ___
>>> > Gluster-devel mailing list
>>> > Gluster-devel@gluster.org
>>> > http://lists.gluster.org/mailman/listinfo/gluster-devel
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Message id's for components

2017-12-12 Thread Xavi Hernandez
Hi,

I've uploaded a patch [1] to change the way used to reserve a range of
messages to components and to define message id's inside a component.

The old method was error-prone because adding a new component required
defining some macros based on previous macros (in fact, there was already an
invalid definition for one component). Defining the message id's for a
component also required more than one modification, and in some cases it was
not done correctly.

The new patch defines the message constants using an enum, which automates
all the assignments and prevents most of these errors. It also defines a
couple of macros that make it even easier to create the component ranges and
message id's.
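
Purely as an illustration of the idea (the names and values below are made
up; they are not the actual macros or ranges from the patch):

/* Hypothetical example: the compiler assigns consecutive ids inside a
 * reserved range, so adding a new message can't collide with or renumber
 * the existing ones. */
#define MY_COMP_MSGID_BASE 123000   /* hypothetical reserved range start */

enum my_component_msgids {
    MY_MSG_START = MY_COMP_MSGID_BASE,
    MY_MSG_VOLUME_NOT_FOUND,        /* MY_COMP_MSGID_BASE + 1 */
    MY_MSG_INVALID_OPTION,          /* MY_COMP_MSGID_BASE + 2 */
    MY_MSG_END                      /* end of the ids used by the component */
};

Adding a new message id only requires adding one line before MY_MSG_END.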

Any feedback will be appreciated.

Regards,

Xavi

[1] https://review.gluster.org/19029
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Selfheal on mount process (disperse)

2017-11-15 Thread Xavi Hernandez
Hi,

On Wed, Nov 15, 2017 at 6:19 AM, jayakrishnan mm 
wrote:

> Hi,
>
> Glusterfs ver 3.7.10
> Volume : disperse (4+2)
> Client on separate machine.
> 1 brick offline.
> Error  happens after about 60 seconds of starting write. When checked the
> online brick's
> .glusterfs/indices/xattrop  , I could see a gfid entry.
>

That's normal. When a brick is down, the other bricks keep a mark
indicating that the file has been modified and needs to be repaired.


>
> Why the mount process starts healing ? How to prevent this ? When checked
> the source code, (ec-heald.c) I can see the that this dir is scanned every
> 60 sec. If it finds an entry, it starts healing. But why should the client
> do this ? Is there an option to turn off selfheal on the client side ?
>

The periodic check should only be done by the self-heal daemon, not by
clients. Clients only try to heal files when they are accessed by a user, to
repair them faster and on demand. There have been some improvements in
self-heal detection to avoid cases where self-healing was triggered more
often than necessary, but those patches are only present starting with 3.10
(3.7 is already EOL).

On 3.7 there's an option called 'disperse.background-heals' that can be set
to 0 to avoid client side self-heals.

Anyway, could you attach the log file so we can see the error you are
getting ?

Xavi


> Regards
> JK
>
>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Gluster Summit Discussion: Time taken for regression tests

2017-11-08 Thread Xavi Hernandez
One thing we could do with some of the tests I know well is to remove a few
of them.

EC currently runs the same tests on multiple volume configurations (2+1,
3+1, 4+1, 3+2, 4+2, 4+3 and 8+4). I think we could reduce that to two common
configurations (2+1 and 4+2) and one or two special configurations (3+1
and/or 3+2). This would remove the biggest ones, which take most of the
time, while still keeping the basic things tested.

Xavi

On 3 November 2017 at 17:50, Amar Tumballi  wrote:

> All,
>
> While we discussed many other things, we also discussed about reducing
> time taken for the regression jobs. As it stands now, it take around 5hr
> 40mins to complete a single run.
>
> There were many suggestions:
>
>
>- Run them in parallel (as each .t test is independent of each other)
>- Revisit the tests taking long time (20 tests take almost 6000
>seconds as of now).
>- See if we can run the tests in docker (but the issue is the machines
>we have are of 2cores, so there may not be much gain)
>
>
> There are other suggestions as well:
>
>
>- Spend effort and see if there are repeated steps, and merge the
>tests.
>   - Most of the time is spent in starting the processes and cleaning
>   up.
>   - Most of the tests run the similar volume create command
>   (depending on the volume type), and run few different type of I/O in
>   different tests.
>   - Try to see if these things can be merged.
>   - Most of the bug-fix .t files belong to this category too.
>- Classify the tests specific to few non-overlapping volume types and
>depending on the changeset in the patch (based on the files changed) decide
>which are the groups to run.
>   - For example, you can't have replicate and disperse volume type
>   together.
>
>
> 
>
> More ideas and suggestions welcome.
>
>
> Regards,
> Amar
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] About GF_ASSERT() macro

2017-11-03 Thread Xavi Hernandez
Hi all,

I've seen that the GF_ASSERT() macro is defined in different ways depending
on whether we are building in debug mode or not.

In debug mode, it's an alias of assert(), but in non-debug mode it simply
logs an error message and continues.
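
Roughly (paraphrased, not the exact definitions from the code), it looks
like this:

#include <assert.h>

#ifdef DEBUG
#define GF_ASSERT(x) assert(x)              /* debug build: abort */
#else
#define GF_ASSERT(x)                                                        \
    do {                                                                    \
        if (!(x)) {                                                         \
            gf_log_callingfn("", GF_LOG_ERROR, "Assertion failed: " #x);    \
        }                                                                   \
    } while (0)        /* production build: log an error and continue */
#endif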

I think an assert should be a critical check that must always be true,
especially in production code. Allowing the program to continue after one of
these checks fails is dangerous. Most probably it will crash later, losing
information about the real cause of the error. But even if it doesn't crash,
some internal data will be invalid, leading to bad behavior.

I think we should always terminate the process if an assertion fails, even
in production-level code. If the developer doesn't consider a failure that
critical, they should not use GF_ASSERT() and should instead just write a
log message or use one of the other condition-check macros.

Thoughts ?

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Feature proposal: xlator to optimize heal and rebalance operations

2017-11-02 Thread Xavi Hernandez
Hi all,

I've created a new GitHub issue [1] to discuss an idea to optimize
self-heal and rebalance operations by not requiring a lock to be taken
during data operations.

Any thoughts will be welcome.

Regards,

Xavi

[1] https://github.com/gluster/glusterfs/issues/347
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] String manipulation

2017-11-02 Thread Xavi Hernandez
Hi all,

Several times I've seen issues with the way strings are handled in many
parts of the code. Sometimes it's an incorrect use of some functions, like
strncat(). Other times it's a lack of error-condition checks, a failure to
allocate the right amount of memory, or even the creation of a big array on
the stack.

Maybe we should create a set of library functions for working with strings
that hides all these details and makes string manipulation easier (and less
error-prone). I have something I wrote some time ago that I can adapt to
gluster.

On top of that we could expand it by adding path manipulation functions and
string parsing features.
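
As a purely illustrative sketch of the kind of helper I mean (this is not
the code mentioned above), a small growable string type could look like
this:

/* Illustrative only: a tiny growable string that hides allocation and
 * bounds checking from the caller. A gluster version would use GF_REALLOC
 * and a proper memory type instead of plain realloc. */
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *data;
    size_t len;
    size_t size;
} str_t;

static int
str_append(str_t *s, const char *text)
{
    size_t tlen = strlen(text);
    size_t needed = s->len + tlen + 1;

    if (needed > s->size) {
        size_t newsize = s->size ? s->size * 2 : 64;
        while (newsize < needed) {
            newsize *= 2;
        }
        char *tmp = realloc(s->data, newsize);
        if (tmp == NULL) {
            return -1;              /* the caller sees the failure */
        }
        s->data = tmp;
        s->size = newsize;
    }

    memcpy(s->data + s->len, text, tlen + 1);   /* includes the final '\0' */
    s->len += tlen;

    return 0;
}

With something like this, callers never deal with buffer sizes directly and
error checking happens in a single place.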

Do you think it's worth it ?

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel
