Re: [Gluster-users] Lots of connections on clients - appropriate values for various thread parameters

2019-03-28 Thread Raghavendra Gowdappa
+Gluster-users 

Sorry about the delay. There is nothing suspicious about the per-thread CPU
utilization of the glusterfs process. However, looking at the volume profile
attached, I see a huge number of lookups. I think if we cut down the number of
lookups we'll probably see improvements in performance. I need the following
information (example commands follow the list):

* dump of fuse traffic under heavy load (use --dump-fuse option while
mounting)
* client volume profile for the duration of heavy load -
https://docs.gluster.org/en/latest/Administrator%20Guide/Performance%20Testing/
* corresponding brick volume profile
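
For reference, a rough sketch of how that data could be collected - the volume
name VOLNAME, the server name and the output paths below are placeholders, not
taken from your setup:

# 1) mount with fuse traffic dumping enabled
glusterfs --volfile-server=SERVER --volfile-id=VOLNAME \
    --dump-fuse=/var/tmp/fuse-dump.bin /mnt/glusterfs

# 2) client-side profile (io-stats dump on the mount, per the Performance
#    Testing guide linked above)
setfattr -n trusted.io-stats-dump -v /var/tmp/client-profile.txt /mnt/glusterfs

# 3) brick-side profile for the same heavy-load window (run on a server)
gluster volume profile VOLNAME start
# ... let the heavy workload run for a few minutes ...
gluster volume profile VOLNAME info > /var/tmp/brick-profile.txt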

Basically, I need to find out:
* whether these lookups are on existing files or non-existent files
* whether they are on directories or files
* why/whether md-cache, the kernel attribute cache, or nl-cache will help to
cut down lookups (example tunables follow).
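
For context, if caching turns out to be the right fix, the tuning involved
would be along these lines (illustrative only; VOLNAME is a placeholder and the
timeouts depend on the workload):

# md-cache with upcall-based invalidation
gluster volume set VOLNAME features.cache-invalidation on
gluster volume set VOLNAME features.cache-invalidation-timeout 600
gluster volume set VOLNAME performance.cache-invalidation on
gluster volume set VOLNAME performance.md-cache-timeout 600
gluster volume set VOLNAME network.inode-lru-limit 200000

# negative-lookup cache, useful when many lookups target non-existent files
gluster volume set VOLNAME performance.nl-cache on

# the kernel attribute/entry cache is controlled via fuse mount options, e.g.
# attribute-timeout=60,entry-timeout=60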

regards,
Raghavendra

On Mon, Mar 25, 2019 at 12:13 PM Hu Bert  wrote:

> Hi Raghavendra,
>
> sorry, this took a while. In the last weeks the weather was bad -> less
> traffic, but this weekend there was a massive peak. I made 3 profiles
> with top, but at first glance there's nothing special here.
>
> I also made a gluster profile (on one of the servers) at a later
> moment. Maybe that helps. I also added some Munin graphs from 2 of
> the clients and 1 graph of the server network, just to show how massive
> the problem is.
>
> Just wondering if the high I/O wait is related to the high network
> traffic bug (https://bugzilla.redhat.com/show_bug.cgi?id=1673058); if
> so, I could deactivate performance.quick-read and check if there is
> less iowait. If that helps: wonderful - and eagerly awaiting
> updated packages (e.g. v5.6). If not: maybe we have to switch from our
> normal 10TB HDDs (RAID 10) to SSDs, if the problem comes down to slow
> hardware for this small-file (images) use case.
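
(For reference, quick-read can be toggled per volume; the volume name below is
a placeholder:)

gluster volume set VOLNAME performance.quick-read off
gluster volume set VOLNAME performance.quick-read on   # to undo the test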
>
>
> Thx,
> Hubert
>
> On Mon, 4 Mar 2019 at 16:59, Raghavendra Gowdappa
> wrote:
> >
> > Were you seeing high I/O wait when you captured the top output? I guess
> not, as you mentioned the load increases during the weekend. Please note that
> this data has to be captured when you are experiencing problems.
> >
> > On Mon, Mar 4, 2019 at 8:02 PM Hu Bert  wrote:
> >>
> >> Hi,
> >> sending the link directly to you and not the list; you can distribute
> >> it if necessary. The command ran for about half a minute. Is that enough?
> >> More? Less?
> >>
> >> https://download.outdooractive.com/top.output.tar.gz
> >>
> >> On Mon, 4 Mar 2019 at 15:21, Raghavendra Gowdappa
> >> wrote:
> >> >
> >> >
> >> >
> >> > On Mon, Mar 4, 2019 at 7:47 PM Raghavendra Gowdappa <
> rgowd...@redhat.com> wrote:
> >> >>
> >> >>
> >> >>
> >> >> On Mon, Mar 4, 2019 at 4:26 PM Hu Bert 
> wrote:
> >> >>>
> >> >>> Hi Raghavendra,
> >> >>>
> >> >>> at the moment iowait and CPU consumption are quite low; the main
> >> >>> problems appear during the weekend (high traffic, especially on
> >> >>> Sunday), so either we have to wait until next Sunday or use a time
> >> >>> machine ;-)
> >> >>>
> >> >>> I made a screenshot of top (https://abload.de/img/top-hvvjt2.jpg)
> and
> >> >>> a text output (https://pastebin.com/TkTWnqxt), maybe that helps.
> Seems
> >> >>> like processes like glfs_fuseproc (>204h) and glfs_epoll (64h for
> each
> >> >>> process) consume a lot of CPU (uptime 24 days). Is that already
> >> >>> helpful?
> >> >>
> >> >>
> >> >> Not much. The TIME field just shows the amount of time the thread has
> been executing. Since it's a long-standing mount, we can expect such large
> values. But the value itself doesn't indicate whether the thread
> was overloaded during any particular interval(s).
> >> >>
> >> >> Can you please collect output of following command and send back the
> collected data?
> >> >>
> >> >> # top -bHd 3 > top.output
> >> >
> >> >
> >> > Please collect this on problematic mounts and bricks.
> >> >
> >> >>
> >> >>>
> >> >>>
> >> >>> Hubert
> >> >>>
> >> >>> On Mon, 4 Mar 2019 at 11:31, Raghavendra Gowdappa
> >> >>> wrote:
> >> >>> >
> >> >>> > What is the per-thread CPU usage like on these clients? With
> highly concurrent workloads we've seen the single thread that reads requests
> from /dev/fuse (the fuse reader thread) become a bottleneck. I'd like to know
> what the CPU usage of this thread looks like (you can use top -H).
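
(A minimal way to capture that, assuming a single glusterfs client process on
the box and an arbitrary output file name:)

top -bHd 3 -n 100 -p $(pidof glusterfs) > /var/tmp/top-threads.txt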
> >> >>> >
> >> >>> > On Mon, Mar 4, 2019 at 3:39 PM Hu Bert 
> wrote:
> >> >>> >>
> >> >>> >> Good morning,
> >> >>> >>
> >> >>> >> we use gluster v5.3 (replicate with 3 servers, 2 volumes, RAID 10 as
> >> >>> >> brick) with currently 10 clients; 3 of them do heavy I/O
> >> >>> >> operations (Apache Tomcats, read+write of (small) images). These 3
> >> >>> >> clients have quite a high I/O wait (stats from yesterday), as can be
> >> >>> >> seen here:
> >> >>> >>
> >> >>> >> client: https://abload.de/img/client1-cpu-dayulkza.png
> >> >>> >> server: https://abload.de/img/server1-cpu-dayayjdq.png
> >> >>> >>
> >> >>> >> The iowait in the graphs differs a lot. I checked netstat for
> the
> >> >>> >> 

Re: [Gluster-users] Inconsistent issues with a client

2019-03-28 Thread Nithya Balachandran
Hi,

If you know which directories are problematic, please check and see if the
permissions on them are correct on the individual bricks.
Please also provide the following (example commands below the list):

   - *gluster volume info* for the volume
   - The gluster version you are running
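
For example (the volume name and brick path are placeholders):

gluster volume info VOLNAME
gluster --version
# on each server, check a problematic directory directly on the brick
stat /bricks/brick1/VOLNAME/path/to/problem/dir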


regards,
Nithya

On Wed, 27 Mar 2019 at 19:10, Tami Greene  wrote:

> The system is a 5 server, 20 brick distributed system with a hardware
> configured RAID 6 underneath with xfs as filesystem.  This client is a data
> collection node which transfers data to specific directories within one of
> the gluster volumes.
>
>
>
> I have a client with submounted directories (glustervolume/project) rather
> than the entire volume.  Some files can be transferred no problem, but
> others send an error about transport endpoint not connected.  The transfer
> is handled by an rsync script triggered as a cron job.
>
>
>
> When remotely connected to this client, user access to these files does
> not always behave as the permissions are set – 2770 for directories and 440
> for files.  Owners are not always able to move the files, processes run as
> the owners are not always able to move files; root is not always allowed to
> move or delete these files.
>
>
>
> This process seemed to work smoothly before adding another server and 4
> storage bricks to the volume, though logs indicate there were intermittent
> issues at least a month before the last server was added.  A new collection
> device has been streaming to this one machine, and the issue started the day
> before.
>
>
>
> Is there another level for permissions and ownership that I am not aware
> of that needs to be sync’d?
>
>
> --
> Tami
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] [Event CfP Announce] DevConf events India and US in the month of August 2019

2019-03-28 Thread Sankarshan Mukhopadhyay
2 editions of DevConf have their CfPs open

[1] DevConf India : https://devconf.info/in (event dates 02, 03 Aug
2019, Bengaluru)
[2] DevConf USA : https://devconf.info/us/ (event dates 15 -17 Aug,
2019, Boston)

The DevConf events are well curated to attract a good mix of developers
and users. This note is to raise awareness and encourage submissions of
talks around Gluster, containerized storage, and similar topics.
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Geo-replication status always on 'Created'

2019-03-28 Thread Maurya M
Hi,
 In my glusterd.log I am seeing these error messages; are they related to the
patch I applied, or do I need to open a new thread?

 I [MSGID: 106327] [glusterd-geo-rep.c:4483:glusterd_read_status_file]
0-management: Using passed config
template(/var/lib/glusterd/geo-replication/vol_75a5fd373d88ba687f591f3353fa05cf_172.16.201.35_vol_e783a730578e45ed9d51b9a80df6c33f/gsyncd.conf).
[2019-03-28 10:39:29.493554] E [MSGID: 106293]
[glusterd-geo-rep.c:679:glusterd_query_extutil_generic] 0-management:
reading data from child failed
[2019-03-28 10:39:29.493589] E [MSGID: 106305]
[glusterd-geo-rep.c:4377:glusterd_fetch_values_from_config] 0-management:
Unable to get configuration data for
vol_75a5fd373d88ba687f591f3353fa05cf(master), 172.16.201.35:
:vol_e783a730578e45ed9d51b9a80df6c33f(slave)
[2019-03-28 10:39:29.493617] E [MSGID: 106328]
[glusterd-geo-rep.c:4517:glusterd_read_status_file] 0-management: Unable to
fetch config values for vol_75a5fd373d88ba687f591f3353fa05cf(master),
172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f(slave). Trying default
config template
[2019-03-28 10:39:29.553846] E [MSGID: 106328]
[glusterd-geo-rep.c:4525:glusterd_read_status_file] 0-management: Unable to
fetch config values for vol_75a5fd373d88ba687f591f3353fa05cf(master),
172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f(slave)
[2019-03-28 10:39:29.553836] E [MSGID: 106293]
[glusterd-geo-rep.c:679:glusterd_query_extutil_generic] 0-management:
reading data from child failed
[2019-03-28 10:39:29.553844] E [MSGID: 106305]
[glusterd-geo-rep.c:4377:glusterd_fetch_values_from_config] 0-management:
Unable to get configuration data for
vol_75a5fd373d88ba687f591f3353fa05cf(master), 172.16.201.35:
:vol_e783a730578e45ed9d51b9a80df6c33f(slave)

Also, while doing a status call, I am not seeing one of the nodes which was
reporting 'Passive' before (I did not change any configuration). Any ideas on
how to troubleshoot this?
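
For reference, the usual first checks for a geo-rep session in this state are
the session status/config and the per-node gsyncd.log (the volume names below
are taken from the log snippet above; the exact log path can differ per setup):

gluster volume geo-replication vol_75a5fd373d88ba687f591f3353fa05cf \
    172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f status detail
gluster volume geo-replication vol_75a5fd373d88ba687f591f3353fa05cf \
    172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f config
less /var/log/glusterfs/geo-replication/vol_75a5fd373d88ba687f591f3353fa05cf_172.16.201.35_vol_e783a730578e45ed9d51b9a80df6c33f/gsyncd.log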

thanks for your help.

Maurya

On Tue, Mar 26, 2019 at 8:34 PM Aravinda  wrote:

> Please check error message in gsyncd.log file in
> /var/log/glusterfs/geo-replication/
>
> On Tue, 2019-03-26 at 19:44 +0530, Maurya M wrote:
> > Hi Arvind,
> >  I have patched my setup with your fix and re-run the setup, but this time
> > I am getting a different error where it failed to commit the ssh-port on
> > my other 2 nodes on the master cluster, so I manually copied the:
> > [vars]
> > ssh-port = 
> >
> > into gsyncd.conf
> >
> > and the status reported back is as shown below.  Any ideas how to
> > troubleshoot this?
> >
> > MASTER NODE: 172.16.189.4
> >   MASTER VOL:    vol_75a5fd373d88ba687f591f3353fa05cf
> >   MASTER BRICK:  /var/lib/heketi/mounts/vg_aee3df7b0bb2451bc00a73358c5196a2/brick_116fb9427fb26f752d9ba8e45e183cb1/brick
> >   SLAVE USER:    root
> >   SLAVE:         172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f
> >   SLAVE NODE:    172.16.201.4
> >   STATUS:        Passive
> >   CRAWL STATUS:  N/A
> >   LAST_SYNCED:   N/A
> >
> > MASTER NODE: 172.16.189.35
> >   MASTER VOL:    vol_75a5fd373d88ba687f591f3353fa05cf
> >   MASTER BRICK:  /var/lib/heketi/mounts/vg_05708751110fe60b3e7da15bdcf6d4d4/brick_266bb08f0d466d346f8c0b19569736fb/brick
> >   SLAVE USER:    root
> >   SLAVE:         172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f
> >   SLAVE NODE:    N/A
> >   STATUS:        Faulty
> >   CRAWL STATUS:  N/A
> >   LAST_SYNCED:   N/A
> >
> > MASTER NODE: 172.16.189.66
> >   MASTER VOL:    vol_75a5fd373d88ba687f591f3353fa05cf
> >   MASTER BRICK:  /var/lib/heketi/mounts/vg_4b92a2b687e59b7311055d3809b77c06/brick_dfa44c9380cdedac708e27e2c2a443a0/brick
> >   SLAVE USER:    root
> >   SLAVE:         172.16.201.35::vol_e783a730578e45ed9d51b9a80df6c33f
> >   SLAVE NODE:    N/A
> >   STATUS:        Initializing...
> >   CRAWL STATUS:  N/A
> >   LAST_SYNCED:   N/A
> >
> >
> >
> >
> > On Tue, Mar 26, 2019 at 1:40 PM Aravinda  wrote:
> > > I got a chance to investigate this issue further, identified an issue
> > > with the Geo-replication config set, and sent a patch to fix it.
> > >
> > > BUG: https://bugzilla.redhat.com/show_bug.cgi?id=1692666
> > > Patch: https://review.gluster.org/22418
> > >
> > > On Mon, 2019-03-25 at 15:37 +0530, Maurya M wrote:
> > > > ran this command :  ssh -p  -i /var/lib/glusterd/geo-
> > > > replication/secret.pem root@gluster volume info --
> > > xml
> > > >
> > > > attaching the output.
> > > >
> > > >
> > > >
> > > > On Mon, Mar 25, 2019 at 2:13 PM Aravinda 
> > > wrote:
> > > > > Geo-rep is running `ssh -i /var/lib/glusterd/geo-
> > > > > replication/secret.pem
> > > > > root@ gluster volume info --xml` and parsing its
> > > output.
> > > > > Please try to run the command from the same node and let us know
> > > > > the output.
> > > > >
> > > > >
> > > > > On Mon, 2019-03-25 at 11:43 +0530, Maurya M wrote:
> > > > > > Now the error is on the 

Re: [Gluster-users] Gluster GEO replication fault after write over nfs-ganesha

2019-03-28 Thread Soumya Koduri




On 3/27/19 7:39 PM, Alexey Talikov wrote:

I have two clusters with dispersed volumes (2+1) with geo-replication.
It works fine as long as I use glusterfs-fuse, but as soon as even one file is
written over nfs-ganesha, replication goes to Faulty and recovers only after I
remove this file (sometimes after a stop/start).
I think nfs-ganesha writes the file in some way that causes a problem with
replication.




I am not very familiar with geo-rep and am not sure what exactly failed
here or why. I request Kotresh (cc'ed) to take a look and provide his insights
on the issue.


Thanks,
Soumya

OSError: [Errno 61] No data available:
'.gfid/9c9514ce-a310-4a1c-a87b-a800a32a99f8'


but if I check over glusterfs mounted with aux-gfid-mount:

getfattr -n trusted.glusterfs.pathinfo -e text /mnt/TEST/.gfid/9c9514ce-a310-4a1c-a87b-a800a32a99f8
getfattr: Removing leading '/' from absolute path names
# file: mnt/TEST/.gfid/9c9514ce-a310-4a1c-a87b-a800a32a99f8
trusted.glusterfs.pathinfo="( ( ))"


So the file exists.
Details are available here: https://github.com/nfs-ganesha/nfs-ganesha/issues/408
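
For completeness, the same gfid can also be checked directly on a brick's
backend, where gluster keeps a link under .glusterfs/<xx>/<yy>/<full-gfid>
(the brick path below is a placeholder):

ls -l /bricks/brick1/TEST/.glusterfs/9c/95/9c9514ce-a310-4a1c-a87b-a800a32a99f8
getfattr -d -m . -e hex /bricks/brick1/TEST/.glusterfs/9c/95/9c9514ce-a310-4a1c-a87b-a800a32a99f8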




___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] glusterfs 4.1.7 + nfs-ganesha 2.7.1 freeze during write

2019-03-28 Thread Soumya Koduri



On 2/8/19 11:53 AM, Soumya Koduri wrote:



On 2/8/19 3:20 AM, Maurits Lamers wrote:

Hi,



[2019-02-07 10:11:24.812606] E [MSGID: 104055] 
[glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall 
event_type(1) and gfid(yøêÙ Mz„–îSL4_@) failed
[2019-02-07 10:11:24.819376] E [MSGID: 104055] 
[glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall 
event_type(1) and gfid(eTnôEU«H.) failed
[2019-02-07 10:11:24.833299] E [MSGID: 104055] 
[glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall 
event_type(1) and gfid(gÇLÁèFà»0bЯk) failed
[2019-02-07 10:25:01.642509] C 
[rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-2: 
server [node1]:49152 has not responded in the last 42 seconds, 
disconnecting.
[2019-02-07 10:25:01.642805] C 
[rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-1: 
server [node2]:49152 has not responded in the last 42 seconds, 
disconnecting.
[2019-02-07 10:25:01.642946] C 
[rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-4: 
server [node3]:49152 has not responded in the last 42 seconds, 
disconnecting.
[2019-02-07 10:25:02.643120] C 
[rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-3: 
server 127.0.1.1:49152 has not responded in the last 42 seconds, 
disconnecting.
[2019-02-07 10:25:02.643314] C 
[rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-0: 
server [node4]:49152 has not responded in the last 42 seconds, 
disconnecting.


Strange that the synctask failed. Could you please turn off the
features.cache-invalidation volume option and check if the issue
still persists.
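
(For reference, with the volume name as a placeholder:)

gluster volume set VOLNAME features.cache-invalidation off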




Turning the cache invalidation option off seems to have solved the 
freeze. Still testing, but it looks promising.




If that's the case, please turn the cache invalidation option back on and
collect a couple of stack traces (using gstack) when the system freezes
again.
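
(A rough way to grab those during a freeze; this assumes gstack from the gdb
package and the standard ganesha.nfsd process name:)

gstack $(pidof ganesha.nfsd) > /var/tmp/ganesha-stack-$(date +%s).txt
# repeat a few times, a few seconds apart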


FYI - I got a chance to reproduce and root-cause the issue [1], and posted a
fix for review upstream [2].


Thanks,
Soumya

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1693575
[2] https://review.gluster.org/22436



Thanks,
Soumya

cheers

Maurits


___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Prioritise local bricks for IO?

2019-03-28 Thread Nithya Balachandran
On Wed, 27 Mar 2019 at 20:27, Poornima Gurusiddaiah 
wrote:

> This feature is not under active development as it was not widely used.
> AFAIK it is not a supported feature.
> +Nithya +Raghavendra for further clarifications.
>

This is not actively supported  - there has been no work done on this
feature for a long time.

Regards,
Nithya

>
> Regards,
> Poornima
>
> On Wed, Mar 27, 2019 at 12:33 PM Lucian  wrote:
>
>> Oh, that's just what the doctor ordered!
>> Hope it works, thanks
>>
>> On 27 March 2019 03:15:57 GMT, Vlad Kopylov  wrote:
>>>
>>> I don't remember if it still works.
>>> NUFA:
>>>
>>> https://github.com/gluster/glusterfs-specs/blob/master/done/Features/nufa.md
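
(If the option still exists in your GlusterFS build, NUFA was historically
enabled per volume roughly like this - treat it as unsupported, per the note
earlier in this thread; VOLNAME is a placeholder:)

gluster volume set VOLNAME cluster.nufa on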
>>>
>>> v
>>>
>>> On Tue, Mar 26, 2019 at 7:27 AM Nux!  wrote:
>>>
 Hello,

 I'm trying to set up a distributed backup storage (no replicas), but
 I'd like to prioritise the local bricks for any IO done on the volume.
 This will be a backup store, so in other words, I'd like the files to be
 written locally if there is space, so as to save the NICs for other 
 traffic.

 Anyone knows how this might be achievable, if at all?

 --
 Sent from the Delta quadrant using Borg technology!

 Nux!
 www.nux.ro

>>>
>> --
>> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-28 Thread Raghavendra Gowdappa
On Thu, Mar 28, 2019 at 2:37 PM Xavi Hernandez  wrote:

> On Thu, Mar 28, 2019 at 3:05 AM Raghavendra Gowdappa 
> wrote:
>
>>
>>
>> On Wed, Mar 27, 2019 at 8:38 PM Xavi Hernandez 
>> wrote:
>>
>>> On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri <
>>> pkara...@redhat.com> wrote:
>>>


 On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez 
 wrote:

> On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>>
>>
>> On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez 
>> wrote:
>>
>>> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <
>>> rgowd...@redhat.com> wrote:
>>>


 On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez <
 jaher...@redhat.com> wrote:

> Hi Raghavendra,
>
> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
> rgowd...@redhat.com> wrote:
>
>> All,
>>
>> Glusterfs cleans up POSIX locks held on an fd when the
>> client/mount through which those locks are held disconnects from
>> bricks/server. This helps Glusterfs to not run into a stale lock 
>> problem
>> later (For eg., if application unlocks while the connection was still
>> down). However, this means the lock is no longer exclusive as other
>> applications/clients can acquire the same lock. To communicate that 
>> locks
>> are no longer valid, we are planning to mark the fd (which has POSIX 
>> locks)
>> bad on a disconnect so that any future operations on that fd will 
>> fail,
>> forcing the application to re-open the fd and re-acquire locks it 
>> needs [1].
>>
>
> Wouldn't it be better to retake the locks when the brick is
> reconnected if the lock is still in use ?
>

 There is also a possibility that clients may never reconnect.
 That's the primary reason why bricks assume the worst (that the client will
 not reconnect) and clean up the locks.

>>>
>>> True, so it's fine to cleanup the locks. I'm not saying that locks
>>> shouldn't be released on disconnect. The assumption is that if the 
>>> client
>>> has really died, it will also disconnect from other bricks, who will
>>> release the locks. So, eventually, another client will have enough 
>>> quorum
>>> to attempt a lock that will succeed. In other words, if a client gets
>>> disconnected from too many bricks simultaneously (loses Quorum), then 
>>> that
>>> client can be considered as bad and can return errors to the 
>>> application.
>>> This should also cause the locks on the remaining connected
>>> bricks to be released.
>>>
>>> On the other hand, if the disconnection is very short and the client
>>> has not died, it will keep enough locked files (it has quorum) to prevent
>>> other clients from successfully acquiring a lock. In this case, if the 
>>> brick is
>>> reconnected, all existing locks should be reacquired to recover the
>>> original state before the disconnection.
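
(Side note for anyone following along: the POSIX locks a brick currently holds
can be inspected via a statedump; a rough sketch, assuming the default
statedump directory and a placeholder volume name:)

gluster volume statedump VOLNAME
grep -A4 posixlk /var/run/gluster/*.dump.*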
>>>
>>>

> BTW, the referenced bug is not public. Should we open another bug
> to track this ?
>

 I've just opened up the comment to give enough context. I'll open a
 bug upstream too.


>
>
>>
>> Note that with AFR/replicate in picture we can prevent errors to
>> application as long as Quorum number of children "never ever" lost
>> connection with bricks after locks have been acquired. I am using 
>> the term
>> "never ever" as locks are not healed back after re-connection and 
>> hence
>> first disconnect would've marked the fd bad and the fd remains so 
>> even
>> after re-connection happens. So, it's not just Quorum number of 
>> children
>> "currently online", but Quorum number of children "never having
>> disconnected with bricks after locks are acquired".
>>
>
> I think this requisite is not feasible. In a distributed file
> system, sooner or later all bricks will be disconnected. It could be
> because of failures or because an upgrade is done, but it will happen.
>
> The difference here is how long fds are kept open. If
> applications open and close files frequently enough (i.e. the fd is 
> not
> kept open more time than it takes to have more than Quorum bricks
> disconnected) then there's no problem. The problem can only appear on
> applications that open files for a long time and also use posix 
> locks. In
> this case, the only good solution I see is to retake the locks on 
> brick
> reconnection.
>

 

Re: [Gluster-users] POSIX locks and disconnections between clients and bricks

2019-03-28 Thread Xavi Hernandez
On Thu, Mar 28, 2019 at 3:05 AM Raghavendra Gowdappa 
wrote:

>
>
> On Wed, Mar 27, 2019 at 8:38 PM Xavi Hernandez 
> wrote:
>
>> On Wed, Mar 27, 2019 at 2:20 PM Pranith Kumar Karampuri <
>> pkara...@redhat.com> wrote:
>>
>>>
>>>
>>> On Wed, Mar 27, 2019 at 6:38 PM Xavi Hernandez 
>>> wrote:
>>>
 On Wed, Mar 27, 2019 at 1:13 PM Pranith Kumar Karampuri <
 pkara...@redhat.com> wrote:

>
>
> On Wed, Mar 27, 2019 at 5:13 PM Xavi Hernandez 
> wrote:
>
>> On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <
>> rgowd...@redhat.com> wrote:
>>
>>>
>>>
>>> On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez 
>>> wrote:
>>>
 Hi Raghavendra,

 On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <
 rgowd...@redhat.com> wrote:

> All,
>
> Glusterfs cleans up POSIX locks held on an fd when the
> client/mount through which those locks are held disconnects from
> bricks/server. This helps Glusterfs to not run into a stale lock 
> problem
> later (For eg., if application unlocks while the connection was still
> down). However, this means the lock is no longer exclusive as other
> applications/clients can acquire the same lock. To communicate that 
> locks
> are no longer valid, we are planning to mark the fd (which has POSIX 
> locks)
> bad on a disconnect so that any future operations on that fd will 
> fail,
> forcing the application to re-open the fd and re-acquire locks it 
> needs [1].
>

 Wouldn't it be better to retake the locks when the brick is
 reconnected if the lock is still in use ?

>>>
>>> There is also a possibility that clients may never reconnect.
>>> That's the primary reason why bricks assume the worst (that the client will
>>> not reconnect) and clean up the locks.
>>>
>>
>> True, so it's fine to cleanup the locks. I'm not saying that locks
>> shouldn't be released on disconnect. The assumption is that if the client
>> has really died, it will also disconnect from other bricks, who will
>> release the locks. So, eventually, another client will have enough quorum
>> to attempt a lock that will succeed. In other words, if a client gets
>> disconnected from too many bricks simultaneously (loses Quorum), then 
>> that
>> client can be considered as bad and can return errors to the application.
>> This should also cause the locks on the remaining connected
>> bricks to be released.
>>
>> On the other hand, if the disconnection is very short and the client
>> has not died, it will keep enough locked files (it has quorum) to prevent
>> other clients from successfully acquiring a lock. In this case, if the brick 
>> is
>> reconnected, all existing locks should be reacquired to recover the
>> original state before the disconnection.
>>
>>
>>>
 BTW, the referenced bug is not public. Should we open another bug
 to track this ?

>>>
>>> I've just opened up the comment to give enough context. I'll open a
>>> bug upstream too.
>>>
>>>


>
> Note that with AFR/replicate in picture we can prevent errors to
> application as long as Quorum number of children "never ever" lost
> connection with bricks after locks have been acquired. I am using the 
> term
> "never ever" as locks are not healed back after re-connection and 
> hence
> first disconnect would've marked the fd bad and the fd remains so even
> after re-connection happens. So, it's not just Quorum number of 
> children
> "currently online", but Quorum number of children "never having
> disconnected with bricks after locks are acquired".
>

 I think this requisite is not feasible. In a distributed file
 system, sooner or later all bricks will be disconnected. It could be
 because of failures or because an upgrade is done, but it will happen.

 The difference here is how long fds are kept open. If applications
 open and close files frequently enough (i.e. the fd is not kept open 
 more
 time than it takes to have more than Quorum bricks disconnected) then
 there's no problem. The problem can only appear on applications that 
 open
 files for a long time and also use posix locks. In this case, the only 
 good
 solution I see is to retake the locks on brick reconnection.

>>>
>>> Agree. But lock-healing should be done only by HA layers like AFR/EC
>>> as only they know whether there are enough online bricks to have 
>>> prevented
>>> any conflicting lock. Protocol/client itself doesn't have enough
>>> information