Re: [Gluster-devel] [Gluster-users] Need clarification regarding the force option for snapshot delete.

2014-07-01 Thread Raghavendra Bhat

On Friday 27 June 2014 10:47 AM, Raghavendra Talur wrote:

Inline.

- Original Message -
From: Atin Mukherjee amukh...@redhat.com
To: Sachin Pandit span...@redhat.com, Gluster Devel 
gluster-devel@gluster.org, gluster-us...@gluster.org
Sent: Thursday, June 26, 2014 3:30:31 PM
Subject: Re: [Gluster-devel] Need clarification regarding the force option 
for snapshot delete.



On 06/26/2014 01:58 PM, Sachin Pandit wrote:

Hi all,

We had some concern regarding the snapshot delete force option,
That is the reason why we thought of getting advice from everyone out here.

Currently, when we issue "gluster snapshot delete <snapname>", it gives a
notification saying that the mentioned snapshot will be deleted ("Do you still
want to continue? (y/n)"). As soon as the user presses "y", the snapshot is
deleted.

Our new proposal is: when a user issues the snapshot delete command without
force, the user should be given a notification asking them to use the force
option to delete the snap.

In that case "gluster snapshot delete <snapname>" becomes useless apart
from throwing a notification. If we can ensure that snapshot delete all works
only with the force option, then we can have "gluster snapshot delete
<volname>" work as it does now.

~Atin

I agree with Atin here; asking the user to execute the same command with force
appended is not right.



When the snapshot delete command is issued with the force option, the user
should be given a notification saying "Mentioned snapshot will be deleted. Do
you still want to continue? (y/n)".

The reason we thought of bringing this up is that we have planned to introduce
a command "gluster snapshot delete all", which deletes all the snapshots in a
system, and "gluster snapshot delete volume <volname>", which deletes all the
snapshots in the mentioned volume. If a user accidentally issues one of the
above commands and presses "y", they might lose a few or more of the snapshots
present in the volume/system (thinking it will ask for confirmation for each
delete).

It will be good to have this feature: asking for "y" for every delete.
When force is used we don't ask for confirmation for each, similar to rm -f.

If that is not feasible as of now, is something like this better?

Case 1 : Single snap
[root@snapshot-24 glusterfs]# gluster snapshot delete snap1
Deleting snap will erase all the information about the snap.
Do you still want to continue? (y/n) y
[root@snapshot-24 glusterfs]#

Case 2: Delete all system snaps
[root@snapshot-24 glusterfs]# gluster snapshot delete all
Deleting N snaps stored on the system
Do you still want to continue? (y/n) y
[root@snapshot-24 glusterfs]#

Case 3: Delete all volume snaps
[root@snapshot-24 glusterfs]# gluster snapshot delete volume volname
Deleting N snaps for the volume volname
Do you still want to continue? (y/n) y
[root@snapshot-24 glusterfs]#

The idea here is that if the warnings for different commands are different,
then users may pause for a moment to read and check the message.
We could even list the snaps to be deleted, even if we don't ask for
confirmation for each.
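
Something like the following rough sketch (Python-style pseudocode; the
messages and the snap listing are only illustrative, not a proposed
implementation):

    # Hypothetical sketch: build a distinct warning per delete variant, and
    # optionally list the snaps that are about to be removed.
    def delete_warning(kind, snaps, volname=None):
        if kind == 'single':
            return "Deleting snap will erase all the information about the snap."
        listing = "\n".join(snaps)
        if kind == 'all':
            return "Deleting %d snaps stored on the system:\n%s" % (len(snaps), listing)
        return "Deleting %d snaps for the volume %s:\n%s" % (len(snaps), volname, listing)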

Raghavendra Talur


I agree with Raghavendra Talur. It would be better to ask the user for
confirmation without a force option. The method suggested by Talur above seems
neat.


Regards,
Raghavendra Bhat


Do you think a notification would be enough, or do we need to introduce
a force option?

--
Current procedure:
--

[root@snapshot-24 glusterfs]# gluster snapshot delete snap1
Deleting snap will erase all the information about the snap.
Do you still want to continue? (y/n)


Proposed procedure:
---

[root@snapshot-24 glusterfs]# gluster snapshot delete snap1
Please use the force option to delete the snap.

[root@snapshot-24 glusterfs]# gluster snapshot delete snap1 force
Deleting snap will erase all the information about the snap.
Do you still want to continue? (y/n)
--

We are looking forward to feedback on this.

Thanks,
Sachin Pandit.

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel




Re: [Gluster-devel] Feature review: Improved rebalance performance

2014-07-01 Thread Xavier Hernandez
On Monday 30 June 2014 16:18:09 Shyamsundar Ranganathan wrote:
  Will this rebalance on access feature be enabled always or only during a
  brick addition/removal to move files that do not go to the affected brick
  while the main rebalance is populating or removing files from the brick ?
 
 The rebalance on access, in my head, stands as follows, (a little more
 detailed than what is in the feature page)
 Step 1: Initiation of the process
 - Admin chooses to rebalance _changed_ bricks
   - This could mean added/removed/changed size bricks
 [3]- Rebalance on access is triggered, so as to move files when they are
 accessed but asynchronously
 [1]- Background rebalance, acts only to (re)move data (from)to these bricks
 [2]- This would also change the layout for all directories, to include the
 new configuration of the cluster, so that newer data is placed in the
 correct bricks
 
 Step 2: Completion of background rebalance
 - Once background rebalance is complete, the rebalance status is noted as
 success/failure based on what the background rebalance process did
 - This will not stop the on access rebalance, as data is still all over the
 place, and enhancements like lookup-unhashed=auto will have trouble

I don't see why stopping rebalance on access would be a problem when
lookup-unhashed=auto is enabled. If I understand http://review.gluster.org/7702/
correctly, when the directory commit hash does not match that of the volume
root, a global lookup will be made. If we change the layout in [3], it will
also change (or it should) the commit hash of the directory. This means that
even if the files of that directory are not rebalanced yet, they will be found
regardless of whether on-access rebalance is enabled or not.

Am I missing something ?
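
To make my reading concrete, here is a minimal sketch (hypothetical helpers,
not DHT code) of how the commit-hash check would pick between a hashed lookup
and a global lookup:

    # Sketch only: the hashing and locate() stand in for DHT's real hashing and
    # network lookup; the commit-hash comparison is my reading of review 7702.
    from zlib import crc32

    def hashed_subvol(name, subvols):
        return subvols[crc32(name.encode()) % len(subvols)]

    def lookup(name, dir_commit, root_commit, subvols, locate):
        """locate(subvol, name) -> inode or None (stand-in for a network call)."""
        if dir_commit == root_commit:
            # Layout is up to date: only the hashed subvolume is consulted.
            return locate(hashed_subvol(name, subvols), name)
        # Commit mismatch (e.g. layout changed in [3]): global lookup, so files
        # not yet migrated are still found, with or without on-access rebalance.
        for sv in subvols:
            inode = locate(sv, name)
            if inode is not None:
                return inode
        return None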

 
 Step 3: Admin can initiate a full rebalance
 - When this is complete then the on access rebalance would be turned off, as
 the cluster is rebalanced!
 
 Step 2.5/4: Choosing to stop the on access rebalance
 - This can be initiated by the admin, post 3 which is more logical or
 between 2 and 3, in which case lookup everywhere for files etc. cannot be
 avoided due to [2] above
 

Having the possibility for admins to enable/disable this feature seems
interesting. However, I also think it should be forcibly enabled when
rebalancing _changed_ bricks.

 Issues and possible solutions:
 
 [4] One other thought is to create link files, as a part of [1], for files
 that do not belong to the right bricks but are _not_ going to be rebalanced
 as their source/destination is not a changed brick. This _should_ be faster
 than moving data around and rebalancing these files. It should also avoid
 the problem that, post a rebalance _changed_ command, the cluster may
 have files in the wrong place based on the layout, as the link files would
 be present to correct the situation. In this situation the rebalance on
 access can be left on indefinitely and turning it off does not serve much
 purpose.
 

I think that creating link files is a cheap task, especially if rebalance
handles files in parallel. However, I'm not sure whether this will make any
measurable difference in performance on future accesses (in theory it should
avoid one global lookup). This would need to be tested to decide.

 Enabling rebalance on access always is fine, but I am not sure it buys us
 gluster states that mean the cluster is in a balanced situation, for other
 actions like the lookup-unhashed mentioned which may not just need the link
 files in place. Examples could be mismatched or overly space committed
 bricks with old, not accessed data etc. but do not have a clear example
 yet.
 

As I see it, rebalance on access should be a complement to normal rebalance to
keep the volume _more_ balanced (keep accessed files on the right brick to
avoid unnecessary delays due to global lookups or link file redirections), but
it cannot ensure that the volume is fully rebalanced.

 Just stating, the core intention of rebalance _changed_ is to create space
 in existing bricks when the cluster grows faster, or be able to remove
 bricks from the cluster faster.
 

That is a very important feature. I've missed it several times when expanding 
a volume. In fact we needed to write some scripts to do something similar 
before launching a full rebalance.

 Redoing a rebalance _changed_ again due to a gluster configuration change,
 i.e expanding the cluster again say, needs some thought. It does not impact
 if rebalance on access is running or not, the only thing it may impact is
 the choice of files that are already put into the on access queue based on
 the older layout, due to the older cluster configuration. Just noting this
 here.
 

This will need to be thought through more deeply, but if we only have a queue
of files that *may* need migration, and we really check the target volume at
the time of migration, I think this won't pose much of a problem in the case of
successive rebalances.
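
Roughly what I have in mind, as a minimal sketch with hypothetical names (not
an actual design):

    # Files flagged on access are only *candidates*; the target subvolume is
    # recomputed from the layout in force when the migration actually runs, so
    # entries queued under an older layout don't cause wrong moves.
    from collections import deque

    pending = deque()

    def flag_on_access(path, cached_subvol):
        pending.append((path, cached_subvol))

    def drain(current_target, migrate):
        """current_target(path) -> subvolume per the layout in force right now."""
        while pending:
            path, cached = pending.popleft()
            target = current_target(path)   # evaluated now, not at queue time
            if target != cached:            # still misplaced under current layout
                migrate(path, target)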

 In short if we do [4] then we can leave rebalance on access turned 

Re: [Gluster-devel] Feature review: Improved rebalance performance

2014-07-01 Thread Raghavendra Gowdappa


- Original Message -
 From: Shyamsundar Ranganathan srang...@redhat.com
 To: Xavier Hernandez xhernan...@datalab.es
 Cc: gluster-devel@gluster.org
 Sent: Tuesday, July 1, 2014 1:48:09 AM
 Subject: Re: [Gluster-devel] Feature review: Improved rebalance performance
 
  From: Xavier Hernandez xhernan...@datalab.es
  
  Hi Shyam,
  
  On Thursday 26 June 2014 14:41:13 Shyamsundar Ranganathan wrote:
   It also touches upon a rebalance on access like mechanism where we could
   potentially, move data out of existing bricks to a newer brick faster, in
   the case of brick addition, and vice versa for brick removal, and heal
   the
   rest of the data on access.
   
  Will this rebalance on access feature be enabled always or only during a
  brick addition/removal to move files that do not go to the affected brick
  while the main rebalance is populating or removing files from the brick ?
 
 The rebalance on access, in my head, stands as follows, (a little more
 detailed than what is in the feature page)
 Step 1: Initiation of the process
 - Admin chooses to rebalance _changed_ bricks
   - This could mean added/removed/changed size bricks
 [3]- Rebalance on access is triggered, so as to move files when they are
 accessed but asynchronously
 [1]- Background rebalance, acts only to (re)move data (from)to these bricks
   [2]- This would also change the layout for all directories, to include the
   new configuration of the cluster, so that newer data is placed in the
   correct bricks
 
 Step 2: Completion of background rebalance
 - Once background rebalance is complete, the rebalance status is noted as
 success/failure based on what the background rebalance process did
 - This will not stop the on access rebalance, as data is still all over the
 place, and enhancements like lookup-unhashed=auto will have trouble
 
 Step 3: Admin can initiate a full rebalance
 - When this is complete then the on access rebalance would be turned off, as
 the cluster is rebalanced!
 
 Step 2.5/4: Choosing to stop the on access rebalance
 - This can be initiated by the admin, post 3 which is more logical or between
 2 and 3, in which case lookup everywhere for files etc. cannot be avoided
 due to [2] above
 
 Issues and possible solutions:
 
 [4] One other thought is to create link files, as a part of [1], for files
 that do not belong to the right bricks but are _not_ going to be rebalanced
 as their source/destination is not a changed brick. This _should_ be faster
 than moving data around and rebalancing these files. It should also avoid
 the problem that, post a rebalance _changed_ command, the cluster may have
 files in the wrong place based on the layout, as the link files would be
 present to correct the situation. In this situation the rebalance on access
 can be left on indefinitely and turning it off does not serve much purpose.
 
 Enabling rebalance on access always is fine, but I am not sure it buys us
 gluster states that mean the cluster is in a balanced situation, for other
 actions like the lookup-unhashed mentioned which may not just need the link
 files in place. Examples could be mismatched or overly space committed
 bricks with old, not accessed data etc. but do not have a clear example yet.
 
 Just stating, the core intention of rebalance _changed_ is to create space
 in existing bricks when the cluster grows faster, or be able to remove
 bricks from the cluster faster.
 
 Redoing a rebalance _changed_ again due to a gluster configuration change,
 i.e expanding the cluster again say, needs some thought. It does not impact
 if rebalance on access is running or not, the only thing it may impact is
 the choice of files that are already put into the on access queue based on
 the older layout, due to the older cluster configuration. Just noting this
 here.
 
 In short if we do [4] then we can leave rebalance on access turned on always,
 unless we have some other counter examples or use cases that are not thought
 of. Doing [4] seems logical, so I would state that we should, but from a
 performance angle of improving rebalance, we need to determine the worth
 against access paths from IO post not having [4] (again considering the
 improvement that lookup-unhashed brings, this maybe obvious that [4] should
 be done).
 
 A note on [3]: the intention is to start an asynchronous sync task that
 rebalances the file on access, and not impact the IO path. So if a file is
 identified by the IO path as needing a rebalance, a sync task with the
 required xattr to trigger a file move is set up, and setxattr is called; that
 should take care of the file migration while letting the IO path progress
 as is.
 
 Reading through your mail, a better way of doing this by sharing the load,
 would be to use an index, so that each node in the cluster has a list of
 files accessed that need a rebalance. The above method for [3] would be
 client heavy and would incur a network read and write, whereas the index
 manner of doing 

Re: [Gluster-devel] Feature review: Improved rebalance performance

2014-07-01 Thread Raghavendra Gowdappa


- Original Message -
 From: Xavier Hernandez xhernan...@datalab.es
 To: Raghavendra Gowdappa rgowd...@redhat.com
 Cc: Shyamsundar Ranganathan srang...@redhat.com, gluster-devel@gluster.org
 Sent: Tuesday, July 1, 2014 3:10:29 PM
 Subject: Re: [Gluster-devel] Feature review: Improved rebalance performance
 
 On Tuesday 01 July 2014 02:37:34 Raghavendra Gowdappa wrote:
   Another thing to consider for future versions is to modify the current
   DHT
   to a consistent hashing and even the hash value (using gfid instead of a
   hash of the name would solve the rename problem). The consistent hashing
   would drastically reduce the number of files that need to be moved and
   already solves some of the current problems. This change needs a lot of
   thinking though.
  
  The problem with using gfid for hashing instead of name is that we run into
  a chicken and egg problem. Before lookup, we cannot know the gfid of the
  file and to lookup the file, we need gfid to find out the node in which
  file resides. Of course, this problem would go away if we lookup (may be
  just during fresh lookups) on all the nodes, but that slows down the fresh
  lookups and may not be acceptable.
 
 I think it's not so problematic, and the benefits would be considerable.
 
 The gfid of the root directory is always known. This means that we could
 always do a lookup on root by gfid.
 
 I haven't tested it but as I understand it, when you want to do a getxattr on
 a file inside a subdirectory, for example, the kernel will issue lookups on
 all intermediate directories to check,

Yes, but how does dht handle these lookups? Are you suggesting that we wind the
lookup call to all subvolumes (since, for lack of a gfid, we don't know which
subvolume the file is present on)?

 at least, the access rights before
 finally reading the xattr of the file. This means that we can get and cache
 gfid's of all intermediate directories in the process.
 
 Even if there's some operation that does not issue a previous lookup, we could
 do that lookup if it's not cached. Of course, if there were many more
 operations not issuing a previous lookup, this solution wouldn't be good, but
 I think this is not the case.
 
 I'll try to do some tests to see if this is correct.
 
 Xavi
 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Feature review: Improved rebalance performance

2014-07-01 Thread Xavier Hernandez
On Tuesday 01 July 2014 05:55:51 Raghavendra Gowdappa wrote:
 - Original Message -
Another thing to consider for future versions is to modify the current
DHT
to a consistent hashing and even the hash value (using gfid instead of
a
hash of the name would solve the rename problem). The consistent
hashing
would drastically reduce the number of files that need to be moved and
already solves some of the current problems. This change needs a lot
of
thinking though.
   
   The problem with using gfid for hashing instead of name is that we run
   into
   a chicken and egg problem. Before lookup, we cannot know the gfid of the
   file and to lookup the file, we need gfid to find out the node in which
   file resides. Of course, this problem would go away if we lookup (may be
   just during fresh lookups) on all the nodes, but that slows down the
   fresh
   lookups and may not be acceptable.
  
  I think it's not so problematic, and the benefits would be considerable.
  
  The gfid of the root directory is always known. This means that we could
  always do a lookup on root by gfid.
  
  I haven't tested it but as I understand it, when you want to do a getxattr
  on a file inside a subdirectory, for example, the kernel will issue
  lookups on all intermediate directories to check,
 
 Yes, but how does dht handle these lookups? Are you suggesting that we wind
 the lookup call to all subvolumes (since we don't know which subvolume the
 file is present for lack of gfid)?

Oops, that's true. It only works combined with another idea we had about
storing directories as special files (using the same redundancy as normal
files). This way a lookup for an entry would be translated into a special
lookup on the parent directory (we know where it is and its gfid) asking for a
specific entry, which would return its gfid (and probably some other info). Of
course this has more implications, such as the bricks not being able to
maintain a (partial) view of the file system as they do now.

Right now, using the gfid as the hash key is not possible because it would
require asking each subvolume on lookups, as you say, and this is not efficient.

The solution I described would need some important architectural changes. It
could be an option to consider for 4.0.
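
As a rough illustration of that idea (purely hypothetical interfaces, not
anything the bricks expose today):

    # With directories stored as special files, an entry lookup first asks the
    # parent directory's subvolume for the entry's gfid, then hashes that gfid
    # to find the subvolume holding the file itself -- no global lookup needed.
    from zlib import crc32

    def subvol_for_gfid(gfid, subvols):
        return subvols[crc32(gfid.encode()) % len(subvols)]

    def lookup(parent_gfid, name, subvols, read_dir_entry, stat_by_gfid):
        parent_sv = subvol_for_gfid(parent_gfid, subvols)   # parent location known
        child_gfid = read_dir_entry(parent_sv, parent_gfid, name)
        if child_gfid is None:
            return None                                     # ENOENT
        return stat_by_gfid(subvol_for_gfid(child_gfid, subvols), child_gfid)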

Xavi

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Update on better peer identification

2014-07-01 Thread Kaushal M
Hi everyone,

As everyone hopefully knows by now, improving the peer identification
mechanism within Glusterd is one of the features being targeted for
glusterfs-3.6. [0]

I had proposed this a while back, but had not been able to do much
work related to this till now. Varun (CCd) and I have been working on
this since last week, and are hoping to get at least the base
framework ready and merged into 3.6.

The main problem we have currently is that a peer can be associated
with just a single address, stored in peerinfo->hostname. This
isn't sufficient to cover all the possible addresses a peer could be
associated with, and it leads to failures identifying peers during various
commands. Also, our identification isn't able to correctly match
shortnames, FQDNs and IPs, which can also lead to failures.

Varun and I are hoping to solve these two problems and are currently
working towards it. We plan to do the following:
1. Extend peerinfo to hold a list of addresses instead of a single address.
2. Improve peer probe to add unknown addresses to this list when we
identify that they belong to a known peer.
3. Improve the glusterd_friend_by_hostname helper to correctly handle
matching addresses.

These three changes should lay the groundwork for addressing glusterd's peer
identification problems.
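
As a small sketch of what point 3 above means (illustrative Python, not the
glusterd code): match an address against a peer's known addresses by resolving
both sides, so shortnames, FQDNs and IPs of the same host compare equal.

    # Illustrative only: resolve every known address of the peer and the
    # candidate to sets of IPs and look for an intersection.
    import socket

    def resolve(addr):
        try:
            return {info[4][0] for info in socket.getaddrinfo(addr, None)}
        except socket.gaierror:
            return {addr}

    def peer_matches(peer_addresses, candidate):
        wanted = resolve(candidate)
        return any(resolve(known) & wanted for known in peer_addresses)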

I've updated the feature page [0] with the latest details.

We have been doing changes over the past week, and our changes can be
viewed on my glusterfs forks on forge [1] and github [2].

We currently don't have any changeset for review on gerrit, but if
anyone wants to review code right now, you can use [3].

Thanks.

~kaushal

[0] - 
http://www.gluster.org/community/documentation/index.php/Features/Better_peer_identification
[1] - 
https://forge.gluster.org/~kshlm/glusterfs-core/kshlms-glusterfs/commits/better-peer-identification
[2] - https://github.com/kshlm/glusterfs/tree/better-peer-identification
[3] - https://github.com/kshlm/glusterfs/pull/2/files
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] Need clarification regarding the force option for snapshot delete.

2014-07-01 Thread Sachin Pandit
Thank you all for the feedback.
The following is what will be shown to the user for the snapshot delete command.

---
Case 1 : Single snap
[root@snapshot-24 glusterfs]# gluster snapshot delete snap-name
Deleting snap will erase all the information about the snap.
Do you still want to continue? (y/n) y
snapshot delete : snap-name deleted successfully.
[root@snapshot-24 glusterfs]#

---
Case 2: Delete all snaps present in system
[root@snapshot-24 glusterfs]# gluster snapshot delete all
Deleting N snaps stored on the system
Do you still want to continue? (y/n) y
snapshot delete : snap1 deleted successfully.
snapshot delete : snap2 deleted successfully.
.
.
snapshot delete : snapn deleted successfully.
[root@snapshot-24 glusterfs]#


Case 3: Delete all snaps present in a volume
[root@snapshot-24 glusterfs]# gluster snapshot delete volume volname
Deleting N snaps for the volume volname
Do you still want to continue? (y/n) y
snapshot delete : snap1 deleted successfully.
snapshot delete : snap2 deleted successfully.
.
.
snapshot delete : snapn deleted successfully.
[root@snapshot-24 glusterfs]#

---
- Original Message -
From: Raghavendra Bhat rab...@redhat.com
To: gluster-us...@gluster.org, gluster-devel@gluster.org
Sent: Tuesday, July 1, 2014 12:18:17 PM
Subject: Re: [Gluster-devel] [Gluster-users] Need clarification regarding the 
force option for snapshot delete.

On Friday 27 June 2014 10:47 AM, Raghavendra Talur wrote:
 Inline.

 - Original Message -
 From: Atin Mukherjee amukh...@redhat.com
 To: Sachin Pandit span...@redhat.com, Gluster Devel 
 gluster-devel@gluster.org, gluster-us...@gluster.org
 Sent: Thursday, June 26, 2014 3:30:31 PM
 Subject: Re: [Gluster-devel] Need clarification regarding the force option 
 for snapshot delete.



 On 06/26/2014 01:58 PM, Sachin Pandit wrote:
 Hi all,

 We had some concern regarding the snapshot delete force option,
 That is the reason why we thought of getting advice from everyone out here.

 Currently, when we issue "gluster snapshot delete <snapname>", it gives a
 notification saying that the mentioned snapshot will be deleted ("Do you still
 want to continue? (y/n)"). As soon as the user presses "y", the snapshot is
 deleted.

 Our new proposal is: when a user issues the snapshot delete command without
 force, the user should be given a notification asking them to use the force
 option to delete the snap.
 In that case "gluster snapshot delete <snapname>" becomes useless apart
 from throwing a notification. If we can ensure that snapshot delete all works
 only with the force option, then we can have "gluster snapshot delete
 <volname>" work as it does now.

 ~Atin

 Agree with Atin here, asking user to execute same command with force appended 
 is
 not right.


 When snapshot delete command is issued with force option then the user 
 should
 be given a notification saying Mentioned snapshot will be deleted, Do you 
 still
 want to continue (y/n).

 The reason we thought of bringing this up is that we have planned to introduce
 a command "gluster snapshot delete all", which deletes all the snapshots in a
 system, and "gluster snapshot delete volume <volname>", which deletes all the
 snapshots in the mentioned volume. If a user accidentally issues one of the
 above commands and presses "y", they might lose a few or more of the snapshots
 present in the volume/system (thinking it will ask for confirmation for each
 delete).
 It will be good to have this feature, asking for y for every delete.
 When force is used we don't ask confirmation for each. Similar to rm -f.

 If that is not feasible as of now, is something like this better?

 Case 1 : Single snap
 [root@snapshot-24 glusterfs]# gluster snapshot delete snap1
 Deleting snap will erase all the information about the snap.
 Do you still want to continue? (y/n) y
 [root@snapshot-24 glusterfs]#

 Case 2: Delete all system snaps
 [root@snapshot-24 glusterfs]# gluster snapshot delete all
 Deleting N snaps stored on the system
 Do you still want to continue? (y/n) y
 [root@snapshot-24 glusterfs]#

 Case 3: Delete all volume snaps
 [root@snapshot-24 glusterfs]# gluster snapshot delete volume volname
 Deleting N snaps for the volume volname
 Do you still want to continue? (y/n) y
 [root@snapshot-24 glusterfs]#

 The idea here is that if the warnings for different commands are different,
 then users may pause for a moment to read and check the message.
 We could even list the snaps to be deleted, even if we don't ask for
 confirmation for each.

 Raghavendra Talur

I agree with Raghavendra Talur. It would be better to ask the user for
confirmation without a force option. The method suggested by Talur above seems
neat.

Regards,
Raghavendra Bhat

 Do you think a notification would be enough, or do we need to introduce
 a force option?

 

[Gluster-devel] Error coalesce for erasure code xlator

2014-07-01 Thread Xavier Hernandez
Hi,

while the erasure code xlator is being reviewed, I'm thinking about how to 
handle some kinds of errors.

In normal circumstances all bricks will give the same answers to the same 
requests, however, after some brick failures, underlying file system 
corruption or any other factors, it's possible that bricks give different 
answers to the same request.

For example, an 'unlink' request could succeed on some bricks and fail on
others. Currently, the most common answer is taken as the good one only if
it reaches a minimum quorum; if there isn't enough quorum, the operation
fails with EIO.

Not having enough quorum means that more than R (redundancy) bricks have
failed simultaneously (or have failed while another brick was alive but not
yet recovered), which means that we are outside the defined working conditions.
However, in some circumstances this could be improved.

Suppose that the reason for the unlink failure on some brick is ENOENT. We
could consider this answer a success and combine it with the other successful
answers, giving more chances to reach the quorum minimum. Of course this will
depend on the operation: if the operation were an open instead of an unlink,
this combination wouldn't be possible.

This can also be applied to error codes. In the same case, ENOENT and ENOTDIR
errors could be combined, because they basically mean the same thing (relative
to the file in question). Even in an open operation these two answers could be
combined to give a more detailed error instead of EIO.

The only possible combinations I see are:

* Coalesce an error answer with a success answer
* Coalesce two different error answers

I don't see any case where two different success answers could be combined.
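
As a minimal sketch of the kind of rules I have in mind (the tables below are
only examples, not a worked-out policy):

    # Per-fop errors that may be counted as success, and groups of errors that
    # "mean the same" and can be merged into one specific error instead of EIO.
    import errno

    SUCCESS_EQUIVALENT = {
        'unlink': {errno.ENOENT},
        'rmdir':  {errno.ENOENT},
    }
    MERGEABLE = [{errno.ENOENT, errno.ENOTDIR}]

    def coalesce(fop, answers, quorum):
        """answers: one entry per brick, 0 for success or an errno value."""
        ok = [a for a in answers if a == 0 or a in SUCCESS_EQUIVALENT.get(fop, set())]
        if len(ok) >= quorum:
            return 0
        errors = {a for a in answers if a != 0}
        for group in MERGEABLE:
            if errors and errors <= group:
                return min(errors)          # a specific error instead of EIO
        return errno.EIO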

Would this be interesting to have for ec ?

Any thoughts/ideas/feedback will be welcome.

Xavi

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Reduce number of inodelk/entrylk calls on ec xlator

2014-07-01 Thread Xavier Hernandez
Hi,

The current implementation of the ec xlator uses inodelk/entrylk before each
operation to guarantee exclusive access to the inode. This blocks any
other request on the same inode/entry until the previous operation has
completed and unlocked it.

This adds a lot of latency to each operation, even if there are no conflicts
with other clients. To improve this I was thinking of implementing something
similar to eager-locking and piggy-backing.

The following is a schematic description of the idea:

* Each operation will build a list of things to be locked (this could be 1
  inode or up to 2 entries).
* For each lock in the list:
   * If the lock is already acquired by another operation, it will add itself
 to a list of waiting operations associated to the operation that
 currently holds the lock.
   * If the lock is not acquired, it will initiate the normal inodelk/entrylk
 calls.
   * The locks will be acquired in a special order to guarantee that there
 couldn't be deadlocks.
* When the operation that is currently holding the lock terminates, it will
  test if there are waiting operations on it before unlocking. If so, it will
  resume execution of the next operation without unlocking.
* In the same way, xattr updating after operation will be delayed if another
  request was waiting to modify the same inode.
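
A minimal, single-threaded sketch of the waiting-list part (hypothetical
structures, not the ec implementation):

    # The first operation takes the real inodelk; later operations on the same
    # inode queue behind it, and on completion the lock is handed to the next
    # waiter instead of being released.
    class LockSlot(object):
        def __init__(self):
            self.owner = None
            self.waiting = []

    def acquire(slot, op, send_inodelk):
        if slot.owner is None:
            slot.owner = op
            send_inodelk(op)        # network lock only for the first owner
            return True             # op can run now
        slot.waiting.append(op)     # piggy-back on the already-held lock
        return False                # op will be resumed later

    def complete(slot, send_unlock, resume):
        if slot.waiting:
            slot.owner = slot.waiting.pop(0)
            resume(slot.owner)      # hand the lock over without unlocking
        else:
            slot.owner = None
            send_unlock()           # no waiters: release the inodelk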

The case with two locks must be analyzed more deeply to guarantee that
intermediate states combined with other operations don't generate deadlocks.

To avoid stalling other clients I'm thinking of using GLUSTERFS_OPEN_FD_COUNT
to see if the same file is open by other clients. In that case, the operation
will unlock the inode even if there are other operations waiting. Once the
unlock is finished, the waiting operation will restart the inodelk/entrylk
procedure.

Do you think this is a good approach?

Any thoughts/ideas/feedback will be welcome.

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Reduce number of inodelk/entrylk calls on ec xlator

2014-07-01 Thread haiwei.xie-soulinfo
hi Xavi, 

   Writev's inodelk locks the whole file, so write speed is bad. If we used
inodelk(offset, len) instead, the IDA_KEY_SIZE xattr would become inconsistent
across bricks because of unordered writevs.

   So how about using just IDA_KEY_VERSION and the bricks' ia_size to detect
data corruption? Drop IDA_KEY_SIZE, have lookup lock the whole file, and have
readv lock (offset, len).

   I guess this can get good performance while keeping data consistent.

   Thanks.

-terrs

 Hi,
 
 current implementation of ec xlator uses inodelk/entrylk before each 
 operation 
 to guarantee exclusive access to the inode. This implementation blocks any 
 other request to the same inode/entry until the previous operation has 
 completed and unlocked it.
 
 This adds a lot of latency to each operation, even if there are no conflicts 
 with other clients. To improve this I was thinking to implement something 
 similar to eager-locking and piggy-backing.
 
 The following is an schematic description of the idea:
 
 * Each operation will build a list of things to be locked (this could be 1
   inode or up to 2 entries).
 * For each lock in the list:
* If the lock is already acquired by another operation, it will add itself
  to a list of waiting operations associated to the operation that
  currently holds the lock.
* If the lock is not acquired, it will initiate the normal inodelk/entrylk
  calls.
* The locks will be acquired in a special order to guarantee that there
  couldn't be deadlocks.
 * When the operation that is currently holding the lock terminates, it will
   test if there are waiting operations on it before unlocking. If so, it will
   resume execution of the next operation without unlocking.
 * In the same way, xattr updating after operation will be delayed if another
   request was waiting to modify the same inode.
 
 The case with 2 locks must be analyzed deeper to guarantee that intermediate 
 states combined with other operations don't generate deadlocks.
 
 To avoid stalls of other clients I'm thinking to use GLUSTERFS_OPEN_FD_COUNT 
 to see if the same file is open by other clients. In this case, the operation 
 will unlock the inode even if there are other operations waiting. Once the 
 unlock is finished, the waiting operation will restart the inodelk/entrylk 
 procedure.
 
 Do you think this is a good approximation ?
 
 Any thoughts/ideas/feedback will be welcome.
 
 Xavi
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://supercolony.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Reduce number of inodelk/entrylk calls on ec xlator

2014-07-01 Thread Xavier Hernandez
On Tuesday 01 July 2014 21:37:57 haiwei.xie-soulinfo wrote:
 hi Xavi,
 
Writev inodelk lock whole file, so write speed is bad. If
 inodelk(offset,len), IDA_KEY_SIZE xattr will be not consistent crossing
 bricks from unorder writev.
 
So how about just use IDA_KEY_VERSION and bricks ia_size to check data
 crash? Drop IDA_KEY_SIZE, and lookup lock whole file, readv lock
 (offset,len).
 
I guess, this can get good performance and data consistent.

The file version needs to be updated in exactly the same order on all bricks,
like the size. Allowing unordered writes can generate undetectable inconsistent
data if two bricks fail at the same time but have written different things.
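
A toy illustration (just arithmetic, not ec's metadata handling): if two bricks
apply the same overlapping writes in a different order and each simply bumps
its version, they end up with equal versions but different data, so a later
version check cannot tell which brick is bad.

    def apply(writes, data=b"....", version=0):
        for off, buf in writes:
            data = data[:off] + buf + data[off + len(buf):]
            version += 1
        return data, version

    w1, w2 = (0, b"AB"), (1, b"CD")
    print(apply([w1, w2]))   # (b'ACD.', 2)
    print(apply([w2, w1]))   # (b'ABD.', 2) -- same version, different data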

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Feature review: Improved rebalance performance

2014-07-01 Thread Shyamsundar Ranganathan
 From: Xavier Hernandez xhernan...@datalab.es
 On Monday 30 June 2014 16:18:09 Shyamsundar Ranganathan wrote:
   Will this rebalance on access feature be enabled always or only during
   a
   brick addition/removal to move files that do not go to the affected brick
   while the main rebalance is populating or removing files from the brick ?
  
  The rebalance on access, in my head, stands as follows, (a little more
  detailed than what is in the feature page)
  Step 1: Initiation of the process
  - Admin chooses to rebalance _changed_ bricks
    - This could mean added/removed/changed size bricks
  [3]- Rebalance on access is triggered, so as to move files when they are
  accessed but asynchronously
  [1]- Background rebalance, acts only to (re)move data (from)to these bricks
  [2]- This would also change the layout for all directories, to include the
  new configuration of the cluster, so that newer data is placed in the
  correct bricks
  
  Step 2: Completion of background rebalance
  - Once background rebalance is complete, the rebalance status is noted as
  success/failure based on what the background rebalance process did
  - This will not stop the on access rebalance, as data is still all over the
  place, and enhancements like lookup-unhashed=auto will have trouble
 
 I don't see why stopping rebalance on access when lookup-unhashed=auto is a
 problem. If I understand http://review.gluster.org/7702/ correctly, when the
 directory commit hash does not match that of the volume root, a global lookup
 will be made. If we change layout in [3], it will also change (or it should)
 the commit of the directory. This means that even if files of that directory
 are not rebalanced yet, they will be found regardless if on access rebalance
 is enabled or not.
 
 Am I missing something ?

The comment was more to state that the speed-up gained by lookup-unhashed
would be lost for the time that the cluster is not completely rebalanced, or
has not recorded all redirections as link files. The feature will work, but
sub-optimally, and we need to consider/reduce the time for which this
sub-optimal behavior is in effect.

 
  
  Step 3: Admin can initiate a full rebalance
  - When this is complete then the on access rebalance would be turned off,
  as
  the cluster is rebalanced!
  
  Step 2.5/4: Choosing to stop the on access rebalance
  - This can be initiated by the admin, post 3 which is more logical or
  between 2 and 3, in which case lookup everywhere for files etc. cannot be
  avoided due to [2] above
  
 
 I like having the possibility for admins to enable/disable this feature seems
 interesting. However I also think this should be forcibly enabled when
 rebalancing _changed_ bricks.

Yes, when rebalance _changed_ is in effect the rebalance on access is also in
effect, as noted in Step 1 of the elaboration above.

 
  Issues and possible solutions:
  
  [4] One other thought is to create link files, as a part of [1], for files
  that do not belong to the right bricks but are _not_ going to be rebalanced
  as their source/destination is not a changed brick. This _should_ be faster
  than moving data around and rebalancing these files. It should also avoid
  the problem that, post a rebalance _changed_ command, the cluster may
  have files in the wrong place based on the layout, as the link files would
  be present to correct the situation. In this situation the rebalance on
  access can be left on indefinitely and turning it off does not serve much
  purpose.
  
 
 I think that creating link files is a cheap task, specially if rebalance will
 handle files in parallel. However I'm not sure if this will make any
 measurable difference in performance on future accesses (in theory it should
 avoid a global lookup once). This would need to be tested to decide.

It would also avoid the global lookup on creation of new files when
lookup-unhashed=auto is in force, so that during creates the file is found (or
not) in the hashed subvolume in order to report EEXIST errors as needed.

For an existing file, yes, the link file creation is triggered on the first
lookup, which would do a global lookup, as opposed to the rebalance process
ensuring these link files are present. Overall, my thought is that it is better
to have the link files created, so that creates and lookups of existing files
do not suffer the time and resource penalties.

 
  Enabling rebalance on access always is fine, but I am not sure it buys us
  gluster states that mean the cluster is in a balanced situation, for other
  actions like the lookup-unhashed mentioned which may not just need the link
  files in place. Examples could be mismatched or overly space committed
  bricks with old, not accessed data etc. but do not have a clear example
  yet.
  
 
 As I see it, rebalance on access should be a complement to normal rebalance
 to
 keep the volume _more_ balanced (keep accessed files on the right brick to
 avoid unnecessary delays due to global lookups or link file redirections),
 but
 it 

Re: [Gluster-devel] Update on better peer identification

2014-07-01 Thread Justin Clift
On 01/07/2014, at 11:30 AM, Kaushal M wrote:
snip
 Varun (CCd) and I have been working on
 this since last week, and are hoping to get at least the base
 framework ready and merged into 3.6.


Cool. Personally, I reckon this is extremely important, as a lot
of future changes will rely on it being in place. :)

+ Justin

--
GlusterFS - http://www.gluster.org

An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel