[Gluster-devel] REMINDER: Gluster Community Bug Triage meeting today at 12:00 UTC

2014-11-25 Thread Niels de Vos
Hi all,

Later today we will have another Gluster Community Bug Triage meeting.

Meeting details:
- location: #gluster-meeting on Freenode IRC
- date: every Tuesday
- time: 12:00 UTC, 13:00 CET (in your terminal, run: date -d "12:00 UTC")
- agenda: https://public.pad.fsfe.org/p/gluster-bug-triage

Currently the following items are listed:
* Roll Call
* Status of last week's action items
* Group Triage
* Open Floor

The last two topics have space for additions. If you have a suitable bug
or topic to discuss, please add it to the agenda.

Your host today is LalatenduM. I'm unfortunately not available this
afternoon.

Thanks,
Niels




Re: [Gluster-devel] Wrong behavior on fsync of md-cache ?

2014-11-25 Thread Xavier Hernandez

On 11/25/2014 07:38 AM, Raghavendra Gowdappa wrote:

- Original Message -

From: Xavier Hernandez xhernan...@datalab.es
To: Raghavendra Gowdappa rgowd...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org, Emmanuel Dreyfus 
m...@netbsd.org
Sent: Tuesday, November 25, 2014 12:49:03 AM
Subject: Re: Wrong behavior on fsync of md-cache ?

I think the problem is here: the first thing wb_fsync()
checks is whether there's an error in the fd (wb_fd_err()). If that's the
case, the call is immediately unwound with that error. The error seems
to be set in wb_fulfill_cbk(). I don't know the internals of the write-back
xlator, but this seems to be the problem.


Yes, your analysis is correct. Once the error is hit, fsync is not
queued behind unfulfilled writes. Whether it can be considered a bug
is debatable. Since there is already an error in one of the writes that
was written behind, fsync should return the error. I am not sure whether
it should wait till we try to flush _all_ the writes that were written
behind. Any suggestions on what the expected behaviour is here?



I think that it should wait for all pending writes. In the test case I 
used, all pending writes will fail the same way as the first one, but 
in other situations it's possible to have one write failing (for example 
due to a damaged block on disk) and the following writes succeeding.


From the man page of fsync:

fsync() transfers (flushes) all modified in-core data of (i.e.,
modified buffer cache pages for) the file referred to by the file
descriptor fd to the disk device (or other permanent storage
device) so that all changed information can be retrieved even after
the system crashed or was rebooted. This includes writing through
or flushing a disk cache if present. The call blocks until the
device reports that the transfer has completed. It also flushes
metadata information associated with the file (see stat(2)).

As I understand it, when fsync is received, all queued writes must be 
sent to the device (regardless of whether a previous write has failed or 
not). It also says that the call blocks until the device has finished all 
the operations.


However, it's not clear to me how to control file consistency, because 
this allows some writes to succeed after a failed one. I assume that 
controlling this is the responsibility of the calling application, which 
should issue fsyncs at critical points to guarantee consistency.
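
For illustration, this is the kind of pattern I mean (just a minimal 
sketch with a placeholder file name, unrelated to any Gluster internals): 
a cached/written-behind write can appear to succeed and only report its 
failure at fsync() or close() time, so both have to be checked.

    /* Sketch: errors from cached writes may only surface at fsync()/close(). */
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    int write_critical(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            return -errno;

        ssize_t ret = write(fd, buf, len);   /* may only reach the cache */
        if (ret < 0 || (size_t)ret != len) {
            close(fd);
            return ret < 0 ? -errno : -EIO;
        }

        if (fsync(fd) < 0) {                 /* deferred write errors show up here */
            int err = -errno;
            close(fd);
            return err;
        }

        if (close(fd) < 0)                   /* ... or, on some systems, here */
            return -errno;

        return 0;
    }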


Anyway, it seems that there's a difference between Linux and NetBSD, 
because this test only fails on NetBSD. Is it possible that Linux's FUSE 
implementation delays the fsync request until all pending writes have 
been answered? This would explain why this problem has not manifested 
until now. NetBSD seems to send the fsync (probably as the first step of 
a close() call) when the first write fails.


Xavi


Re: [Gluster-devel] Wrong behavior on fsync of md-cache ?

2014-11-25 Thread Emmanuel Dreyfus
On Tue, Nov 25, 2014 at 09:35:25AM +0100, Xavier Hernandez wrote:
 Anyway it seems that there's a difference between linux and NetBSD because
 this test only fails on NetBSD. Is it possible that linux's fuse
 implementation delays the fsync request until all pending writes have been
 answered ? this would explain why this problem has not manifested till now.
 NetBSD seems to send fsync (probably as the first step of a close() call)
 when the first write fails.

I confirm that NetBSD FUSE sends an fsync before dropping the last 
reference on the vnode. That happens on close, and it means the last
close will wait for the data to be synced to disk. At that time there can 
be pending writes, because the page cache flush is done asynchronously:
write system calls return after storing data in the page cache, and the 
cache is flushed to the filesystem later.

The kernel also flushes the page cache and sends fsyncs at regular
intervals, based on elapsed time and the amount of data written.

-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: [Gluster-devel] Wrong behavior on fsync of md-cache ?

2014-11-25 Thread Raghavendra Gowdappa


- Original Message -
 From: Xavier Hernandez xhernan...@datalab.es
 To: Raghavendra Gowdappa rgowd...@redhat.com
 Cc: Gluster Devel gluster-devel@gluster.org, Emmanuel Dreyfus 
 m...@netbsd.org
 Sent: Tuesday, November 25, 2014 2:05:25 PM
 Subject: Re: Wrong behavior on fsync of md-cache ?
 
 On 11/25/2014 07:38 AM, Raghavendra Gowdappa wrote:
  - Original Message -
  From: Xavier Hernandez xhernan...@datalab.es
  To: Raghavendra Gowdappa rgowd...@redhat.com
  Cc: Gluster Devel gluster-devel@gluster.org, Emmanuel Dreyfus
  m...@netbsd.org
  Sent: Tuesday, November 25, 2014 12:49:03 AM
  Subject: Re: Wrong behavior on fsync of md-cache ?
 
  I think the problem is here: the first thing wb_fsync()
  checks is if there's an error in the fd (wd_fd_err()). If that's the
  case, the call is immediately unwinded with that error. The error seems
  to be set in wb_fulfill_cbk(). I don't know the internals of write-back
  xlator, but this seems to be the problem.
 
  Yes, your analysis is correct. Once the error is hit, fsync is not
  queued  behind unfulfilled writes. Whether it can be considered as a bug
  is debatable.  Since there is already an error in one of the writes which
  was written-behind  fsync should return the error. I am not sure whether
  it should wait till we try to flush _all_ the writes that were written
  behind. Any suggestions on what is the expected behaviour here?
 
 
 I think that it should wait for all pending writes. In the test case I
 used, all pending writes will fail the same way that the first one, but
 in other situations it's possible to have a write failing (for example
 due to a damaged block in disk) and following writes succeeding.
 
  From the man page of fsync:
 
  fsync() transfers (flushes) all modified in-core data of (i.e.,
  modified buffer cache pages for) the file referred to by the file
  descriptor fd to the disk device (or other permanent storage
  device) so that all changed information can be retrieved even after
  the system crashed or was rebooted. This includes writing through
  or flushing a disk cache if present. The call blocks until the
  device reports that the transfer has completed. It also flushes
  metadata information associated with the file (see stat(2)).
 
 As I understand it, when fsync is received all queued writes must be
 sent to the device (regardless if a previous write has failed or not).
 It also says that the call blocks until the device has finished all the
 operations.
 
 However it's not clear to me how to control file consistency because
 this allows some writes to succeed after a failed one. 

Though fsync doesn't wait on queued writes after a failure, the queued writes 
are flushed to disk even in the existing codebase. Can you file a bug to make 
fsync wait for the completion of queued writes, irrespective of whether flushing 
any of them failed or not? I'll send a patch to fix the issue. Just to 
prioritise this: how important is the fix?

 I assume that
 controlling this is the responsibility of the calling application that
 should issue fsyncs on critical points to guarantee consistency.
 
 Anyway it seems that there's a difference between linux and NetBSD
 because this test only fails on NetBSD. Is it possible that linux's fuse
 implementation delays the fsync request until all pending writes have
 been answered ? this would explain why this problem has not manifested
 till now. NetBSD seems to send fsync (probably as the first step of a
 close() call) when the first write fails.
 
 Xavi
 


Re: [Gluster-devel] Wrong behavior on fsync of md-cache ?

2014-11-25 Thread Xavier Hernandez

On 11/25/2014 12:59 PM, Raghavendra Gowdappa wrote:



- Original Message -

From: Xavier Hernandez xhernan...@datalab.es
To: Raghavendra Gowdappa rgowd...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org, Emmanuel Dreyfus 
m...@netbsd.org
Sent: Tuesday, November 25, 2014 2:05:25 PM
Subject: Re: Wrong behavior on fsync of md-cache ?

On 11/25/2014 07:38 AM, Raghavendra Gowdappa wrote:

- Original Message -

From: Xavier Hernandez xhernan...@datalab.es
To: Raghavendra Gowdappa rgowd...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org, Emmanuel Dreyfus
m...@netbsd.org
Sent: Tuesday, November 25, 2014 12:49:03 AM
Subject: Re: Wrong behavior on fsync of md-cache ?

I think the problem is here: the first thing wb_fsync()
checks is if there's an error in the fd (wd_fd_err()). If that's the
case, the call is immediately unwinded with that error. The error seems
to be set in wb_fulfill_cbk(). I don't know the internals of write-back
xlator, but this seems to be the problem.


Yes, your analysis is correct. Once the error is hit, fsync is not
queued  behind unfulfilled writes. Whether it can be considered as a bug
is debatable.  Since there is already an error in one of the writes which
was written-behind  fsync should return the error. I am not sure whether
it should wait till we try to flush _all_ the writes that were written
behind. Any suggestions on what is the expected behaviour here?



I think that it should wait for all pending writes. In the test case I
used, all pending writes will fail the same way that the first one, but
in other situations it's possible to have a write failing (for example
due to a damaged block in disk) and following writes succeeding.

  From the man page of fsync:

  fsync() transfers (flushes) all modified in-core data of (i.e.,
  modified buffer cache pages for) the file referred to by the file
  descriptor fd to the disk device (or other permanent storage
  device) so that all changed information can be retrieved even after
  the system crashed or was rebooted. This includes writing through
  or flushing a disk cache if present. The call blocks until the
  device reports that the transfer has completed. It also flushes
  metadata information associated with the file (see stat(2)).

As I understand it, when fsync is received all queued writes must be
sent to the device (regardless if a previous write has failed or not).
It also says that the call blocks until the device has finished all the
operations.

However it's not clear to me how to control file consistency because
this allows some writes to succeed after a failed one.


Though fsync doesn't wait on queued writes after a failure, the queued writes 
are flushed to disk even in the existing codebase. Can you file a bug to make 
fsync to wait for completion of queued writes irrespective of whether flushing 
any of them failed or not? I'll send a patch to fix the issue.


I filed bug #1167793


Just to prioritise this, how important is the fix?


It seems to fail only on NetBSD, so I'm not sure what priority it should have. 
Emmanuel is trying to set up a regression run for new patches that executes 
all the tests in tests/basic, and tests/basic/ec/quota.t hits this issue.


An alternative would be to temporarily remove or change this test to 
avoid the problem.


Xavi


Re: [Gluster-devel] Wrong behavior on fsync of md-cache ?

2014-11-25 Thread Emmanuel Dreyfus
Xavier Hernandez xhernan...@datalab.es wrote:

 An alternative would be to temporarily remove or change this test to 
 avoid the problem.

That would help with that test, but I suspect the same problem is
responsible for other spurious failures.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: [Gluster-devel] EHT / DHT

2014-11-25 Thread Jan H Holtzhausen
Are you referring to something else in your request? Meaning, you want 
/myfile, /dir1/myfile and /dir2/dir3/myfile to fall onto the same 
 bricks/subvolumes and that perchance is what you are looking for?

That is EXACTLY what I am looking for.
What are my chances?

BR
Jan



Re: [Gluster-devel] EHT / DHT

2014-11-25 Thread Jan H Holtzhausen
I think I have it.
Unless I’m totally confused, I can hash ONLY on the filename with:
glusterfs --volfile-server=a_server --volfile-id=a_volume \
   --xlator-option a_volume-dht.extra_hash_regex='.*[/]' \
   /a/mountpoint
Correct?

Jan

From:  Jan H Holtzhausen j...@holtztech.info
Date:  Tuesday 25 November 2014 at 9:06 PM
To:  gluster-devel@gluster.org
Subject:  Re: [Gluster-devel] EHT / DHT

Are you referring to something else in your request? Meaning, you want 
/myfile, /dir1/myfile and /dir2/dir3/myfile to fall onto the same 
 bricks/subvolumes and that perchance is what you are looking for?

That is EXACTLY what I am looking for.
What are my chances?

BR
Jan


Re: [Gluster-devel] EHT / DHT

2014-11-25 Thread Shyam

On 11/25/2014 02:28 PM, Jan H Holtzhausen wrote:

I think I have it.
Unless I’m totally confused, I can hash ONLY on the filename with:

glusterfs --volfile-server=a_server --volfile-id=a_volume \
--xlator-option a_volume-dht.extra_hash_regex='.*[/]' \
/a/mountpoint

Correct?


The hash of a file does not include the full path; it is computed on the file 
name _only_. So no regex will help when the file name remains 
constant, like myfile.


As Jeff explained, the option is really there to prevent temporary parts 
of the name from being used in the hash computation (a rename optimization). 
In this case, you do not seem to have any temporary part in the name; the 
option is for cases where, say, myfile and myfile~ should evaluate to the 
same hash, so the regex strips the trailing '~' from the name.


So I am not sure the above is the option you are looking for.
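
To illustrate what the regex option conceptually does (a sketch only, not 
the actual DHT code, and the exact semantics here are an assumption on my 
part): when the configured pattern matches the name and has one capture 
group, only the captured part is used for hashing, so something like 
myfile~ hashes as myfile, while a non-matching name is hashed unchanged.

    /* Sketch (assumption, not DHT code): reduce a name via a regex with one
     * capture group before hashing; e.g. pattern "^(.+)~$" maps "myfile~"
     * to "myfile", and a non-matching name is hashed as-is. */
    #include <regex.h>
    #include <stdio.h>
    #include <string.h>

    static void name_for_hash(const char *name, const char *pattern,
                              char *out, size_t outlen)
    {
        regex_t re;
        regmatch_t m[2];

        snprintf(out, outlen, "%s", name);          /* default: full name */

        if (regcomp(&re, pattern, REG_EXTENDED) != 0)
            return;
        if (regexec(&re, name, 2, m, 0) == 0 && m[1].rm_so != -1) {
            size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
            if (len >= outlen)
                len = outlen - 1;
            memcpy(out, name + m[1].rm_so, len);
            out[len] = '\0';
        }
        regfree(&re);
    }

Since the name that matters in your example never changes (myfile stays 
myfile in every directory), no such reduction can affect where it hashes.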



Jan

From: Jan H Holtzhausen j...@holtztech.info mailto:j...@holtztech.info
Date: Tuesday 25 November 2014 at 9:06 PM
To: gluster-devel@gluster.org mailto:gluster-devel@gluster.org
Subject: Re: [Gluster-devel] EHT / DHT


Are you referring to something else in your request? Meaning, you want



/myfile, /dir1/myfile and /dir2/dir3/myfile to fall onto the same



bricks/subvolumes and that perchance is what you are looking for?



That is EXACTLY what I am looking for.

What are my chances?


As far as I know, not much out of the box. As Jeff explained, the 
directory distribution/layout considers the GFID of the directory, hence 
each of the directories in the above example would/could get different 
ranges.


The file name, on the other hand, remains constant (myfile), so its hash 
value remains the same; but because the distribution ranges differ per 
directory as described above, the file will land on different bricks and 
not the same one.
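
To make that concrete, here is a toy model (made-up hash function and a 
simplified per-directory rotation; this is not the actual DHT code) of 
why a constant name still lands on different bricks:

    /* Toy model (simplified, not DHT code): 4 bricks split the 32-bit hash
     * space into equal ranges; each directory rotates the brick order by an
     * offset derived from the directory, so hash("myfile") is constant but
     * the owning brick differs per directory. */
    #include <stdint.h>
    #include <stdio.h>

    #define NBRICKS 4

    static uint32_t toy_hash(const char *s)      /* stand-in for dm_hash() */
    {
        uint32_t h = 2166136261u;
        while (*s)
            h = (h ^ (uint8_t)*s++) * 16777619u;
        return h;
    }

    static int brick_for(const char *dir, const char *name)
    {
        uint32_t range = toy_hash(name) / (UINT32_MAX / NBRICKS + 1);
        uint32_t rotation = toy_hash(dir) % NBRICKS;  /* per-directory rotation */
        return (int)((range + rotation) % NBRICKS);
    }

    int main(void)
    {
        printf("/dir1/myfile -> brick %d\n", brick_for("/dir1", "myfile"));
        printf("/dir2/myfile -> brick %d\n", brick_for("/dir2", "myfile"));
        return 0;
    }

With four bricks and independent per-directory rotations, two directories 
place the same name on the same brick only when their rotations happen to 
coincide.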


Out of curiosity, why is this functionality needed?

Shyam


Re: [Gluster-devel] EHT / DHT

2014-11-25 Thread Jan H Holtzhausen
Hmm.
Then something is wrong:
if I upload 2 identical files with different paths, they only end up on 
the same server 1/4 of the time (I have 4 bricks).
I’ll test the regex quickly.

BR
Jan




On 2014/11/25, 7:55 PM, Shyam srang...@redhat.com wrote:

On 11/25/2014 02:28 PM, Jan H Holtzhausen wrote:
 I think I have it.
 Unless I’m totally confused, I can hash ONLY on the filename with:

 glusterfs --volfile-server=a_server --volfile-id=a_volume \
 --xlator-option a_volume-dht.extra_hash_regex='.*[/]' \
 /a/mountpoint

 Correct?

The hash of a file does not include the full path, it is on the file 
name _only_. So any regex will not work when the filename remains 
constant like myfile.

As Jeff explains the option is really to prevent using temporary parts 
of the name in the hash computation (for rename optimization). In this 
case, you do not seem to have any tmp parts to the name, like myfile 
and myfile~ should evaluate to the same hash, so remove all trailing 
'~' from the name.

So I am not sure the above is the option you are looking for.


 Jan

 From: Jan H Holtzhausen j...@holtztech.info 
mailto:j...@holtztech.info
 Date: Tuesday 25 November 2014 at 9:06 PM
 To: gluster-devel@gluster.org mailto:gluster-devel@gluster.org
 Subject: Re: [Gluster-devel] EHT / DHT

Are you referring to something else in your request? Meaning, you want

/myfile, /dir1/myfile and /dir2/dir3/myfile to fall onto the same

 bricks/subvolumes and that perchance is what you are looking for?


 That is EXACTLY what I am looking for.

 What are my chances?

As far as I know not much out of the box. As Jeff explained, the 
directory distribution/layout considers the GFID of the directory, hence 
each of the directories in the above example would/could get different 
ranges.

The file on the other hand remains constant myfile so its hash value 
remains the same, but due to the distribution range change as above for 
the directories, it will land on different bricks and not the same one.

Out of curiosity, why is this functionality needed?

Shyam


Re: [Gluster-devel] EHT / DHT

2014-11-25 Thread Shyam

On 11/25/2014 03:11 PM, Jan H Holtzhausen wrote:

STILL doesn’t work … exact same file ends up on 2 different bricks …
I must be missing something.
All I need is for:
/directory1/subdirectory2/foo
And
/directory2/subdirectoryaaa999/foo


To end up on the same brick….


This is not possible, which is what I was attempting to state in the previous 
mail. The regex filter is not for this purpose.


The hash is always based on the name of the file, but the location is 
based on the distribution/layout of the directory, which is different 
for each directory based on its GFID.


So there are no options in the code to enable what you seek at present.

Why is this needed?

Shyam


Re: [Gluster-devel] EHT / DHT

2014-11-25 Thread Jan H Holtzhausen
So in a distributed cluster, the GFID tells all bricks what a file's 
preceding directory structure looks like?
Where the physical file is saved is a function of the filename ONLY.
Therefore my requirement should be met by default, or am I being dense?

BR
Jan



On 2014/11/25, 8:15 PM, Shyam srang...@redhat.com wrote:

On 11/25/2014 03:11 PM, Jan H Holtzhausen wrote:
 STILL doesn’t work … exact same file ends up on 2 different bricks …
 I must be missing something.
 All I need is for:
 /directory1/subdirectory2/foo
 And
 /directory2/subdirectoryaaa999/foo


 To end up on the same brick….

This is not possible is what I was attempting to state in the previous 
mail. The regex filter is not for this purpose.

The hash is always based on the name of the file, but the location is 
based on the distribution/layout of the directory, which is different 
for each directory based on its GFID.

So there are no options in the code to enable what you seek at present.

Why is this needed?

Shyam



Re: [Gluster-devel] EHT / DHT

2014-11-25 Thread Jan H Holtzhausen
As to the why.
Filesystem cache hits.
Files with the same name tend to be the same files.

Regards
Jan




On 2014/11/25, 8:42 PM, Jan H Holtzhausen j...@holtztech.info wrote:

So in a distributed cluster, the GFID tells all bricks what a files 
preceding directory structure looks like?
Where the physical file is saved is a function of the filename ONLY.
Therefore My requirement should be met by default, or am I being dense?

BR
Jan



On 2014/11/25, 8:15 PM, Shyam srang...@redhat.com wrote:

On 11/25/2014 03:11 PM, Jan H Holtzhausen wrote:
 STILL doesn’t work … exact same file ends up on 2 different bricks …
 I must be missing something.
 All I need is for:
 /directory1/subdirectory2/foo
 And
 /directory2/subdirectoryaaa999/foo


 To end up on the same brick….

This is not possible is what I was attempting to state in the previous 
mail. The regex filter is not for this purpose.

The hash is always based on the name of the file, but the location is 
based on the distribution/layout of the directory, which is different 
for each directory based on its GFID.

So there are no options in the code to enable what you seek at present.

Why is this needed?

Shyam



Re: [Gluster-devel] Single layout at root (Was EHT / DHT)

2014-11-25 Thread Anand Avati
On Tue Nov 25 2014 at 1:28:59 PM Shyam srang...@redhat.com wrote:

 On 11/12/2014 01:55 AM, Anand Avati wrote:
 
 
  On Tue, Nov 11, 2014 at 1:56 PM, Jeff Darcy jda...@redhat.com
  mailto:jda...@redhat.com wrote:
 
(Personally I would have
  done this by mixing in the parent GFID to the hash calculation, but
  that alternative was ignored.)
 
 
  Actually when DHT was implemented, the concept of GFID did not (yet)
  exist. Due to backward compatibility it has just remained this way even
  later. Including the GFID into the hash has benefits.

 I am curious here as this is interesting.

 So the layout start subvol assignment for a directory to be based on its
 GFID was provided so that files with the same name distribute better
 than ending up in the same bricks, right?


Right; e.g., we wouldn't want all the README.txt files in the various directories
of a volume to end up on the same server. The way it is achieved today is that
the per-server hash-range assignment is rotated by a certain amount (how
much it is rotated is determined by a separate hash on the directory path)
at the time of mkdir.


 Instead as we _now_ have GFID, we could use that including the name to
 get a similar/better distribution, or GFID+name to determine hashed subvol.


What we could do now is include the parent directory GFID as an input to
the DHT hash function.

Today, we do approximately:
  int hashval = dm_hash (readme.txt)
  hash_ranges[] = inode_ctx_get (parent_dir)
  subvol = find_subvol (hash_ranges, hashval)

Instead, we could:
  int hashval = new_hash (readme.txt, parent_dir.gfid)
  hash_ranges[] = global_value
  subvol = find_subvol (hash_ranges, hashval)
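
Fleshing that out a bit, a self-contained sketch (toy hash, illustrative
types and names, not the actual GlusterFS code) of the proposed scheme:

    /* Sketch of the proposed variant: one global range table for the whole
     * volume; per-directory variation comes from mixing the parent GFID into
     * the hash rather than from per-directory layouts. */
    #include <stdint.h>
    #include <stddef.h>

    #define NBRICKS 4
    typedef unsigned char gfid_t[16];             /* parent directory GFID */

    struct range { uint32_t start, stop; int subvol; };

    /* single, volume-wide layout shared by every directory */
    static const struct range global_ranges[NBRICKS] = {
        { 0x00000000u, 0x3fffffffu, 0 },
        { 0x40000000u, 0x7fffffffu, 1 },
        { 0x80000000u, 0xbfffffffu, 2 },
        { 0xc0000000u, 0xffffffffu, 3 },
    };

    static uint32_t new_hash(const char *name, const gfid_t pgfid)
    {
        uint32_t h = 2166136261u;                 /* toy FNV-1a, not dm_hash */
        for (size_t i = 0; i < sizeof(gfid_t); i++)
            h = (h ^ pgfid[i]) * 16777619u;       /* mix in the parent GFID */
        while (*name)
            h = (h ^ (uint8_t)*name++) * 16777619u;
        return h;
    }

    static int find_subvol(uint32_t hashval)
    {
        for (int i = 0; i < NBRICKS; i++)
            if (hashval >= global_ranges[i].start &&
                hashval <= global_ranges[i].stop)
                return global_ranges[i].subvol;
        return -1;
    }

The point being that hash_ranges[] becomes a single volume-wide table, and
all per-directory variation comes from folding the parent GFID into the
hash itself.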

The idea here would be that on dentry creates we would need to generate
 the GFID and not let the bricks generate the same, so that we can choose
 the subvol to wind the FOP to.


The GFID would be that of the parent (as an entry name is always in the
context of a parent directory/inode). Also, the GFID for a new entry is
already generated by the client; the brick does not generate a GFID.


 This eliminates the need for a layout per sub-directory and all the
 (interesting) problems that it comes with and instead can be replaced by
 a layout at root. Not sure if it handles all use cases and paths that we
 have now (which needs more understanding).

 I do understand there is a backward compatibility issue here, but other
 than this, this sounds better than the current scheme, as there is a
 single layout to read/optimize/stash/etc. across clients.

 Can I understand the rationale of this better, as to what you folks are
 thinking. Am I missing something or over reading on the benefits that
 this can provide?


I think you understand it right. The benefit is that one could have a single
hash layout for the entire volume, with the directory-specificness
implemented by including the directory GFID in the hash function. The way
I see it, the compromise would be something like:

Pro per-directory ranges: By having per-directory hash ranges, we can do
easier incremental rebalance. Partial progress is well tolerated and does
not impact the entire volume. While a given directory is undergoing
rebalance, we need to enter unhashed lookup mode for that directory
alone, and only for that period of time.

Con per-directory ranges: Just the new hash assignment phase (which impacts
placement of new files/data without moving old data) is itself an extended
process, crawling the entire volume with complex per-directory operations.
The number of points in the system where things can break (i.e., result in
overlaps and holes in ranges) is high.

Pro single layout with dir GFID in hash: Avoids the numerous parts (per-dir
hash ranges) which can potentially break.

Con single layout with dir GFID in hash: Rebalance phase 1 (assigning the new
layout) is atomic for the entire volume - unhashed lookup has to be on
for all directories for the entire period. To mitigate this, we could explore
versioning the centralized hash ranges and storing the version used by each
directory in its xattrs (updating the version as the rebalance
progresses). But then we have more centralized metadata (may or may not be
a worthy compromise - not sure).

In summary, including the GFID in the hash calculation does open up
interesting possibilities and is worthy of serious consideration.

HTH,
Avati


Re: [Gluster-devel] Single layout at root (Was EHT / DHT)

2014-11-25 Thread Shyam

On 11/25/2014 05:03 PM, Anand Avati wrote:



On Tue Nov 25 2014 at 1:28:59 PM Shyam srang...@redhat.com
mailto:srang...@redhat.com wrote:

On 11/12/2014 01:55 AM, Anand Avati wrote:
 
 
  On Tue, Nov 11, 2014 at 1:56 PM, Jeff Darcy jda...@redhat.com
mailto:jda...@redhat.com
  mailto:jda...@redhat.com mailto:jda...@redhat.com wrote:
 
(Personally I would have
  done this by mixing in the parent GFID to the hash
calculation, but
  that alternative was ignored.)
 
 
  Actually when DHT was implemented, the concept of GFID did not (yet)
  exist. Due to backward compatibility it has just remained this
way even
  later. Including the GFID into the hash has benefits.

I am curious here as this is interesting.

So the layout start subvol assignment for a directory to be based on its
GFID was provided so that files with the same name distribute better
than ending up in the same bricks, right?


Right, for e.g we wouldn't want all the README.txt in various
directories of a volume to end up on the same server. The way it is
achieved today is, the per server hash-range assignment is rotated by
a certain amount (how much it is rotated is determined by a separate
hash on the directory path) at the time of mkdir.

Instead as we _now_ have GFID, we could use that including the name to
get a similar/better distribution, or GFID+name to determine hashed
subvol.

What we could do now is, include the parent directory gfid as an input
into the DHT hash function.

Today, we do approximately:
   int hashval = dm_hash (readme.txt)
   hash_ranges[] = inode_ctx_get (parent_dir)
   subvol = find_subvol (hash_ranges, hashval)

Instead, we could:
   int hashval = new_hash (readme.txt, parent_dir.gfid)
   hash_ranges[] = global_value
   subvol = find_subvol (hash_ranges, hashval)

The idea here would be that on dentry creates we would need to generate
the GFID and not let the bricks generate the same, so that we can choose
the subvol to wind the FOP to.


The GFID would be that of the parent (as an entry name is always in the
context of a parent directory/inode). Also, the GFID for a new entry is
already generated by the client, the brick does not generate a GFID.

This eliminates the need for a layout per sub-directory and all the
(interesting) problems that it comes with and instead can be replaced by
a layout at root. Not sure if it handles all use cases and paths that we
have now (which needs more understanding).

I do understand there is a backward compatibility issue here, but other
than this, this sounds better than the current scheme, as there is a
single layout to read/optimize/stash/etc. across clients.

Can I understand the rationale of this better, as to what you folks are
thinking. Am I missing something or over reading on the benefits that
this can provide?


I think you understand it right. The benefit is one could have a single
hash layout for the entire volume and the directory specific-ness is
implemented by including the directory gfid into the hash function. The
way I see it, the compromise would be something like:

Pro per directory range: By having per-directory hash ranges, we can do
easier incremental rebalance. Partial progress is well tolerated and
does not impact the entire volume. The time a given directory is
undergoing rebalance, for that directory alone we need to enter
unhashed lookup mode, only for that period of time.

Con per directory range: Just the new hash assignment phase (to impact
placement of new files/data, not move old data) itself is an extended
process, crawling the entire volume with complex per-directory
operations. The number of points in the system where things can break
(i.e, result in overlaps and holes in ranges) is high.

Pro single layout with dir GFID in hash: Avoid the numerous parts
(per-dir hash ranges) which can potentially break.

Con single layout with dir GFID in hash: Rebalance phase 1 (assigning
new layout) is atomic for the entire volume - unhashed lookup has to be
on for all dirs for the entire period. To mitigate this, we could
explore versioning the centralized hash ranges, and store the version
used by each directory in its xattrs (and update the version as the
rebalance progresses). But now we have more centralized metadata (may
be/ may not be a worthy compromise - not sure.)


Agreed, the auto-unhashed mode would have to wait longer before being rearmed.

Just throwing out some more thoughts on the same:

Auto-unhashed can also benefit from just linkto creations, rather than 
requiring a data rebalance (i.e., movement of data). So in phase-0 we could 
just create the linkto files and then turn on auto-unhashed, as lookups 
would then find the (linkto) file.


Other abilities, like giving directories weighted layout ranges based on the 
size of bricks, could be affected, i.e., forcing a rebalance when a brick 
size is 

Re: [Gluster-devel] EHT / DHT

2014-11-25 Thread Poornima Gurusiddaiah
Out of curiosity, what back end and deduplication solution are you using? 

Regards, 
Poornima 

- Original Message -

From: Jan H Holtzhausen j...@holtztech.info 
To: Anand Avati av...@gluster.org, Shyam srang...@redhat.com, 
gluster-devel@gluster.org 
Sent: Wednesday, November 26, 2014 3:43:36 AM 
Subject: Re: [Gluster-devel] EHT / DHT 

Yes we have deduplication at the filesystem layer 

BR 
Jan 

From: Anand Avati  av...@gluster.org  
Date: Wednesday 26 November 2014 at 12:11 AM 
To: Jan H Holtzhausen  j...@holtztech.info , Shyam  srang...@redhat.com ,  
gluster-devel@gluster.org  
Subject: Re: [Gluster-devel] EHT / DHT 

Unless there is some sort of de-duplication under the covers happening in the 
brick, or the files are hardlinks to each other, there is no cache benefit 
whatsoever by having identical files placed on the same server. 

Thanks, 
Avati 

On Tue Nov 25 2014 at 12:59:25 PM Jan H Holtzhausen  j...@holtztech.info  
wrote: 


As to the why. 
Filesystem cache hits. 
Files with the same name tend to be the same files. 

Regards 
Jan 




On 2014/11/25, 8:42 PM, Jan H Holtzhausen  j...@holtztech.info  wrote: 

So in a distributed cluster, the GFID tells all bricks what a files 
preceding directory structure looks like? 
Where the physical file is saved is a function of the filename ONLY. 
Therefore My requirement should be met by default, or am I being dense? 
 
BR 
Jan 
 
 
 
On 2014/11/25, 8:15 PM, Shyam  srang...@redhat.com  wrote: 
 
On 11/25/2014 03:11 PM, Jan H Holtzhausen wrote: 
 STILL doesn’t work … exact same file ends up on 2 different bricks … 
 I must be missing something. 
 All I need is for: 
 /directory1/subdirectory2/foo 
 And 
 /directory2/subdirectoryaaa999/foo 
 
 
 To end up on the same brick…. 
 
This is not possible is what I was attempting to state in the previous 
mail. The regex filter is not for this purpose. 
 
The hash is always based on the name of the file, but the location is 
based on the distribution/layout of the directory, which is different 
for each directory based on its GFID. 
 
So there are no options in the code to enable what you seek at present. 
 
Why is this needed? 
 
Shyam 
 


Re: [Gluster-devel] EHT / DHT

2014-11-25 Thread Jan H Holtzhausen
I could tell you… 
But Symantec wouldn’t like it…..

From:  Poornima Gurusiddaiah pguru...@redhat.com
Date:  Wednesday 26 November 2014 at 7:16 AM
To:  Jan H Holtzhausen j...@holtztech.info
Cc:  gluster-devel@gluster.org
Subject:  Re: [Gluster-devel] EHT / DHT

Out of curiosity, what back end and deduplication solution are you using?

Regards,
Poornima

From: Jan H Holtzhausen j...@holtztech.info
To: Anand Avati av...@gluster.org, Shyam srang...@redhat.com, 
gluster-devel@gluster.org
Sent: Wednesday, November 26, 2014 3:43:36 AM
Subject: Re: [Gluster-devel] EHT / DHT

Yes we have deduplication at the filesystem layer

BR
Jan

From:  Anand Avati av...@gluster.org
Date:  Wednesday 26 November 2014 at 12:11 AM
To:  Jan H Holtzhausen j...@holtztech.info, Shyam srang...@redhat.com, 
gluster-devel@gluster.org
Subject:  Re: [Gluster-devel] EHT / DHT

Unless there is some sort of de-duplication under the covers happening in 
the brick, or the files are hardlinks to each other, there is no cache 
benefit whatsoever by having identical files placed on the same server.

Thanks,
Avati

On Tue Nov 25 2014 at 12:59:25 PM Jan H Holtzhausen j...@holtztech.info 
wrote:
As to the why.
Filesystem cache hits.
Files with the same name tend to be the same files.

Regards
Jan




On 2014/11/25, 8:42 PM, Jan H Holtzhausen j...@holtztech.info wrote:

So in a distributed cluster, the GFID tells all bricks what a files
preceding directory structure looks like?
Where the physical file is saved is a function of the filename ONLY.
Therefore My requirement should be met by default, or am I being dense?

BR
Jan



On 2014/11/25, 8:15 PM, Shyam srang...@redhat.com wrote:

On 11/25/2014 03:11 PM, Jan H Holtzhausen wrote:
 STILL doesn’t work … exact same file ends up on 2 different bricks …
 I must be missing something.
 All I need is for:
 /directory1/subdirectory2/foo
 And
 /directory2/subdirectoryaaa999/foo


 To end up on the same brick….

This is not possible is what I was attempting to state in the previous
mail. The regex filter is not for this purpose.

The hash is always based on the name of the file, but the location is
based on the distribution/layout of the directory, which is different
for each directory based on its GFID.

So there are no options in the code to enable what you seek at present.

Why is this needed?

Shyam
