[Gluster-devel] REMINDER: Gluster Community Bug Triage meeting today at 12:00 UTC
Hi all,

Later today we will have another Gluster Community Bug Triage meeting.

Meeting details:
- location: #gluster-meeting on Freenode IRC
- date: every Tuesday
- time: 12:00 UTC, 13:00 CET (in your terminal, run: date -d "12:00 UTC")
- agenda: https://public.pad.fsfe.org/p/gluster-bug-triage

Currently the following items are listed:
* Roll Call
* Status of last week's action items
* Group Triage
* Open Floor

The last two topics have space for additions. If you have a suitable bug or topic to discuss, please add it to the agenda.

Your host today is LalatenduM. I'm unfortunately not available this/my afternoon.

Thanks,
Niels
Re: [Gluster-devel] Wrong behavior on fsync of md-cache ?
On 11/25/2014 07:38 AM, Raghavendra Gowdappa wrote:
> ----- Original Message -----
> From: Xavier Hernandez xhernan...@datalab.es
> To: Raghavendra Gowdappa rgowd...@redhat.com
> Cc: Gluster Devel gluster-devel@gluster.org, Emmanuel Dreyfus m...@netbsd.org
> Sent: Tuesday, November 25, 2014 12:49:03 AM
> Subject: Re: Wrong behavior on fsync of md-cache ?
>
>> I think the problem is here: the first thing wb_fsync() checks is whether there's an error in the fd (wb_fd_err()). If that's the case, the call is immediately unwound with that error. The error seems to be set in wb_fulfill_cbk(). I don't know the internals of the write-back xlator, but this seems to be the problem.
>
> Yes, your analysis is correct. Once the error is hit, fsync is not queued behind unfulfilled writes. Whether it can be considered a bug is debatable. Since there is already an error in one of the writes which was written behind, fsync should return the error. I am not sure whether it should wait till we try to flush _all_ the writes that were written behind. Any suggestions on what is the expected behaviour here?

I think that it should wait for all pending writes. In the test case I used, all pending writes will fail the same way as the first one, but in other situations it's possible to have a write failing (for example due to a damaged block on disk) and following writes succeeding.

From the man page of fsync:

    fsync() transfers (flushes) all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).

As I understand it, when fsync is received all queued writes must be sent to the device (regardless of whether a previous write has failed or not). It also says that the call blocks until the device has finished all the operations. However it's not clear to me how to control file consistency, because this allows some writes to succeed after a failed one.

I assume that controlling this is the responsibility of the calling application, which should issue fsyncs at critical points to guarantee consistency.

Anyway it seems that there's a difference between Linux and NetBSD, because this test only fails on NetBSD. Is it possible that Linux's FUSE implementation delays the fsync request until all pending writes have been answered? This would explain why this problem has not manifested till now. NetBSD seems to send fsync (probably as the first step of a close() call) when the first write fails.

Xavi
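The early-unwind behaviour described above can be summarised in a small, self-contained sketch (illustrative names only, not the actual write-behind code): once a background write has stored an error on the fd, a later fsync returns that error immediately and ignores the writes that are still queued.

    /* Minimal sketch of the behaviour being discussed; the struct and helper
     * names are made up for illustration and are not the write-behind code. */
    #include <stdio.h>
    #include <errno.h>

    struct wb_fd_ctx {
            int stored_err;      /* set by the callback of a failed background write */
            int pending_writes;  /* writes accepted by write-behind but not yet flushed */
    };

    /* models the callback of a background write noticing a failed flush */
    static void on_background_write_failed (struct wb_fd_ctx *ctx, int err)
    {
            if (!ctx->stored_err)
                    ctx->stored_err = err;
    }

    /* models the early-unwind check at the start of the fsync handler */
    static int fsync_sketch (struct wb_fd_ctx *ctx)
    {
            if (ctx->stored_err)
                    return -ctx->stored_err;   /* unwound at once; pending_writes ignored */

            /* otherwise the fsync would be queued behind ctx->pending_writes */
            return 0;
    }

    int main (void)
    {
            struct wb_fd_ctx ctx = { 0, 3 };

            on_background_write_failed (&ctx, EIO);
            printf ("fsync returns %d with %d writes still pending\n",
                    fsync_sketch (&ctx), ctx.pending_writes);
            return 0;
    }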
Re: [Gluster-devel] Wrong behavior on fsync of md-cache ?
On Tue, Nov 25, 2014 at 09:35:25AM +0100, Xavier Hernandez wrote:
> Anyway it seems that there's a difference between Linux and NetBSD, because this test only fails on NetBSD. Is it possible that Linux's FUSE implementation delays the fsync request until all pending writes have been answered? This would explain why this problem has not manifested till now. NetBSD seems to send fsync (probably as the first step of a close() call) when the first write fails.

I confirm that NetBSD FUSE sends a fsync before dropping the last reference on the vnode. That happens on close, and it means the last close will wait for data to be synced to disk. At that time there can be pending writes because the page cache flush is done asynchronously: write system calls return after storing data in the page cache, and the cache is flushed to the filesystem later. The kernel also flushes the page cache and sends fsyncs at regular time and data-written intervals.

-- 
Emmanuel Dreyfus
m...@netbsd.org
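The point about write() returning as soon as the data is in the page cache is easy to see from the application side: the write itself can succeed, and a deferred flush error only shows up at fsync() or at the final close(). A minimal POSIX example (the path below is just a placeholder):

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main (void)
    {
            const char *path = "/mnt/gluster/example.txt";   /* hypothetical mount point */
            int fd = open (path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0) { perror ("open"); return 1; }

            const char buf[] = "important data\n";
            if (write (fd, buf, sizeof (buf) - 1) < 0)   /* may only reach the page cache */
                    perror ("write");

            if (fsync (fd) < 0)      /* this is where a deferred flush error is reported */
                    perror ("fsync");

            if (close (fd) < 0)      /* NetBSD issues a final sync at the last close as well */
                    perror ("close");
            return 0;
    }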
Re: [Gluster-devel] Wrong behavior on fsync of md-cache ?
----- Original Message -----
From: Xavier Hernandez xhernan...@datalab.es
To: Raghavendra Gowdappa rgowd...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org, Emmanuel Dreyfus m...@netbsd.org
Sent: Tuesday, November 25, 2014 2:05:25 PM
Subject: Re: Wrong behavior on fsync of md-cache ?

> On 11/25/2014 07:38 AM, Raghavendra Gowdappa wrote:
>> ----- Original Message -----
>> From: Xavier Hernandez xhernan...@datalab.es
>> To: Raghavendra Gowdappa rgowd...@redhat.com
>> Cc: Gluster Devel gluster-devel@gluster.org, Emmanuel Dreyfus m...@netbsd.org
>> Sent: Tuesday, November 25, 2014 12:49:03 AM
>> Subject: Re: Wrong behavior on fsync of md-cache ?
>>
>>> I think the problem is here: the first thing wb_fsync() checks is whether there's an error in the fd (wb_fd_err()). If that's the case, the call is immediately unwound with that error. The error seems to be set in wb_fulfill_cbk(). I don't know the internals of the write-back xlator, but this seems to be the problem.
>>
>> Yes, your analysis is correct. Once the error is hit, fsync is not queued behind unfulfilled writes. Whether it can be considered a bug is debatable. Since there is already an error in one of the writes which was written behind, fsync should return the error. I am not sure whether it should wait till we try to flush _all_ the writes that were written behind. Any suggestions on what is the expected behaviour here?
>
> I think that it should wait for all pending writes. In the test case I used, all pending writes will fail the same way as the first one, but in other situations it's possible to have a write failing (for example due to a damaged block on disk) and following writes succeeding.
>
> From the man page of fsync:
>
>     fsync() transfers (flushes) all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).
>
> As I understand it, when fsync is received all queued writes must be sent to the device (regardless of whether a previous write has failed or not). It also says that the call blocks until the device has finished all the operations. However it's not clear to me how to control file consistency, because this allows some writes to succeed after a failed one.

Though fsync doesn't wait on queued writes after a failure, the queued writes are flushed to disk even in the existing codebase. Can you file a bug to make fsync wait for completion of queued writes irrespective of whether flushing any of them failed or not? I'll send a patch to fix the issue. Just to prioritise this, how important is the fix?

> I assume that controlling this is the responsibility of the calling application, which should issue fsyncs at critical points to guarantee consistency.
>
> Anyway it seems that there's a difference between Linux and NetBSD, because this test only fails on NetBSD. Is it possible that Linux's FUSE implementation delays the fsync request until all pending writes have been answered? This would explain why this problem has not manifested till now. NetBSD seems to send fsync (probably as the first step of a close() call) when the first write fails.
>
> Xavi
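For comparison, a minimal sketch (illustrative names only, not an actual patch) of the behaviour proposed here: fsync drains the whole queue first and only then reports whichever error was recorded, so writes queued after a failed one still get their chance to reach the disk.

    #include <stdio.h>
    #include <errno.h>

    struct wb_fd_ctx {
            int stored_err;
            int pending_writes;
    };

    /* called once per queued write as write-behind attempts to flush it */
    static void flush_one_write (struct wb_fd_ctx *ctx, int err)
    {
            if (err && !ctx->stored_err)
                    ctx->stored_err = err;     /* remember the first failure */
            ctx->pending_writes--;
    }

    /* proposed fsync: drain the queue first, then report any stored error */
    static int fsync_wait_for_all (struct wb_fd_ctx *ctx)
    {
            while (ctx->pending_writes > 0)
                    flush_one_write (ctx, 0);  /* later writes may still succeed */
            return ctx->stored_err ? -ctx->stored_err : 0;
    }

    int main (void)
    {
            struct wb_fd_ctx ctx = { EIO, 3 }; /* one write already failed, three still queued */

            printf ("fsync returns %d after draining the queue\n",
                    fsync_wait_for_all (&ctx));
            return 0;
    }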
Re: [Gluster-devel] Wrong behavior on fsync of md-cache ?
On 11/25/2014 12:59 PM, Raghavendra Gowdappa wrote:
> ----- Original Message -----
> From: Xavier Hernandez xhernan...@datalab.es
> To: Raghavendra Gowdappa rgowd...@redhat.com
> Cc: Gluster Devel gluster-devel@gluster.org, Emmanuel Dreyfus m...@netbsd.org
> Sent: Tuesday, November 25, 2014 2:05:25 PM
> Subject: Re: Wrong behavior on fsync of md-cache ?
>
>> On 11/25/2014 07:38 AM, Raghavendra Gowdappa wrote:
>>> ----- Original Message -----
>>> From: Xavier Hernandez xhernan...@datalab.es
>>> To: Raghavendra Gowdappa rgowd...@redhat.com
>>> Cc: Gluster Devel gluster-devel@gluster.org, Emmanuel Dreyfus m...@netbsd.org
>>> Sent: Tuesday, November 25, 2014 12:49:03 AM
>>> Subject: Re: Wrong behavior on fsync of md-cache ?
>>>
>>>> I think the problem is here: the first thing wb_fsync() checks is whether there's an error in the fd (wb_fd_err()). If that's the case, the call is immediately unwound with that error. The error seems to be set in wb_fulfill_cbk(). I don't know the internals of the write-back xlator, but this seems to be the problem.
>>>
>>> Yes, your analysis is correct. Once the error is hit, fsync is not queued behind unfulfilled writes. Whether it can be considered a bug is debatable. Since there is already an error in one of the writes which was written behind, fsync should return the error. I am not sure whether it should wait till we try to flush _all_ the writes that were written behind. Any suggestions on what is the expected behaviour here?
>>
>> I think that it should wait for all pending writes. In the test case I used, all pending writes will fail the same way as the first one, but in other situations it's possible to have a write failing (for example due to a damaged block on disk) and following writes succeeding.
>>
>> From the man page of fsync:
>>
>>     fsync() transfers (flushes) all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).
>>
>> As I understand it, when fsync is received all queued writes must be sent to the device (regardless of whether a previous write has failed or not). It also says that the call blocks until the device has finished all the operations. However it's not clear to me how to control file consistency, because this allows some writes to succeed after a failed one.
>
> Though fsync doesn't wait on queued writes after a failure, the queued writes are flushed to disk even in the existing codebase. Can you file a bug to make fsync wait for completion of queued writes irrespective of whether flushing any of them failed or not? I'll send a patch to fix the issue.

I filed bug #1167793.

> Just to prioritise this, how important is the fix?

It seems to fail only on NetBSD. I'm not sure what priority it has. Emmanuel is trying to create a regression test for new patches that checks all tests in tests/basic, and tests/basic/ec/quota.t hits this issue. An alternative would be to temporarily remove or change this test to avoid the problem.

Xavi
Re: [Gluster-devel] Wrong behavior on fsync of md-cache ?
Xavier Hernandez xhernan...@datalab.es wrote:
> An alternative would be to temporarily remove or change this test to avoid the problem.

That would help on that test, but I suspect the same problem is responsible for other spurious failures.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: [Gluster-devel] EHT / DHT
> Are you referring to something else in your request? Meaning, you want /myfile, /dir1/myfile and /dir2/dir3/myfile to fall onto the same bricks/subvolumes and that perchance is what you are looking for?

That is EXACTLY what I am looking for. What are my chances?

BR
Jan
Re: [Gluster-devel] EHT / DHT
I think I have it. Unless I’m totally confused, I can hash ONLY on the filename with:

    glusterfs --volfile-server=a_server --volfile-id=a_volume \
        --xlator-option a_volume-dht.extra_hash_regex='.*[/]' \
        /a/mountpoint

Correct?

Jan

From: Jan H Holtzhausen j...@holtztech.info
Date: Tuesday 25 November 2014 at 9:06 PM
To: gluster-devel@gluster.org
Subject: Re: [Gluster-devel] EHT / DHT

> Are you referring to something else in your request? Meaning, you want /myfile, /dir1/myfile and /dir2/dir3/myfile to fall onto the same bricks/subvolumes and that perchance is what you are looking for?

That is EXACTLY what I am looking for. What are my chances?

BR
Jan
Re: [Gluster-devel] EHT / DHT
On 11/25/2014 02:28 PM, Jan H Holtzhausen wrote:
> I think I have it. Unless I’m totally confused, I can hash ONLY on the filename with:
>
>     glusterfs --volfile-server=a_server --volfile-id=a_volume \
>         --xlator-option a_volume-dht.extra_hash_regex='.*[/]' \
>         /a/mountpoint
>
> Correct?

The hash of a file does not include the full path; it is computed on the file name _only_. So no regex will work when the file name itself stays constant, like myfile. As Jeff explains, the option is really there to keep temporary parts of the name out of the hash computation (for rename optimization), e.g. myfile and myfile~ should evaluate to the same hash, so the regex strips the trailing '~' from the name. In this case you do not seem to have any temporary parts in the name, so I am not sure the above is the option you are looking for.

> Jan
>
> From: Jan H Holtzhausen j...@holtztech.info
> Date: Tuesday 25 November 2014 at 9:06 PM
> To: gluster-devel@gluster.org
> Subject: Re: [Gluster-devel] EHT / DHT
>
>> Are you referring to something else in your request? Meaning, you want /myfile, /dir1/myfile and /dir2/dir3/myfile to fall onto the same bricks/subvolumes and that perchance is what you are looking for?
>
> That is EXACTLY what I am looking for. What are my chances?

As far as I know, not much out of the box. As Jeff explained, the directory distribution/layout considers the GFID of the directory, hence each of the directories in the above example would/could get different ranges. The file name on the other hand remains constant (myfile), so its hash value remains the same, but because the distribution ranges differ per directory as above, it will land on different bricks and not the same one.

Out of curiosity, why is this functionality needed?

Shyam
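For illustration, a small self-contained sketch of the idea behind extra_hash_regex as described above. The convention that the part kept for hashing is the regex's first capture group is an assumption made only for this example, and toy_hash merely stands in for the real DHT hash function:

    #include <stdio.h>
    #include <string.h>
    #include <regex.h>

    /* toy hash standing in for the real DHT hash */
    static unsigned int toy_hash (const char *s, size_t len)
    {
            unsigned int h = 5381;
            for (size_t i = 0; i < len; i++)
                    h = h * 33 + (unsigned char) s[i];
            return h;
    }

    /* hash only the part of the name selected by the regex, so temporary
     * variants such as "myfile~" hash the same as "myfile" */
    static unsigned int hash_name (const char *name, const char *pattern)
    {
            regex_t    re;
            regmatch_t m[2];

            if (pattern && regcomp (&re, pattern, REG_EXTENDED) == 0) {
                    if (regexec (&re, name, 2, m, 0) == 0 && m[1].rm_so != -1) {
                            unsigned int h = toy_hash (name + m[1].rm_so,
                                                       (size_t) (m[1].rm_eo - m[1].rm_so));
                            regfree (&re);
                            return h;
                    }
                    regfree (&re);
            }
            return toy_hash (name, strlen (name));
    }

    int main (void)
    {
            const char *pattern = "^(.+)~$";   /* strip a trailing '~' before hashing */

            printf ("myfile  -> %u\n", hash_name ("myfile",  pattern));
            printf ("myfile~ -> %u\n", hash_name ("myfile~", pattern));  /* same value */
            return 0;
    }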
Re: [Gluster-devel] EHT / DHT
Hmm. Then something is wrong. If I upload 2 identical files with different paths, they only end up on the same server 1/4 of the time (I have 4 bricks). I’ll test the regex quickly.

BR
Jan

On 2014/11/25, 7:55 PM, Shyam srang...@redhat.com wrote:
> On 11/25/2014 02:28 PM, Jan H Holtzhausen wrote:
>> I think I have it. Unless I’m totally confused, I can hash ONLY on the filename with:
>>
>>     glusterfs --volfile-server=a_server --volfile-id=a_volume \
>>         --xlator-option a_volume-dht.extra_hash_regex='.*[/]' \
>>         /a/mountpoint
>>
>> Correct?
>
> The hash of a file does not include the full path; it is computed on the file name _only_. So no regex will work when the file name itself stays constant, like myfile. As Jeff explains, the option is really there to keep temporary parts of the name out of the hash computation (for rename optimization), e.g. myfile and myfile~ should evaluate to the same hash, so the regex strips the trailing '~' from the name. In this case you do not seem to have any temporary parts in the name, so I am not sure the above is the option you are looking for.
>
>> Jan
>>
>> From: Jan H Holtzhausen j...@holtztech.info
>> Date: Tuesday 25 November 2014 at 9:06 PM
>> To: gluster-devel@gluster.org
>> Subject: Re: [Gluster-devel] EHT / DHT
>>
>>> Are you referring to something else in your request? Meaning, you want /myfile, /dir1/myfile and /dir2/dir3/myfile to fall onto the same bricks/subvolumes and that perchance is what you are looking for?
>>
>> That is EXACTLY what I am looking for. What are my chances?
>
> As far as I know, not much out of the box. As Jeff explained, the directory distribution/layout considers the GFID of the directory, hence each of the directories in the above example would/could get different ranges. The file name on the other hand remains constant (myfile), so its hash value remains the same, but because the distribution ranges differ per directory as above, it will land on different bricks and not the same one.
>
> Out of curiosity, why is this functionality needed?
>
> Shyam
Re: [Gluster-devel] EHT / DHT
On 11/25/2014 03:11 PM, Jan H Holtzhausen wrote:
> STILL doesn’t work … exact same file ends up on 2 different bricks … I must be missing something.
> All I need is for:
> /directory1/subdirectory2/foo
> And
> /directory2/subdirectoryaaa999/foo
> To end up on the same brick….

This is not possible, which is what I was attempting to state in the previous mail. The regex filter is not for this purpose.

The hash is always based on the name of the file, but the location is based on the distribution/layout of the directory, which is different for each directory based on its GFID. So there are no options in the code to enable what you seek at present.

Why is this needed?

Shyam
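A toy model (deliberately not the real DHT code, names and arithmetic are illustrative) of why the same file name lands on the same brick only about 1/4 of the time with 4 bricks: the name hash is constant, but each directory shifts the hash-range assignment by its own amount, so the brick chosen differs per directory:

    #include <stdio.h>

    #define NBRICKS 4

    static unsigned int toy_hash (const char *s)
    {
            unsigned int h = 5381;
            while (*s)
                    h = h * 33 + (unsigned char) *s++;
            return h;
    }

    /* brick = (hash of name + per-directory rotation) modulo brick count;
     * the rotation stands in for the per-directory layout shift that DHT
     * derives from the directory (its path hash / GFID) at mkdir time */
    static int pick_brick (const char *dir, const char *name)
    {
            unsigned int rotation = toy_hash (dir) % NBRICKS;
            return (int) ((toy_hash (name) + rotation) % NBRICKS);
    }

    int main (void)
    {
            printf ("/directory1/subdirectory2/foo      -> brick %d\n",
                    pick_brick ("/directory1/subdirectory2", "foo"));
            printf ("/directory2/subdirectoryaaa999/foo -> brick %d\n",
                    pick_brick ("/directory2/subdirectoryaaa999", "foo"));
            return 0;
    }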
Re: [Gluster-devel] EHT / DHT
So in a distributed cluster, the GFID tells all bricks what a file's preceding directory structure looks like? Where the physical file is saved is a function of the filename ONLY. Therefore my requirement should be met by default, or am I being dense?

BR
Jan

On 2014/11/25, 8:15 PM, Shyam srang...@redhat.com wrote:
> On 11/25/2014 03:11 PM, Jan H Holtzhausen wrote:
>> STILL doesn’t work … exact same file ends up on 2 different bricks … I must be missing something.
>> All I need is for:
>> /directory1/subdirectory2/foo
>> And
>> /directory2/subdirectoryaaa999/foo
>> To end up on the same brick….
>
> This is not possible, which is what I was attempting to state in the previous mail. The regex filter is not for this purpose.
>
> The hash is always based on the name of the file, but the location is based on the distribution/layout of the directory, which is different for each directory based on its GFID. So there are no options in the code to enable what you seek at present.
>
> Why is this needed?
>
> Shyam
Re: [Gluster-devel] EHT / DHT
As to the why: filesystem cache hits. Files with the same name tend to be the same files.

Regards
Jan

On 2014/11/25, 8:42 PM, Jan H Holtzhausen j...@holtztech.info wrote:
> So in a distributed cluster, the GFID tells all bricks what a file's preceding directory structure looks like? Where the physical file is saved is a function of the filename ONLY. Therefore my requirement should be met by default, or am I being dense?
>
> BR
> Jan
>
> On 2014/11/25, 8:15 PM, Shyam srang...@redhat.com wrote:
>> On 11/25/2014 03:11 PM, Jan H Holtzhausen wrote:
>>> STILL doesn’t work … exact same file ends up on 2 different bricks … I must be missing something.
>>> All I need is for:
>>> /directory1/subdirectory2/foo
>>> And
>>> /directory2/subdirectoryaaa999/foo
>>> To end up on the same brick….
>>
>> This is not possible, which is what I was attempting to state in the previous mail. The regex filter is not for this purpose.
>>
>> The hash is always based on the name of the file, but the location is based on the distribution/layout of the directory, which is different for each directory based on its GFID. So there are no options in the code to enable what you seek at present.
>>
>> Why is this needed?
>>
>> Shyam
Re: [Gluster-devel] Single layout at root (Was EHT / DHT)
On Tue Nov 25 2014 at 1:28:59 PM Shyam srang...@redhat.com wrote:
> On 11/12/2014 01:55 AM, Anand Avati wrote:
>> On Tue, Nov 11, 2014 at 1:56 PM, Jeff Darcy jda...@redhat.com wrote:
>>> (Personally I would have done this by mixing in the parent GFID to the hash calculation, but that alternative was ignored.)
>>
>> Actually when DHT was implemented, the concept of GFID did not (yet) exist. Due to backward compatibility it has just remained this way even later. Including the GFID into the hash has benefits.
>
> I am curious here as this is interesting. So the layout start subvol assignment for a directory being based on its GFID was provided so that files with the same name distribute better, rather than ending up on the same bricks, right?

Right, for e.g. we wouldn't want all the README.txt files in the various directories of a volume to end up on the same server. The way it is achieved today is that the per-server hash-range assignment is rotated by a certain amount (how much it is rotated is determined by a separate hash on the directory path) at the time of mkdir.

> Instead, as we _now_ have GFID, we could use that together with the name to get a similar/better distribution, i.e. GFID+name to determine the hashed subvol.

What we could do now is include the parent directory gfid as an input into the DHT hash function. Today, we do approximately:

    int hashval   = dm_hash (readme.txt)
    hash_ranges[] = inode_ctx_get (parent_dir)
    subvol        = find_subvol (hash_ranges, hashval)

Instead, we could:

    int hashval   = new_hash (readme.txt, parent_dir.gfid)
    hash_ranges[] = global_value
    subvol        = find_subvol (hash_ranges, hashval)

> The idea here would be that on dentry creates we would need to generate the GFID ourselves and not let the bricks generate it, so that we can choose the subvol to wind the FOP to.

The GFID would be that of the parent (as an entry name is always in the context of a parent directory/inode). Also, the GFID for a new entry is already generated by the client; the brick does not generate a GFID.

> This eliminates the need for a layout per sub-directory and all the (interesting) problems that it comes with, and instead can be replaced by a layout at root. Not sure if it handles all use cases and paths that we have now (which needs more understanding). I do understand there is a backward compatibility issue here, but other than this, this sounds better than the current scheme, as there is a single layout to read/optimize/stash/etc. across clients.
>
> Can I understand the rationale of this better, as to what you folks are thinking? Am I missing something or over-reading the benefits that this can provide?

I think you understand it right. The benefit is that one could have a single hash layout for the entire volume, and the directory-specificness is implemented by including the directory gfid in the hash function. The way I see it, the compromise would be something like:

Pro per-directory range: By having per-directory hash ranges, we can do easier incremental rebalance. Partial progress is well tolerated and does not impact the entire volume. Only while a given directory is undergoing rebalance do we need to enter unhashed-lookup mode, and only for that directory.

Con per-directory range: Just the new hash assignment phase (to impact placement of new files/data, not move old data) is itself an extended process, crawling the entire volume with complex per-directory operations. The number of points in the system where things can break (i.e., result in overlaps and holes in ranges) is high.

Pro single layout with dir GFID in hash: Avoid the numerous parts (per-dir hash ranges) which can potentially break.

Con single layout with dir GFID in hash: Rebalance phase 1 (assigning a new layout) is atomic for the entire volume - unhashed lookup has to be on for all dirs for the entire period. To mitigate this, we could explore versioning the centralized hash ranges and storing the version used by each directory in its xattrs (updating the version as the rebalance progresses). But now we have more centralized metadata (may or may not be a worthy compromise - not sure).

In summary, including GFID into the hash calculation does open up interesting possibilities and is worthy of serious consideration.

HTH,
Avati
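One possible shape of the new_hash(name, parent_gfid) idea sketched above, shown as a self-contained toy: the parent directory's GFID is mixed into the hash input so a single volume-wide layout still spreads identically named files across bricks. The mixing scheme (prefixing the gfid bytes before the name) and the hash itself are purely illustrative:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define GFID_SIZE 16

    static uint32_t toy_hash (const unsigned char *data, size_t len)
    {
            uint32_t h = 2166136261u;              /* FNV-1a style toy hash */
            for (size_t i = 0; i < len; i++) {
                    h ^= data[i];
                    h *= 16777619u;
            }
            return h;
    }

    /* hash over parent_gfid || name; the result is then looked up in the
     * single global layout instead of a per-directory one */
    static uint32_t new_hash (const char *name, const unsigned char parent_gfid[GFID_SIZE])
    {
            unsigned char buf[GFID_SIZE + 256];
            size_t nlen = strlen (name);

            if (nlen > 256)
                    nlen = 256;
            memcpy (buf, parent_gfid, GFID_SIZE);
            memcpy (buf + GFID_SIZE, name, nlen);
            return toy_hash (buf, GFID_SIZE + nlen);
    }

    int main (void)
    {
            unsigned char dir1[GFID_SIZE] = { 0x01 };  /* two made-up directory GFIDs */
            unsigned char dir2[GFID_SIZE] = { 0x02 };

            /* same file name, different parent GFIDs -> different hash values */
            printf ("dir1/README.txt -> %u\n", (unsigned) new_hash ("README.txt", dir1));
            printf ("dir2/README.txt -> %u\n", (unsigned) new_hash ("README.txt", dir2));
            return 0;
    }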
Re: [Gluster-devel] Single layout at root (Was EHT / DHT)
On 11/25/2014 05:03 PM, Anand Avati wrote:
> On Tue Nov 25 2014 at 1:28:59 PM Shyam srang...@redhat.com wrote:
>> On 11/12/2014 01:55 AM, Anand Avati wrote:
>>> On Tue, Nov 11, 2014 at 1:56 PM, Jeff Darcy jda...@redhat.com wrote:
>>>> (Personally I would have done this by mixing in the parent GFID to the hash calculation, but that alternative was ignored.)
>>>
>>> Actually when DHT was implemented, the concept of GFID did not (yet) exist. Due to backward compatibility it has just remained this way even later. Including the GFID into the hash has benefits.
>>
>> I am curious here as this is interesting. So the layout start subvol assignment for a directory being based on its GFID was provided so that files with the same name distribute better, rather than ending up on the same bricks, right?
>
> Right, for e.g. we wouldn't want all the README.txt files in the various directories of a volume to end up on the same server. The way it is achieved today is that the per-server hash-range assignment is rotated by a certain amount (how much it is rotated is determined by a separate hash on the directory path) at the time of mkdir.
>
>> Instead, as we _now_ have GFID, we could use that together with the name to get a similar/better distribution, i.e. GFID+name to determine the hashed subvol.
>
> What we could do now is include the parent directory gfid as an input into the DHT hash function. Today, we do approximately:
>
>     int hashval   = dm_hash (readme.txt)
>     hash_ranges[] = inode_ctx_get (parent_dir)
>     subvol        = find_subvol (hash_ranges, hashval)
>
> Instead, we could:
>
>     int hashval   = new_hash (readme.txt, parent_dir.gfid)
>     hash_ranges[] = global_value
>     subvol        = find_subvol (hash_ranges, hashval)
>
>> The idea here would be that on dentry creates we would need to generate the GFID ourselves and not let the bricks generate it, so that we can choose the subvol to wind the FOP to.
>
> The GFID would be that of the parent (as an entry name is always in the context of a parent directory/inode). Also, the GFID for a new entry is already generated by the client; the brick does not generate a GFID.
>
>> This eliminates the need for a layout per sub-directory and all the (interesting) problems that it comes with, and instead can be replaced by a layout at root. Not sure if it handles all use cases and paths that we have now (which needs more understanding). I do understand there is a backward compatibility issue here, but other than this, this sounds better than the current scheme, as there is a single layout to read/optimize/stash/etc. across clients.
>>
>> Can I understand the rationale of this better, as to what you folks are thinking? Am I missing something or over-reading the benefits that this can provide?
>
> I think you understand it right. The benefit is that one could have a single hash layout for the entire volume, and the directory-specificness is implemented by including the directory gfid in the hash function. The way I see it, the compromise would be something like:
>
> Pro per-directory range: By having per-directory hash ranges, we can do easier incremental rebalance. Partial progress is well tolerated and does not impact the entire volume. Only while a given directory is undergoing rebalance do we need to enter unhashed-lookup mode, and only for that directory.
>
> Con per-directory range: Just the new hash assignment phase (to impact placement of new files/data, not move old data) is itself an extended process, crawling the entire volume with complex per-directory operations. The number of points in the system where things can break (i.e., result in overlaps and holes in ranges) is high.
>
> Pro single layout with dir GFID in hash: Avoid the numerous parts (per-dir hash ranges) which can potentially break.
>
> Con single layout with dir GFID in hash: Rebalance phase 1 (assigning a new layout) is atomic for the entire volume - unhashed lookup has to be on for all dirs for the entire period. To mitigate this, we could explore versioning the centralized hash ranges and storing the version used by each directory in its xattrs (updating the version as the rebalance progresses). But now we have more centralized metadata (may or may not be a worthy compromise - not sure).

Agreed, the auto-unhashed lookup would have to wait longer before being re-armed.

Just throwing some more thoughts on the same: unhashed-auto can also benefit from just linkto creations, rather than requiring a data rebalance (i.e. movement of data). So in phase-0 we could just create the linkto files and then turn on auto-unhashed, as lookups would find the (linkto) file.

Other abilities, like giving directories weighted layout ranges based on the size of bricks, could be affected, i.e. forcing a rebalance when a brick size is
Re: [Gluster-devel] EHT / DHT
Out of curiosity, what back end and deduplication solution are you using?

Regards,
Poornima

----- Original Message -----
From: Jan H Holtzhausen j...@holtztech.info
To: Anand Avati av...@gluster.org, Shyam srang...@redhat.com, gluster-devel@gluster.org
Sent: Wednesday, November 26, 2014 3:43:36 AM
Subject: Re: [Gluster-devel] EHT / DHT

> Yes, we have deduplication at the filesystem layer.
>
> BR
> Jan
>
> From: Anand Avati av...@gluster.org
> Date: Wednesday 26 November 2014 at 12:11 AM
> To: Jan H Holtzhausen j...@holtztech.info, Shyam srang...@redhat.com, gluster-devel@gluster.org
> Subject: Re: [Gluster-devel] EHT / DHT
>
> Unless there is some sort of de-duplication under the covers happening in the brick, or the files are hardlinks to each other, there is no cache benefit whatsoever by having identical files placed on the same server.
>
> Thanks,
> Avati
>
> On Tue Nov 25 2014 at 12:59:25 PM Jan H Holtzhausen j...@holtztech.info wrote:
>> As to the why: filesystem cache hits. Files with the same name tend to be the same files.
>>
>> Regards
>> Jan
>>
>> On 2014/11/25, 8:42 PM, Jan H Holtzhausen j...@holtztech.info wrote:
>>> So in a distributed cluster, the GFID tells all bricks what a file's preceding directory structure looks like? Where the physical file is saved is a function of the filename ONLY. Therefore my requirement should be met by default, or am I being dense?
>>>
>>> BR
>>> Jan
>>>
>>> On 2014/11/25, 8:15 PM, Shyam srang...@redhat.com wrote:
>>>> On 11/25/2014 03:11 PM, Jan H Holtzhausen wrote:
>>>>> STILL doesn’t work … exact same file ends up on 2 different bricks … I must be missing something.
>>>>> All I need is for:
>>>>> /directory1/subdirectory2/foo
>>>>> And
>>>>> /directory2/subdirectoryaaa999/foo
>>>>> To end up on the same brick….
>>>>
>>>> This is not possible, which is what I was attempting to state in the previous mail. The regex filter is not for this purpose.
>>>>
>>>> The hash is always based on the name of the file, but the location is based on the distribution/layout of the directory, which is different for each directory based on its GFID. So there are no options in the code to enable what you seek at present.
>>>>
>>>> Why is this needed?
>>>>
>>>> Shyam
Re: [Gluster-devel] EHT / DHT
I could tell you… But Symantec wouldn't like it…

From: Poornima Gurusiddaiah pguru...@redhat.com
Date: Wednesday 26 November 2014 at 7:16 AM
To: Jan H Holtzhausen j...@holtztech.info
Cc: gluster-devel@gluster.org
Subject: Re: [Gluster-devel] EHT / DHT

> Out of curiosity, what back end and deduplication solution are you using?
>
> Regards,
> Poornima
>
> From: Jan H Holtzhausen j...@holtztech.info
> To: Anand Avati av...@gluster.org, Shyam srang...@redhat.com, gluster-devel@gluster.org
> Sent: Wednesday, November 26, 2014 3:43:36 AM
> Subject: Re: [Gluster-devel] EHT / DHT
>
>> Yes, we have deduplication at the filesystem layer.
>>
>> BR
>> Jan
>>
>> From: Anand Avati av...@gluster.org
>> Date: Wednesday 26 November 2014 at 12:11 AM
>> To: Jan H Holtzhausen j...@holtztech.info, Shyam srang...@redhat.com, gluster-devel@gluster.org
>> Subject: Re: [Gluster-devel] EHT / DHT
>>
>> Unless there is some sort of de-duplication under the covers happening in the brick, or the files are hardlinks to each other, there is no cache benefit whatsoever by having identical files placed on the same server.
>>
>> Thanks,
>> Avati
>>
>> On Tue Nov 25 2014 at 12:59:25 PM Jan H Holtzhausen j...@holtztech.info wrote:
>>> As to the why: filesystem cache hits. Files with the same name tend to be the same files.
>>>
>>> Regards
>>> Jan
>>>
>>> On 2014/11/25, 8:42 PM, Jan H Holtzhausen j...@holtztech.info wrote:
>>>> So in a distributed cluster, the GFID tells all bricks what a file's preceding directory structure looks like? Where the physical file is saved is a function of the filename ONLY. Therefore my requirement should be met by default, or am I being dense?
>>>>
>>>> BR
>>>> Jan
>>>>
>>>> On 2014/11/25, 8:15 PM, Shyam srang...@redhat.com wrote:
>>>>> On 11/25/2014 03:11 PM, Jan H Holtzhausen wrote:
>>>>>> STILL doesn’t work … exact same file ends up on 2 different bricks … I must be missing something.
>>>>>> All I need is for:
>>>>>> /directory1/subdirectory2/foo
>>>>>> And
>>>>>> /directory2/subdirectoryaaa999/foo
>>>>>> To end up on the same brick….
>>>>>
>>>>> This is not possible, which is what I was attempting to state in the previous mail. The regex filter is not for this purpose.
>>>>>
>>>>> The hash is always based on the name of the file, but the location is based on the distribution/layout of the directory, which is different for each directory based on its GFID. So there are no options in the code to enable what you seek at present.
>>>>>
>>>>> Why is this needed?
>>>>>
>>>>> Shyam