Re: [Gluster-devel] AFR conservative merge portability
On 12/18/2014 01:28 PM, Emmanuel Dreyfus wrote:
> On Mon, Dec 15, 2014 at 03:21:24PM -0500, Jeff Darcy wrote:
> > Is there *any* case, not even necessarily involving conservative merge, where it would be harmful to propagate the latest ctime/mtime for any replica of a directory?
>
> In case of conservative merge, the problem vanishes on its own anyway: adding entries updates the parent directory's ctime/mtime, and the reported split-brain no longer exists. Here is a first attempt, please comment: http://review.gluster.org/9291

Hi Emmanuel,

So we (the AFR team) had a discussion and came up with two things that need to be done w.r.t. this issue:

1. First, in metadata heal, if the metadata split-brain is only due to [am]time, heal the file choosing as source the replica having the maximum atime/mtime.

2. Currently in entry self-heal, after a conservative merge, the directory's timestamp is updated using the time when the self-heal happened, and not that of the directories on the bricks. This needs to be changed to use the timestamp of the source having the maximum mtime, similar to what data self-heal does in afr_selfheal_data_restore_time().

Point #1 would be addressed by your patch with some modifications (pending review); that just leaves #2 to be done.

Thanks,
Ravi

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
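As a rough sketch of point #1 (illustrative only; AFR's actual metadata heal works on iatt structures and changelog xattrs, and the field names below are hypothetical), choosing the heal source as the replica with the newest [am]time could look like:

```python
def pick_heal_source(replicas):
    """Pick the replica whose copy has the newest mtime (atime breaks ties).

    `replicas` is a list of dicts with hypothetical 'mtime'/'atime' fields,
    standing in for per-brick stat data.
    """
    return max(replicas, key=lambda r: (r["mtime"], r.get("atime", 0)))
```

The winning replica's times would then be propagated to the other replicas, which also makes an [am]time-only split-brain disappear.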
Re: [Gluster-devel] Readdir d_off encoding
On 12/17/2014 05:04 AM, Xavier Hernandez wrote:
> Just to consider all possibilities... The current architecture needs to create the whole directory structure on all bricks, and has the big problem that each directory on each brick will store the files in a different order and with different d_off values.

I gather that this is when EC or AFR is in place, as for DHT a file is on one brick only.

> This is a serious scalability issue and has many inconveniences when trying to heal or detect inconsistencies between bricks (basically we would need to read the full directory contents of each brick to compare them).

I am not quite familiar with EC, so pardon the ignorance: why/how does d_off play a role in this healing/crawling?

> An alternative would be to convert directories into regular files from the brick point of view. The benefits of this would be:
>
> * d_off would be controlled by gluster, so all bricks would have the same d_off and order. No need for any d_off mapping or transformation.
> * Directories could take advantage of replication and disperse self-heal procedures. They could be treated as files and be healed more easily. A corrupted brick would not produce invalid directory contents, and file duplication in directory listings would be avoided.
> * Many of the complexities in DHT, AFR and EC to manage directories would be removed.
>
> The main issue could be the need for an upper-level xlator that would transform directory requests into file modifications and would be responsible for all d_off assignment and directory manipulation (renames, links, unlinks, ...).

This is tending towards some thoughts for Gluster 4.0, and specifically DHT in 4.0. I am going to wait for the same/similar comments as we discuss those specifics (hopefully published before Christmas (2014)).
Xavi

On 12/16/2014 03:06 AM, Anand Avati wrote:

Replies inline.

On Mon Dec 15 2014 at 12:46:41 PM Shyam srang...@redhat.com wrote:
> With the changes present in [1] and [2], a short explanation of the change would be: we encode the subvol ID in the d_off, losing n + 1 bits in case the high-order n+1 bits of the d_off returned by the underlying xlator are not free. (Best to read the commit message for [1] :) )
>
> Although not related to the latest patch, here is something to consider for the future: we now have DHT, AFR, EC(?) and DHT over DHT (Tier), which all need subvol encoding in the returned readdir offset. Due to this, the loss in bits _may_ cause unwanted offset behavior when used in the current scheme, as we would end up eating more bits than we do at present. Or IOW, we could be invalidating the assumption that both EXT4/XFS are tolerant in terms of the accuracy of the value presented back in seekdir().

XFS has not been a problem, since it always returns 32-bit d_off. With Ext4, it has been noted that it is tolerant to sacrificing the lower bits in accuracy, i.e. a seekdir(val) actually seeks to the entry which has the closest true offset.

> Should we reconsider an in-memory _cookie_-like approach that can help in this case? It would invalidate (some or all, based on the implementation) the following constraints that the current design resolves (from [1]):
>
> - Nothing to remember in memory or evict old entries.
> - Works fine across NFS server reboots and also NFS head failover.
> - Tolerant to seekdir() to arbitrary locations.
>
> But it would provide a more reliable readdir offset for use (when valid and not evicted, say). How would NFS adapt to this? Does Ganesha need a better scheme when doing multi-head NFS failover?

Ganesha just offloads the responsibility to the FSAL layer to give stable dir cookies (as it rightly should).

> Thoughts?

I think we need to analyze the actual assumption/problem here.
Remembering things in memory comes with the limitations you note above, and may, after all, still not be necessary. Let's look at the two approaches taken:

- Small backend offsets: like XFS, the offsets fit in 32 bits, and we are left with another 32 bits of freedom to encode what we want. There is no problem here until our nested encoding requirements cross 32 bits of space. So let's ignore this for now.

- Large backend offsets: Ext4 being the primary target. Here we observe that the backend filesystem is tolerant to sacrificing the accuracy of the lower bits. So we overwrite the lower bits with our subvolume encoding information, and the number of bits used to encode is implicit in the subvolume cardinality of that translator. While this works fine with a single transformation, it is clearly a problem when the transformation is nested with the same algorithm. The reason is quite simple: while the lower bits were disposable when the cookie was taken fresh from Ext4, once transformed the same lower bits are now holy and cannot be overwritten carelessly, at least not without dire consequences. The higher level
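As a minimal sketch of the lower-bit transformation described above (not GlusterFS's actual itransform code; the helper names are illustrative), encoding a subvolume index into the low bits of an Ext4 d_off could look like:

```python
def encode_doff(doff, subvol_id, n_subvols):
    """Overwrite the low bits of a backend d_off with a subvolume index.

    The number of bits consumed is implied by the subvolume count, as in
    the scheme described above.
    """
    bits = max(1, (n_subvols - 1).bit_length())
    mask = (1 << bits) - 1
    return (doff & ~mask) | subvol_id

def decode_doff(doff, n_subvols):
    """Recover (approximate backend offset, subvolume index).

    The offset comes back with its low bits zeroed; this relies on Ext4
    tolerating a seekdir() to the closest true offset -- exactly the
    assumption that breaks once the transformation is nested.
    """
    bits = max(1, (n_subvols - 1).bit_length())
    mask = (1 << bits) - 1
    return doff & ~mask, doff & mask
```

Nesting this transform (e.g. AFR under DHT under Tier) clobbers low bits that the inner transform already claimed, which is the problem being discussed.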
Re: [Gluster-devel] Updates to operating-version
James, why not just compute the operating version? After 3.5.0 it's always XYYZZ based on the version. Something along the lines of:

    $version_array = split("${gluster_version}", '[.]')
    if $version_array[0] < 3 {
      fail('Unsupported GlusterFS Version')
    }
    $operating_version = $version_array[1] ? {
      '4'     => '2',
      '5'     => $version_array[2] ? {
        '0'     => '3',
        default => sprintf('%d%02d%02d', $version_array[0], $version_array[1], $version_array[2]),
      },
      default => sprintf('%d%02d%02d', $version_array[0], $version_array[1], $version_array[2]),
    }

Perhaps a CLI command to fetch GD_OP_VERSION_MAX might be beneficial as well.

On 12/17/2014 11:30 PM, Kaushal M wrote:
> In that case, I should send a note, as the op-version has been bumped for the master branch. Please take note: the operating-version for the master branch has been bumped to '30700', which is aligned with the next release of GlusterFS, 3.7.
>
> ~kaushal
>
> On Thu, Dec 18, 2014 at 12:49 PM, Lalatendu Mohanty lmoha...@redhat.com wrote:
>> On 12/17/2014 07:39 PM, Niels de Vos wrote:
>>> On Wed, Dec 17, 2014 at 08:40:18AM -0500, James wrote:
>>>> Hello,
>>>> If you plan on updating the operating-version value of GlusterFS, please either ping me (@purpleidea) or send a patch to puppet-gluster [1]. Patches are 4-line yaml files, and you don't need any knowledge of puppet or yaml to do so. Example:
>>>>
>>>>     +# gluster/data/versions/3.6.yaml
>>>>     +---
>>>>     +gluster::versions::operating_version: '30600' # v3.6.0
>>>>     +# vim: ts=8
>>>>
>>>> As seen at: https://github.com/purpleidea/puppet-gluster/commit/43c60d2ddd6f57d2117585dc149de6653bdabd4b#diff-7cb3f60a533975d869ffd4a772d66cfeR1
>>>>
>>>> Thanks for your cooperation! This will ensure puppet-gluster can always correctly work with new versions of GlusterFS.
>>>
>>> How about you post a patch that adds this request as a comment in the glusterfs sources (libglusterfs/src/globals.h)? Or maybe this should be noted on some wiki page, and have the comment point to the wiki instead. Maybe other projects will start to use the op-version in the future too, and they would also need to be informed about a change.
IMO we should make it a practice to send a mail to gluster-devel whenever a patch is sent to increase the operating-version, similar to the practice Fedora follows for an .so version bump.

-Lala
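For reference, the computation Joe describes can be sketched in Python (a best-effort reconstruction of the historical mapping; the authoritative values live in libglusterfs/src/globals.h):

```python
def operating_version(version):
    """Map a GlusterFS X.Y.Z release string to its op-version string.

    Assumes the mapping discussed in this thread: 3.4.x -> '2',
    3.5.0 -> '3', and every later release -> 'XYYZZ'.
    """
    major, minor, patch = (int(x) for x in version.split("."))
    if major < 3:
        raise ValueError("Unsupported GlusterFS version")
    if (major, minor) == (3, 4):
        return "2"
    if (major, minor, patch) == (3, 5, 0):
        return "3"
    return "%d%02d%02d" % (major, minor, patch)
```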
Re: [Gluster-devel] Updates to operating-version
On Wed, Dec 17, 2014 at 9:09 AM, Niels de Vos nde...@redhat.com wrote:
> On Wed, Dec 17, 2014 at 08:40:18AM -0500, James wrote:
> How about you post a patch that adds this request as a comment in the glusterfs sources (libglusterfs/src/globals.h)?

Good idea actually... Please review/ack/merge :)
http://review.gluster.org/#/c/9301/

> Or, maybe this should be noted on some wiki page,

Already updated the wiki yesterday...
https://www.gluster.org/community/documentation/index.php/OperatingVersions

> and have the comment point to the wiki instead. Maybe other projects start to use the op-version in future too, and they also need to get informed about a change.

If that becomes the case, we can change this :) See a comment about this in my next email...

Thanks!
James

> Thanks, Niels
Re: [Gluster-devel] Updates to operating-version
On Thu, Dec 18, 2014 at 11:40 AM, Joe Julian j...@julianfamily.org wrote:
> James, why not just compute the operating version? After 3.5.0 it's always XYYZZ based on the version. Something along the lines of:
>
>     $version_array = split("${gluster_version}", '[.]')
>     if $version_array[0] < 3 {
>       fail('Unsupported GlusterFS Version')
>     }
>     $operating_version = $version_array[1] ? {
>       '4'     => '2',
>       '5'     => $version_array[2] ? {
>         '0'     => '3',
>         default => sprintf('%d%02d%02d', $version_array[0], $version_array[1], $version_array[2]),
>       },
>       default => sprintf('%d%02d%02d', $version_array[0], $version_array[1], $version_array[2]),
>     }
>
> Perhaps a CLI command to fetch GD_OP_VERSION_MAX might be beneficial as well.

This is a very good point actually... In fact, it begs the question: if it can be computed from the version string, why doesn't GlusterFS do this internally in libglusterfs/src/globals.h? I'm guessing perhaps there's a reason your computation isn't always correct... Since I can't be sure that's not the case, I figured I'd just match whatever Gluster is doing by actually storing the values in a yaml (hiera) table. For now I think it's fine, but if someone has better information, lmk!

On 12/17/2014 11:30 PM, Kaushal M wrote:
> In that case, I should send a note, as the op-version has been bumped for the master branch. Please take note: the operating-version for the master branch has been bumped to '30700', which is aligned with the next release of GlusterFS, 3.7.
>
> ~kaushal
>
> On Thu, Dec 18, 2014 at 12:49 PM, Lalatendu Mohanty lmoha...@redhat.com wrote:
>> On 12/17/2014 07:39 PM, Niels de Vos wrote:
>>> On Wed, Dec 17, 2014 at 08:40:18AM -0500, James wrote:
>>>> Hello,
>>>> If you plan on updating the operating-version value of GlusterFS, please either ping me (@purpleidea) or send a patch to puppet-gluster [1]. Patches are 4-line yaml files, and you don't need any knowledge of puppet or yaml to do so.
>>>> Example:
>>>>
>>>>     +# gluster/data/versions/3.6.yaml
>>>>     +---
>>>>     +gluster::versions::operating_version: '30600' # v3.6.0
>>>>     +# vim: ts=8
>>>>
>>>> As seen at: https://github.com/purpleidea/puppet-gluster/commit/43c60d2ddd6f57d2117585dc149de6653bdabd4b#diff-7cb3f60a533975d869ffd4a772d66cfeR1
>>>>
>>>> Thanks for your cooperation! This will ensure puppet-gluster can always correctly work with new versions of GlusterFS.
>>>
>>> How about you post a patch that adds this request as a comment in the glusterfs sources (libglusterfs/src/globals.h)? Or maybe this should be noted on some wiki page, and have the comment point to the wiki instead. Maybe other projects will start to use the op-version in the future too, and they would also need to be informed about a change.
>>
>> IMO we should make it a practice to send a mail to gluster-devel whenever a patch is sent to increase the operating-version, similar to the practice Fedora follows for an .so version bump.
>>
>> -Lala
Re: [Gluster-devel] Updates to operating-version
On Thu, Dec 18, 2014 at 2:30 AM, Kaushal M kshlms...@gmail.com wrote:
> In that case, I should send a note, as the op-version has been bumped for the master branch. Please take note: the operating-version for the master branch has been bumped to '30700', which is aligned with the next release of GlusterFS, 3.7.

Cool, thanks. For reference, the four-line patch looks like:
https://github.com/purpleidea/puppet-gluster/commit/c2291084cf818d0058a66dcbc0984bcea7b51252
and is now in git master. Future patches are welcome :)

Cheers,
James
Re: [Gluster-devel] Volume management proposal (4.0)
It seems simplest to store child-parent relationships (one-to-one) instead of parent-child relationships (one-to-many). Based on that, I looked at some info files and saw that we're already using parent_volname for snapshot stuff. Maybe we need to change terminology. Let's say that we use part-of in the info file.

The above persistence scheme makes querying for the volumes affected by a change to a given volume linear in the length of the path from the given volume to the primary volume, the 'root', in the graph of volumes. The alternative would involve going through every volume to check whether the changed volume affects it, which is linear in the number of volumes in the cluster. This computational complexity makes me favour storing child-parent relationships in the secondary volumes. The only downside is that we need to 'lock down' secondary volumes from being modified. I don't have a way to measure (yet) the effect this would have on (concurrent) modifications of the secondary volumes of a given primary volume.

> * Create a new string-valued glusterd_volinfo_t.part_of field.
> * This gets filled in from glusterd_store_update_volinfo along with everything else from the info file.
> * When a composite volume is created, its component volumes' info files are rewritten.
> * When a component volume is modified, use the part_of field to find its parent. We then generate the fully-resolved client volfiles before and after the change and compare for differences.
> * If we find differences in the parent, process the change as though it had been made on the parent (triggering graph switches etc.) and then use the parent's part_of field to repeat the process one level up.
>
> I don't think we need to do anything for server-side-only changes, since those will already be handled (e.g. starting new bricks) by the existing infrastructure. However, things like NFS and quotad might need to go through the same process outlined above for clients.

This makes sense.
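The linear-in-path-length lookup argued for above can be sketched as follows (illustrative only; `part_of` maps each volume to its parent, with None for a primary volume):

```python
def affected_ancestors(part_of, changed):
    """Collect every volume affected by a change to `changed` by walking
    child->parent links up to the primary volume.

    Cost is linear in the depth of the chain; the parent-child
    alternative would require scanning every volume in the cluster.
    """
    chain = []
    parent = part_of[changed]
    while parent is not None:
        chain.append(parent)
        parent = part_of[parent]
    return chain
```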
Re: [Gluster-devel] GlusterFS Volume backup API
Few concerns inline. JOE

----- Original Message -----
From: Aravinda avish...@redhat.com
To: gluster Devel gluster-devel@gluster.org
Sent: Thursday, December 18, 2014 10:38:20 PM
Subject: [Gluster-devel] GlusterFS Volume backup API

Hi,

Today we discussed the GlusterFS backup API; our plan is to provide a tool/API to get the list of changed files (full/incremental).

Participants: Me, Kotresh, Ajeet, Shilpa

Thanks to Paul Cuzner for providing inputs about the pre and post hooks available in backup utilities like NetBackup.

Initial draft:
==============

Case 1 - Registered Consumer
----------------------------
The consumer application has to register by giving a session name:

    glusterbackupapi register sessionname host volume

When the following command is run for the first time, it will do a full scan; from then on it does incrementals. The start time for an incremental is the last backup time; the end time is the current time.

    glusterbackupapi sessionname --out-file=out.txt

--out-file is an optional argument; the default output file name is `output.txt`. The output file will have file paths.

Case 2 - Unregistered Consumer
------------------------------
Start time and end time information will not be remembered; every time, the consumer has to send the start and end time if incremental.

For a full backup:

    glusterbackupapi full host volume --out-file=out.txt

For an incremental backup:

    glusterbackupapi inc host volume STARTTIME ENDTIME --out-file=out.txt

where STARTTIME and ENDTIME are in unix timestamp format.

Technical overview
==================
1. Using the host and volume name arguments, it fetches volume info and volume status to get the list of up bricks/nodes.
2. Executes a brick/node agent to get the required details from each brick. (TBD: communication via RPC/SSH/gluster system:: execute)
3. If a full scan, the brick/node agent gets the list of files from that brick backend and generates the output file.
4. If incremental, it calls the Changelog History API, gets the distinct GFIDs list and then converts each GFID to a path.
5. The generated output files from each brick node will be copied to the initiator node.
6.
Merges all the output files from the bricks and removes duplicates.
7. In case of session-based access, session information will be saved by each brick/node agent.

Issues/Challenges
=================
1. Timestamps may differ between gluster nodes. We are assuming that within a cluster the TS will remain the same.
2. If a brick is down, how do we handle it? We are assuming all bricks should be up to initiate a backup (at least one from each replica).
3. If the changelog is not available, or broken between the start and end time, how do we get the incremental file list? As a prerequisite, the changelog should be enabled before backup.

JOE: Performance overhead on the IO path when the changelog is switched on. I think getting numbers or a performance matrix here would be crucial, as it is not desirable to sacrifice file IO performance to support a backup API or any data maintenance activity.

4. GFID-to-path conversion, using `find -samefile` or using the `glusterfs.pathinfo` xattr on an aux-gfid mount.
5. Deleted files: if we get the GFID of a deleted file from the changelog, how do we find its path? Does the backup API require the list of deleted files?

JOE:
1) find would not be a good option here, as you have to traverse the whole namespace. It takes a toll on spindle-based media.
2) The glusterfs.pathinfo xattr is a feasible approach but has its own problems:
   a. This xattr comes only with quota, so you need to decouple it from quota.
   b. This xattr should be enabled from the beginning of the namespace, i.e. if enabled later you will have some files which have this xattr and some which don't. This issue is true for any metadata-storing approach in gluster, e.g. DB, changelog etc.
   c. I am not sure if this xattr has support for multiple hard links. I am not sure if you (the backup scenario) would require it or not; just food for thought.
   d. This xattr is not crash-consistent across power failures. That means you may end up in a state where a few inodes have the xattr and a few don't.
3) Agree with the delete problem.
This problem gets worse with multiple hard links, if some hard links are recorded and a few are not.

6. Storing session info on each brick node.
7. Communication channel between nodes: RPC/SSH/gluster system:: execute... etc?

Kotresh, Ajeet, please add if I missed any points.

--
regards
Aravinda
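Step 6 of the technical overview (merging the per-brick output files and removing duplicates, since the same path can be reported by several bricks) could be as simple as:

```python
def merge_brick_outputs(per_brick_paths):
    """Merge per-brick path lists into one sorted, de-duplicated list.

    `per_brick_paths` is a list of path lists, one per brick agent;
    the shape is assumed for illustration.
    """
    merged = set()
    for paths in per_brick_paths:
        merged.update(paths)
    return sorted(merged)
```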
Re: [Gluster-devel] AFR conservative merge portability
On 12/19/2014 08:23 AM, Emmanuel Dreyfus wrote:
> Ravishankar N ravishan...@redhat.com wrote:
> > Point #1 would be addressed by your patch with some modifications (pending review);
>
> I addressed the points you raised, but now my patch is failing the newly introduced ./tests/bugs/afr-quota-xattr-mdata-heal.t. See there:
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/3215/console
> Some help would be welcome on that front.

There seems to be one more catch. afr_is_dirtime_splitbrain() only compares equality of type, gfid, mode, uid and gid. We need to check whether application-set xattrs are equal as well:

    mkdir /mnt/dir
    kill brick0
    setfattr -n user.attr1 -v value1 /mnt/dir
    kill brick1, bring up brick0
    sleep 10
    touch /mnt/dir
    bring both bricks up

Now metadata heal mustn't be triggered.

-Ravi
[Gluster-devel] Snapshot and Data Tiering
Hi All,

These are the MoM of the snapshot and data tiering interop meeting (apologies for the late update):

1) USS should not have problems with the changes made in DHT (DHT over DHT), as the USS xlator sits above DHT.

2) With the introduction of the heat-capturing DB, we have a few things to take care of when a snapshot of the brick is taken:

a. Location of the sqlite3 files: today the sqlite3 files by default reside in the brick (brick_path/.glusterfs/); this makes taking a snapshot of the DB easy, as it is done via LVM along with the brick. If the location is outside the brick (which is configurable, e.g. keeping all the DB files on SSD for better performance), then while taking a snapshot glusterd needs to take a manual backup of these files, which would take some time and the gluster CLI would time out. So for the first cut we will keep the DB files in the brick itself, until we have a solution for the CLI timeout.

b. Type of the database: for the first cut we are considering only sqlite3, and sqlite3 works excellently with LVM snapshots. If a new DB type like leveldb is introduced in the future, we need to investigate its compatibility with LVM snapshots, and this might be a deciding factor for having such a DB type in gluster.

c. Checkpointing the sqlite3 DB: before taking a snapshot, glusterd should issue a checkpoint command to the sqlite3 DB to flush all the DB cache onto disk.

Action items on the data tiering team:
1) Give the time taken to do so, i.e. the checkpointing time.
2) Provide a generic API in libgfdb to do so, OR handle the CTR xlator notification from glusterd to do the checkpointing.

Action item on the snapshot team:
1) Provide hooks to call the generic API, OR do the brick-ops to notify the CTR xlator.

d. Snapshot-aware bricks: for a brick belonging to a snapshot, the CTR xlator should not record reads (which come from USS).
Solutions:
1) Send the CTR xlator a notification after the snapshot brick is started, to turn off recording.
2) OR, while the snapshot brick is started by glusterd, pass an option marking the brick as part of a snapshot. This is the more generic solution.

3) The snapshot restore problem: when a snapshot is restored,

1) it will bring the volume to the point-in-time state. For example, the current state of the volume is: HOT tier has 50% of the data, COLD tier has 50% of the data; and the snapshot has the volume in the state: HOT tier has 20% of the data, COLD tier has 80% of the data. A restore will bring the volume to HOT: 20%, COLD: 80%, i.e. it will undo all the promotions and demotions. This should be mentioned in the documentation.

2) In addition, since the restored DB has times recorded in the past, files that were considered HOT in the past are now COLD. This would move all the data to the COLD tier if a data tiering scanner runs after the restore of the snapshot. It should be recorded in the documentation, as a recommendation, not to run the data tiering scanner immediately after a restore of a snapshot. The system should be given time to learn the new heat patterns; the learning time depends on the nature of the workload.

4) During a data tiering activity, snapshot activities like create/restore should be disabled, just as is done during the adding and removing of bricks, which leads to a rebalance.

Let me know if anything is missing or any corrections are required.

Regards,
Joe
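Item 2c above (checkpointing sqlite3 before the LVM snapshot) can be illustrated with a small sketch. This is not the libgfdb/glusterd code, just the underlying sqlite3 mechanism, assuming the DB runs in WAL mode:

```python
import sqlite3

def checkpoint_db(db_path):
    """Flush sqlite3 WAL contents into the main database file so an
    LVM snapshot of the brick captures a consistent DB state."""
    conn = sqlite3.connect(db_path)
    try:
        # FULL blocks until all WAL frames are written back to the DB file;
        # the pragma returns a (busy, wal_pages, checkpointed) row
        busy, wal_pages, checkpointed = conn.execute(
            "PRAGMA wal_checkpoint(FULL);").fetchone()
        return busy == 0  # 0 means the checkpoint ran to completion
    finally:
        conn.close()
```

Measuring how long this call takes on a loaded brick would give the checkpointing time requested in the action items.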
Re: [Gluster-devel] 3.6.1 issue
On Tuesday 16 December 2014 10:59 PM, David F. Robinson wrote:
> Gluster 3.6.1 seems to be having an issue creating symbolic links. To reproduce this issue, I downloaded the file dakota-6.1-public.src_.tar.gz from https://dakota.sandia.gov/download.html
>
>     # gunzip dakota-6.1-public.src_.tar.gz
>     # tar -xf dakota-6.1-public.src_.tar
>     # cd dakota-6.1.0.src/examples/script_interfaces/TankExamples/DakotaList
>     # ls -al
>
> ### Results from my old storage system (non-gluster)
>
>     corvidpost5:TankExamples/DakotaList> ls -al
>     total 12
>     drwxr-x--- 2 dfrobins users  112 Dec 16 12:12 ./
>     drwxr-x--- 6 dfrobins users  117 Dec 16 12:12 ../
>     lrwxrwxrwx 1 dfrobins users   25 Dec 16 12:12 EvalTank.py -> ../tank_model/EvalTank.py
>     lrwxrwxrwx 1 dfrobins users   24 Dec 16 12:12 FEMTank.py -> ../tank_model/FEMTank.py
>     -rwx--x--- 1 dfrobins users  734 Nov  7 11:05 RunTank.sh*
>     -rw------- 1 dfrobins users 1432 Nov  7 11:05 dakota_PandL_list.in
>     -rw------- 1 dfrobins users 1860 Nov  7 11:05 dakota_Ponly_list.in
>
> ### Results from gluster (broken links that have no permissions)
>
>     corvidpost5:TankExamples/DakotaList> ls -al
>     total 5
>     drwxr-x--- 2 dfrobins users  166 Dec 12 08:43 ./
>     drwxr-x--- 6 dfrobins users  445 Dec 12 08:43 ../
>     ---------- 1 dfrobins users    0 Dec 12 08:43 EvalTank.py
>     ---------- 1 dfrobins users    0 Dec 12 08:43 FEMTank.py
>     -rwx--x--- 1 dfrobins users  734 Nov  7 11:05 RunTank.sh*
>     -rw------- 1 dfrobins users 1432 Nov  7 11:05 dakota_PandL_list.in
>     -rw------- 1 dfrobins users 1860 Nov  7 11:05 dakota_Ponly_list.in
>
> ===
> David F. Robinson, Ph.D.
> President - Corvid Technologies
> 704.799.6944 x101 [office]
> 704.252.1310 [cell]
> 704.799.7974 [fax]
> david.robin...@corvidtec.com
> http://www.corvidtechnologies.com

Hi David,

Can you please provide the log files? You can find them in /var/log/glusterfs.
Regards,
Raghavendra Bhat