Re: [Gluster-devel] 3.6.1 issue

2014-12-18 Thread Raghavendra Bhat

On Tuesday 16 December 2014 10:59 PM, David F. Robinson wrote:
Gluster 3.6.1 seems to be having an issue creating symbolic links.  To 
reproduce this issue, I downloaded the file 
dakota-6.1-public.src_.tar.gz from

https://dakota.sandia.gov/download.html
# gunzip dakota-6.1-public.src_.tar.gz
# tar -xf dakota-6.1-public.src_.tar
# cd dakota-6.1.0.src/examples/script_interfaces/TankExamples/DakotaList
# ls -al
### Results from my old storage system (non-gluster)
corvidpost5:TankExamples/DakotaList> ls -al
total 12
drwxr-x--- 2 dfrobins users  112 Dec 16 12:12 ./
drwxr-x--- 6 dfrobins users  117 Dec 16 12:12 ../
lrwxrwxrwx 1 dfrobins users   25 Dec 16 12:12 EvalTank.py -> ../tank_model/EvalTank.py*
lrwxrwxrwx 1 dfrobins users   24 Dec 16 12:12 FEMTank.py -> ../tank_model/FEMTank.py*
-rwx--x--- 1 dfrobins users  734 Nov  7 11:05 RunTank.sh*
-rw------- 1 dfrobins users 1432 Nov  7 11:05 dakota_PandL_list.in
-rw------- 1 dfrobins users 1860 Nov  7 11:05 dakota_Ponly_list.in

### Results from gluster (broken links that have no permissions)
corvidpost5:TankExamples/DakotaList> ls -al
total 5
drwxr-x--- 2 dfrobins users  166 Dec 12 08:43 ./
drwxr-x--- 6 dfrobins users  445 Dec 12 08:43 ../
---------- 1 dfrobins users    0 Dec 12 08:43 EvalTank.py
---------- 1 dfrobins users    0 Dec 12 08:43 FEMTank.py
-rwx--x--- 1 dfrobins users  734 Nov  7 11:05 RunTank.sh*
-rw------- 1 dfrobins users 1432 Nov  7 11:05 dakota_PandL_list.in
-rw------- 1 dfrobins users 1860 Nov  7 11:05 dakota_Ponly_list.in
===
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
david.robin...@corvidtec.com 
http://www.corvidtechnologies.com


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Hi David,

Can you please provide the log files? You can find them in 
/var/log/glusterfs.


Regards,
Raghavendra Bhat
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Snapshot and Data Tiering

2014-12-18 Thread Joseph Fernandes
Hi All, 

These are the minutes of the snapshot and data tiering interop meeting (apologies 
for the late update).

1) USS should not have problems with the changes made in DHT (DHT over DHT), as 
the USS xlator sits above DHT.
2) With the introduction of the heat-capturing DB, we have a few things to take 
care of when a snapshot of the brick is taken:
   a. Location of the sqlite3 files: Today the sqlite3 files reside by default 
inside the brick (brick_path/.glusterfs/). This makes taking a snapshot of the 
DB easy, as it is done via LVM along with the brick. If the location is outside 
the brick (which is configurable, e.g. keeping all the DB files on SSD for 
better performance), then while taking a snapshot glusterd would need to take a 
manual backup of these files, which would take some time and cause the gluster 
CLI to time out. So for the first cut we will keep the DB files inside the 
brick, until we have a solution for the CLI timeout.
   b. Type of the database: For the first cut we are considering only sqlite3, 
and sqlite3 works well with LVM snapshots. If a new DB type such as leveldb is 
introduced in the future, we need to investigate its compatibility with LVM 
snapshots, and this might be a deciding factor for having such a DB type in 
gluster.
   c. Checkpointing the sqlite3 DB: Before taking a snapshot, glusterd should 
issue a checkpoint command to the sqlite3 DB to flush all of the DB cache to 
disk (a minimal sketch of such a call follows this list).
      Action items on the data tiering team:
         1) Measure the time taken to do so, i.e. the checkpointing time.
         2) Provide a generic API in libgfdb to do so, OR handle the CTR 
xlator notification from glusterd to do the checkpointing.
      Action items on the snapshot team:
         1) Provide hooks to call the generic API, OR do the brick-ops to 
notify the CTR xlator.
   d. Snapshot-aware bricks: For a brick belonging to a snapshot, the CTR 
xlator should not record reads (which come from USS). Possible solutions:
         1) Send a CTR xlator notification after the snapshot brick is started, 
to turn off recording.
         2) OR, when the snapshot brick is started by glusterd, pass an option 
marking the brick as part of a snapshot. This is the more generic solution.
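
A minimal sketch (in C) of what the checkpoint call in 2c could look like,
assuming libgfdb or the CTR xlator can reach the underlying sqlite3 handle;
this is illustrative only and not an existing libgfdb API:

#include <sqlite3.h>
#include <stdio.h>

/* Flush the sqlite3 write-ahead log to disk before the LVM snapshot is
 * taken. The returned frame counts can be logged to help measure the
 * checkpointing time/size asked for in the first action item. */
int
ctr_db_checkpoint (sqlite3 *db)
{
        int log_frames = 0, ckpt_frames = 0;
        int ret = sqlite3_wal_checkpoint_v2 (db, NULL, SQLITE_CHECKPOINT_FULL,
                                             &log_frames, &ckpt_frames);
        if (ret != SQLITE_OK) {
                fprintf (stderr, "checkpoint failed: %s\n",
                         sqlite3_errmsg (db));
                return -1;
        }
        return 0;
}
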
3) The snapshot restore problem: When a snapshot is restored,
         1) It will bring the volume back to the point-in-time state. For 
example, if the current state of the volume is HOT tier 50% of data and COLD 
tier 50% of data, and the snapshot has the volume in the state HOT tier 20% 
and COLD tier 80%, a restore will bring the volume to HOT 20% / COLD 80%, 
i.e. it will undo all the promotions and demotions. This should be mentioned 
in the documentation.
         2) In addition, since the restored DB has times recorded in the past, 
files that were considered HOT in the past are now COLD. All the data will be 
moved to the COLD tier if a data tiering scanner runs after the restore of the 
snapshot. The documentation should recommend not running the data tiering 
scanner immediately after a restore of a snapshot; the system should be given 
time to learn the new heat patterns. The learning time depends on the nature 
of the workload.
4) During a data tiering activity, snapshot operations like create/restore 
should be disabled, just as is done while adding or removing a brick, which 
leads to a rebalance.

Let me know if anything else is missing or any corrections are required.

Regards,
Joe
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] AFR conservative merge portability

2014-12-18 Thread Ravishankar N
On 12/19/2014 08:23 AM, Emmanuel Dreyfus wrote:
> Ravishankar N  wrote:
> 
>> Point #1 would be addressed by your patch with some modifications (pending
>> review );
> 
> I addressed the points you raised but now my patch is failing just newly
> introduced ./tests/bugs/afr-quota-xattr-mdata-heal.t 
> 
> See there:
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/3215/console
> Some help would be welcome on that front.
> 

There seems to be one more catch. afr_is_dirtime_splitbrain() only compares 
equality of type, gfid, mode, uid and gid. We need to check whether 
application-set xattrs are equal as well.

mkdir /mnt/dir
kill brick0
setfattr -n user.attr1 -v value1 /mnt/dir
kill brick1, bring up brick0
sleep 10
touch /mnt/dir
bring both bricks up.

Now metadata heal must not be triggered.

-Ravi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] GlusterFS Volume backup API

2014-12-18 Thread Joseph Fernandes
Few concerns inline JOE>>

- Original Message -
From: "Aravinda" 
To: "gluster Devel" 
Sent: Thursday, December 18, 2014 10:38:20 PM
Subject: [Gluster-devel] GlusterFS Volume backup API

Hi, 


Today we discussed the GlusterFS backup API. Our plan is to provide a 
tool/API to get the list of changed files (full/incremental). 

Participants: Me, Kotresh, Ajeet, Shilpa 

Thanks to Paul Cuzner for providing inputs about pre and post hooks available 
in backup utilities like NetBackup. 


Initial draft: 
== 

Case 1 - Registered Consumer 
 

Consumer application has to register by giving a session name. 

glusterbackupapi register



When the following command is run for the first time, it will do a full scan; 
from the next run onwards it does an incremental scan. The start time for the 
incremental is the last backup time, and the end time is the current time. 

glusterbackupapi  --out-file=out.txt 

--out-file is an optional argument; the default output file name is `output.txt`. 
The output file will contain file paths. 



Case 2 - Unregistered Consumer 
- 

Start time and end time information will not be remembered; every time, the 
consumer has to send the start time and end time for an incremental backup. 

For Full backup, 

glusterbackupapi full   --out-file=out.txt 

For Incremental backup, 

glusterbackupapi inc --out-file=out.txt 

where STARTTIME and ENDTIME are in unix timestamp format. 


Technical overview 
== 
1. Using host and volume name arguments, it fetches volume info and volume 
status to get the list of up bricks/nodes. 
2. Executes brick/node agent to get required details from brick. (TBD: 
communication via RPC/SSH/gluster system:: execute) 
3. For a full scan, the brick/node agent gets the list of files from that brick 
backend and generates an output file. 
4. For an incremental scan, it calls the Changelog History API, gets the distinct 
GFID list, and then converts each GFID to a path. 
5. Generated output files from each brick node will be copied to initiator 
node. 
6. Merges all the output files from bricks and removes duplicates. 
7. In case of session based access, session information will be saved by each 
brick/node agent. 


Issues/Challenges 
= 
1. What if timestamps differ across gluster nodes? We are assuming the timestamp 
will remain the same across a cluster. 
2. If a brick is down, how do we handle it? We are assuming all the bricks should 
be up to initiate a backup (at least one from each replica). 
3. If the changelog is not available, or is broken between the start time and end 
time, how do we get the incremental file list? As a prerequisite, the changelog 
should be enabled before backup. 

JOE >> Performance overhead on the IO path when the changelog is switched on: I 
think getting numbers or a performance matrix here would be crucial, as it's not 
desirable to sacrifice file IO performance to support the backup API or any data 
maintenance activity. 

4. GFID to path conversion, using `find -samefile` or using 
`glusterfs.pathinfo` xattr on aux-gfid-mount. 
5. Deleted files: if we get the GFID of a deleted file from the changelog, how do 
we find its path? Does the backup API require the list of deleted files? 

JOE >> 
1) "find" would not be a good option here as you have to traverse through the 
whole namespace. Takes a toll on the spindle based media.
2) "glusterfs.pathinfo" xattr is a feasible approach but has its own problems,
a. This xattr comes only with quota, So you need to decouple it from quota.
b. This xattr should be enabled from the beginning of namespace i.e if 
enable later you will some file which will
   have this xattr and some which wont have it. This issue is true for any 
meta storing approach in gluster for eg : DB, Changelog etc
c. I am not sure if this xattr has a support for multiple had links. I am 
not sure if you (the backup scenario) would require it or not.
   Just food for thought. 
d. This xattr is not crash consistent with power failures. That means you 
may be in a state where few inodes will have the xattr and few won't. 
3) Agree with the delete problem. This problem gets worse with multiple hard 
links. If some hard links are recorded and few are not recorded.
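
For reference, a minimal sketch (in C) of the pathinfo-based GFID-to-path
lookup mentioned in point 4 above, assuming the volume is mounted with the
aux-gfid-mount option; the mount path is hypothetical and the exact virtual
xattr key ("glusterfs.pathinfo") should be double-checked against the
gfid-to-path documentation:

#include <stdio.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Resolve a GFID to its path(s) by reading the pathinfo virtual xattr
 * through the .gfid/ virtual directory of an aux-gfid mount. The returned
 * string contains brick backend paths and still needs post-processing. */
int
gfid_to_path (const char *mountpoint, const char *gfid,
              char *out, size_t outlen)
{
        char    gfid_path[4096];
        ssize_t len;

        snprintf (gfid_path, sizeof (gfid_path), "%s/.gfid/%s",
                  mountpoint, gfid);
        len = getxattr (gfid_path, "glusterfs.pathinfo", out, outlen - 1);
        if (len < 0) {
                perror ("getxattr");
                return -1;
        }
        out[len] = '\0';
        return 0;
}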

6. Storing session info on each brick node. 
7. Communication channel between nodes: RPC/SSH/gluster system:: execute, 
etc.? 


Kotresh, Ajeet, Please add if I missed any points. 


-- 
regards 
Aravinda 

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Volume management proposal (4.0)

2014-12-18 Thread Krishnan Parthasarathi
> It seems simplest to store child-parent relationships (one to one)
> instead of parent-child relationships (one to many).  Based on that, I
> looked at some info files and saw that we're already using
> "parent_volname" for snapshot stuff.  Maybe we need to change
> terminology.  Let's say that we use "part-of" in the info file.

The above persistence scheme makes querying for the volumes affected by a change
to a given volume linear in the length of the path from that volume to the primary
volume, the 'root', in the graph of volumes. The alternative would involve going
through every volume to check whether the changed volume affects it, which is linear
in the number of volumes in the cluster. This computational complexity makes me favour
storing child-parent relationships in the secondary volumes. The only downside
is that we need to 'lock down' secondary volumes from being modified. I don't
have a way (yet) to measure the effect this would have on (concurrent)
modifications to the secondary volumes of a given primary volume.
> 
> * Create a new string-valued glusterd_volinfo_t.part_of field.
> 
> * This gets filled in from glusterd_store_update_volinfo along with
>   everything else from the info file.
> 
> * When a composite volume is created, its component volumes' info files
>   are rewritten.
> 
> * When a component volume is modified, use the part_of field to find its
>   parent.  We then generate the fully-resolved client volfiles before
>   and after the change and compare for differences.
> 
> * If we find differences in the parent, process the change as though it
>   had been made on the parent (triggering graph switches etc.) and then
>   use the parent's part_of field to repeat the process one level up.
> 
> I don't think we need to do anything for server-side-only changes, since
> those will already be handled (e.g. starting new bricks) by the existing
> infrastructure.  However, things like NFS and quotad might need to go
> through the same process outlined above for clients.

This makes sense.

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] AFR conservative merge portability

2014-12-18 Thread Emmanuel Dreyfus
Ravishankar N  wrote:

> Point #1 would be addressed by your patch with some modifications (pending
> review );

I addressed the points you raised but now my patch is failing just newly
introduced ./tests/bugs/afr-quota-xattr-mdata-heal.t 

See there:
http://build.gluster.org/job/rackspace-regression-2GB-triggered/3215/console
Some help would be welcome on that front.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Updates to operating-version

2014-12-18 Thread James
On Thu, Dec 18, 2014 at 2:30 AM, Kaushal M  wrote:
> In that case, I should send a note as the op-version has been bumped
> for the master branch.
>
> Please take note,
> The operating-version for the master branch has been bumped to
> '30700', which is aligned with the next release of GlusterFS, 3.7.

Cool, thanks.
As a reference, the four-line patch looks like:

https://github.com/purpleidea/puppet-gluster/commit/c2291084cf818d0058a66dcbc0984bcea7b51252

and is now in git master. Future patches are welcome :)

Cheers,
James
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Updates to operating-version

2014-12-18 Thread James
On Thu, Dec 18, 2014 at 11:40 AM, Joe Julian  wrote:
> James, why not just compute the operating version? After 3.5.0 it's always
> XYYZZ based on the version.
>
> Something along the lines of
>
> $version_array = split("${gluster_version}", '[.]')
> if $version_array[0] < 3 {
>   fail("Unsupported GlusterFS Version")
> }
> $operating_version = $version_array[2] ? {
>   '4' => '2',
>   '5' => $version_array[3] ? {
> '0' => '3',
> default => sprintf("%d%02d%02d", $version_array),
> },
>   default => sprintf("%d%02d%02d", $version_array),
> }
>
>
> Perhaps a CLI command to fetch the GD_OP_VERSION_MAX might be beneficial as
> well.

This is a very good point actually... In fact, it begs the question:
If it can be computed from the version string, why doesn't GlusterFS
do this internally in libglusterfs/src/globals.h ?
I'm guessing perhaps there's a reason your computation isn't always correct...

Since that's not the case, I figured I'd just match whatever Gluster
is doing by actually storing the values in a yaml (hiera) "table". For
now I think it's fine, but if someone has better information, lmk!
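
For what it's worth, a hedged sketch (in C) of the mapping Joe describes
(3.4.x -> 2, 3.5.0 -> 3, XYYZZ afterwards); this is illustrative only and
not the actual logic in libglusterfs/src/globals.h:

#include <stdio.h>

/* Derive an op-version from a release string, special-casing the
 * historical releases that predate the XYYZZ convention. */
int
op_version_from_release (const char *release)
{
        int x = 0, y = 0, z = 0;

        if (sscanf (release, "%d.%d.%d", &x, &y, &z) < 2)
                return -1;                      /* unparseable */
        if (x < 3 || (x == 3 && y < 4))
                return -1;                      /* not handled here */
        if (x == 3 && y == 4)
                return 2;                       /* all of 3.4.x */
        if (x == 3 && y == 5 && z == 0)
                return 3;                       /* 3.5.0 */
        return x * 10000 + y * 100 + z;         /* XYYZZ from 3.5.1 onwards */
}
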

>
>
> On 12/17/2014 11:30 PM, Kaushal M wrote:
>>
>> In that case, I should send a note as the op-version has been bumped
>> for the master branch.
>>
>> Please take note,
>> The operating-version for the master branch has been bumped to
>> '30700', which is aligned with the next release of GlusterFS, 3.7.
>>
>> ~kaushal
>>
>> On Thu, Dec 18, 2014 at 12:49 PM, Lalatendu Mohanty 
>> wrote:
>>>
>>> On 12/17/2014 07:39 PM, Niels de Vos wrote:

 On Wed, Dec 17, 2014 at 08:40:18AM -0500, James wrote:
>
> Hello,
>
> If you plan on updating the operating-version value of GlusterFS,
> please
> either ping me (@purpleidea) or send a patch to puppet-gluster [1].
> Patches are 4 line yaml files, and you don't need any knowledge of
> puppet or yaml to do so.
>
> Example:
>
> +# gluster/data/versions/3.6.yaml
> +---
> +gluster::versions::operating_version: '30600' # v3.6.0
> +# vim: ts=8
>
> As seen at:
>
>
>
> https://github.com/purpleidea/puppet-gluster/commit/43c60d2ddd6f57d2117585dc149de6653bdabd4b#diff-7cb3f60a533975d869ffd4a772d66cfeR1
>
> Thanks for your cooperation! This will ensure puppet-gluster can always
> correctly work with new versions of GlusterFS.

 How about you post a patch that adds this request as a comment in the
 glusterfs sources (libglusterfs/src/globals.h)?

 Or, maybe this should be noted on some wiki page, and have the comment
 point to the wiki instead. Maybe other projects start to use the
 op-version in future too, and they also need to get informed about a
 change.

>>> IMO we should make it a practice to send a mail to gluster-devel whenever
>>> a patch is sent to increase the operating-version, similar to the practice
>>> Fedora follows for an SO version bump.
>>>
>>> -Lala
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Updates to operating-version

2014-12-18 Thread James
On Wed, Dec 17, 2014 at 9:09 AM, Niels de Vos  wrote:
> On Wed, Dec 17, 2014 at 08:40:18AM -0500, James wrote:
>
> How about you post a patch that adds this request as a comment in the
> glusterfs sources (libglusterfs/src/globals.h)?

Good idea actually... Please review/ack/merge :)

http://review.gluster.org/#/c/9301/

>
> Or, maybe this should be noted on some wiki page,
Already updated the wiki yesterday...

https://www.gluster.org/community/documentation/index.php/OperatingVersions

> and have the comment
> point to the wiki instead. Maybe other projects start to use the
> op-version in future too, and they also need to get informed about a
> change.

If that becomes the case, we can change this :)
See a comment about this in my next email...

Thanks!
James

>
> Thanks,
> Niels
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] GlusterFS Volume backup API

2014-12-18 Thread Aravinda

Hi,


Today we discussed the GlusterFS backup API. Our plan is to provide a 
tool/API to get the list of changed files (full/incremental).


Participants: Me, Kotresh, Ajeet, Shilpa

Thanks to Paul Cuzner for providing inputs about pre and post hooks 
available in backup utilities like NetBackup.


Initial draft:
==

Case 1 - Registered Consumer


Consumer application has to register by giving a session name.

glusterbackupapi register   



When the following command is run for the first time, it will do a full scan; 
from the next run onwards it does an incremental scan. The start time for the 
incremental is the last backup time, and the end time is the current time.


glusterbackupapi  --out-file=out.txt

--out-file is an optional argument; the default output file name is 
`output.txt`. The output file will contain file paths.




Case 2 - Unregistered Consumer
-

Start time and end time information will not be remembered; every time, the 
consumer has to send the start time and end time for an incremental backup.


For Full backup,

glusterbackupapi full   --out-file=out.txt

For Incremental backup,

glusterbackupapi inc 
--out-file=out.txt


where STARTTIME and ENDTIME are in unix timestamp format.


Technical overview
==
1. Using host and volume name arguments, it fetches volume info and 
volume status to get the list of up bricks/nodes.
2. Executes brick/node agent to get required details from brick. (TBD: 
communication via RPC/SSH/gluster system:: execute)
3. For a full scan, the brick/node agent gets the list of files from that 
brick backend and generates an output file.
4. For an incremental scan, it calls the Changelog History API, gets the 
distinct GFID list, and then converts each GFID to a path.
5. Generated output files from each brick node will be copied to 
initiator node.

6. Merges all the output files from bricks and removes duplicates.
7. In case of session based access, session information will be saved by 
each brick/node agent.



Issues/Challenges
=
1. What if timestamps differ across gluster nodes? We are assuming the 
timestamp will remain the same across a cluster.
2. If a brick is down, how do we handle it? We are assuming all the bricks 
should be up to initiate a backup (at least one from each replica).
3. If the changelog is not available, or is broken between the start time and 
end time, how do we get the incremental file list? As a prerequisite, the 
changelog should be enabled before backup.
4. GFID to path conversion, using `find -samefile` or using 
`glusterfs.pathinfo` xattr on aux-gfid-mount.
5. Deleted files: if we get the GFID of a deleted file from the changelog, how 
do we find its path? Does the backup API require the list of deleted files?

6. Storing session info on each brick node.
7. Communication channel between nodes: RPC/SSH/gluster system:: 
execute, etc.?



 Kotresh, Ajeet, Please add if I missed any points.


 --
 regards
 Aravinda
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Updates to operating-version

2014-12-18 Thread Joe Julian

Or maybe just add that to the version string.

On 12/18/2014 08:40 AM, Joe Julian wrote:
Perhaps a CLI command to fetch the GD_OP_VERSION_MAX might be 
beneficial as well. 


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Updates to operating-version

2014-12-18 Thread Joe Julian
James, why not just compute the operating version? After 3.5.0 it's 
always XYYZZ based on the version.


Something along the lines of

$version_array = split("${gluster_version}", '[.]')
if $version_array[0] < 3 {
  fail("Unsupported GlusterFS Version")
}
$operating_version = $version_array[2] ? {
  '4' => '2',
  '5' => $version_array[3] ? {
'0' => '3',
default => sprintf("%d%02d%02d", $version_array),
},
  default => sprintf("%d%02d%02d", $version_array),
}


Perhaps a CLI command to fetch the GD_OP_VERSION_MAX might be beneficial 
as well.


On 12/17/2014 11:30 PM, Kaushal M wrote:

In that case, I should send a note as the op-version has been bumped
for the master branch.

Please take note,
The operating-version for the master branch has been bumped to
'30700', which is aligned with the next release of GlusterFS, 3.7.

~kaushal

On Thu, Dec 18, 2014 at 12:49 PM, Lalatendu Mohanty  wrote:

On 12/17/2014 07:39 PM, Niels de Vos wrote:

On Wed, Dec 17, 2014 at 08:40:18AM -0500, James wrote:

Hello,

If you plan on updating the operating-version value of GlusterFS, please
either ping me (@purpleidea) or send a patch to puppet-gluster [1].
Patches are 4 line yaml files, and you don't need any knowledge of
puppet or yaml to do so.

Example:

+# gluster/data/versions/3.6.yaml
+---
+gluster::versions::operating_version: '30600' # v3.6.0
+# vim: ts=8

As seen at:


https://github.com/purpleidea/puppet-gluster/commit/43c60d2ddd6f57d2117585dc149de6653bdabd4b#diff-7cb3f60a533975d869ffd4a772d66cfeR1

Thanks for your cooperation! This will ensure puppet-gluster can always
correctly work with new versions of GlusterFS.

How about you post a patch that adds this request as a comment in the
glusterfs sources (libglusterfs/src/globals.h)?

Or, maybe this should be noted on some wiki page, and have the comment
point to the wiki instead. Maybe other projects start to use the
op-version in future too, and they also need to get informed about a
change.


IMO we should make it a practice to send a mail to gluster-devel whenever a
patch is sent to increase the operating-version, similar to the practice
Fedora follows for an SO version bump.

-Lala
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Readdir d_off encoding

2014-12-18 Thread Xavier Hernandez

On 12/18/2014 05:11 PM, Shyam wrote:

On 12/17/2014 05:04 AM, Xavier Hernandez wrote:

Just to consider all possibilities...

The current architecture needs to create the whole directory structure on all
bricks, and has the big problem that each directory on each brick will
store the files in a different order and with different d_off values.


I gather that this is when EC or AFR is in place, as for DHT a file is
on one brick only.


Files are only on one brick, but directories are on all bricks. This is 
independent of having ec or afr in place.


This makes directory access quite complex in some cases. For example, if 
a readdir is made on one brick and that brick dies, the next readdir 
cannot be continued on another brick, at least not without some complex 
handling. This is the consequence of having a directory on each brick as 
if they were replicated, even though these directories are not exactly 
equal.


Also, this architecture forces ec to have directories replicated, which 
adds complexity.






This is a serious scalability issue and has many inconveniences when
trying to heal or detect inconsistencies between bricks (basically we
would need to read the full directory contents of each brick to compare
them).


I am not quite familiar with EC so pardon the ignorance.
Why/How does d_off play a role in this healing/crawling?


This problem is also present in afr. There are two easy-to-see problems:

* If multiple readdir requests are needed to get the full contents of a 
directory and the brick to which the requests are being sent dies, the next 
readdir request cannot be sent to any other brick because the d_off field 
won't make sense on the other brick. This doesn't have an easy solution, so 
an error is returned instead of completing the directory listing. This is 
odd because in theory we have the directory replicated and this shouldn't 
happen (the same scenario, but reading from a file, is handled transparently 
to the client).

* If you need to detect the differences between the directory contents on 
different bricks (for example when you want to heal a directory), you will 
need to read the full contents of the directory from each brick into memory, 
sort each list, and begin the comparison. If that directory contains, for 
example, one million entries, that would need a huge amount of memory for an 
operation that seems quite simple. If all bricks returned directory entries 
in the same order and with the same d_off, this procedure would need a lot 
less memory and would be more efficient.
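
As an illustration of the memory argument only (not gluster code): if every
brick returned entries in the same order with the same gluster-assigned d_off,
two bricks' listings could be compared as streams, one entry at a time, instead
of loading and sorting millions of entries per brick. The next_entry() iterator
below is hypothetical, and the sketch assumes d_off increases monotonically in
listing order.

#include <stdint.h>
#include <stdio.h>

struct dentry {
        uint64_t d_off;         /* gluster-assigned, identical across bricks */
        char     name[256];
        int      eof;
};

/* Hypothetical iterator over one brick's readdir stream. */
extern struct dentry next_entry (int brick_id);

void
compare_bricks (int brick_a, int brick_b)
{
        struct dentry a = next_entry (brick_a);
        struct dentry b = next_entry (brick_b);

        while (!a.eof || !b.eof) {
                if (b.eof || (!a.eof && a.d_off < b.d_off)) {
                        printf ("missing on brick %d: %s\n", brick_b, a.name);
                        a = next_entry (brick_a);
                } else if (a.eof || b.d_off < a.d_off) {
                        printf ("missing on brick %d: %s\n", brick_a, b.name);
                        b = next_entry (brick_b);
                } else {
                        /* same d_off on both bricks: same entry, advance both */
                        a = next_entry (brick_a);
                        b = next_entry (brick_b);
                }
        }
}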






An alternative would be to convert directories into regular files from
the brick point of view.

The benefits of this would be:

* d_off would be controlled by gluster, so all bricks would have the
same d_off and order. No need to use any d_off mapping or transformation.

* Directories could take advantage of replication and disperse self-heal
procedures. They could be treated as files and be healed more easily. A
corrupted brick would not produce invalid directory contents, and file
duplication in directory listing would be avoided.

* Many of the complexities in DHT, AFR and EC to manage directories
would be removed.

The main issue could be the need of an upper level xlator that would
transform directory requests into file modifications and would be
responsible of managing all d_off assignment and directory manipulation
(renames, links, unlinks, ...).


This is tending towards some thoughts for Gluster 4.0 and specifically
DHT in 4.0. I am going to wait for the same/similar comments as we
discuss those specifics (hopefully published before Christmas (2014)).


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Readdir d_off encoding

2014-12-18 Thread Shyam

On 12/17/2014 05:04 AM, Xavier Hernandez wrote:

Just to consider all possibilities...

The current architecture needs to create the whole directory structure on all
bricks, and has the big problem that each directory on each brick will
store the files in a different order and with different d_off values.


I gather that this is when EC or AFR is in place, as for DHT a file is 
on one brick only.




This is a serious scalability issue and has many inconveniences when
trying to heal or detect inconsistencies between bricks (basically we
would need to read the full directory contents of each brick to compare them).


I am not quite familiar with EC so pardon the ignorance.
Why/How does d_off play a role in this healing/crawling?



An alternative would be to convert directories into regular files from
the brick point of view.

The benefits of this would be:

* d_off would be controlled by gluster, so all bricks would have the
same d_off and order. No need to use any d_off mapping or transformation.

* Directories could take advantage of replication and disperse self-heal
procedures. They could be treated as files and be healed more easily. A
corrupted brick would not produce invalid directory contents, and file
duplication in directory listing would be avoided.

* Many of the complexities in DHT, AFR and EC to manage directories
would be removed.

The main issue could be the need of an upper level xlator that would
transform directory requests into file modifications and would be
responsible of managing all d_off assignment and directory manipulation
(renames, links, unlinks, ...).


This is tending towards some thoughts for Gluster 4.0 and specifically 
DHT in 4.0. I am going to wait for the same/similar comments as we 
discuss those specifics (hopefully published before Christmas (2014)).




Xavi

On 12/16/2014 03:06 AM, Anand Avati wrote:

Replies inline

On Mon Dec 15 2014 at 12:46:41 PM Shyam  wrote:

With the changes present in [1] and [2],

A short explanation of the change would be: we encode the subvol ID in
the d_off, losing n+1 bits in case the high-order n+1 bits of the d_off
returned by the underlying xlator are not free. (Best to read the commit
message for [1] :) )

Although not related to the latest patch, here is something to
consider
for the future:

We now have DHT, AFR, EC(?), DHT over DHT (Tier) which need subvol
encoding in the returned readdir offset. Due to this, the loss in bits
_may_ cause unwanted offset behavior when used in the current scheme,
as we would end up eating more bits than we do at present.

Or IOW, we could be invalidating the assumption "both EXT4/XFS are
tolerant in terms of the accuracy of the value presented
back in seekdir().


XFS has not been a problem, since it always returns 32bit d_off. With
Ext4, it has been noted that it is tolerant to sacrificing the lower
bits in accuracy.

i.e, a seekdir(val) actually seeks to the entry which
has the "closest" true offset."

Should we reconsider an in memory _cookie_ like approach that can
help
in this case?

It would invalidate (some or all based on the implementation) the
following constraints that the current design resolves, (from, [1])
- Nothing to "remember in memory" or evict "old entries".
- Works fine across NFS server reboots and also NFS head failover.
- Tolerant to seekdir() to arbitrary locations.

But, would provide a more reliable readdir offset for use (when valid
and not evicted, say).

How would NFS adapt to this? Does Ganesha need a better scheme when
doing multi-head NFS fail over?


Ganesha just offloads the responsibility to the FSAL layer to give
stable dir cookies (as it rightly should)


Thoughts?


I think we need to analyze the actual assumption/problem here.
Remembering things in memory comes with the limitations you note above,
and may after all, still not be necessary. Let's look at the two
approaches taken:

- Small backend offsets: like XFS, the offsets fit in 32bits, and we are
left with another 32bits of freedom to encode what we want. There is no
problem here until our nested encoding requirements cross 32bits of
space. So let's ignore this for now.

- Large backend offsets: Ext4 being the primary target. Here we observe
that the backend filesystem is tolerant to sacrificing the accuracy of
lower bits. So we overwrite the lower bits with our subvolume encoding
information, and the number of bits used to encode is implicit in the
subvolume cardinality of that translator. While this works fine with a
single transformation, it is clearly a problem when the transformation
is nested with the same algorithm. The reason is quite simple: while the
lower bits were disposable when the cookie was taken fresh from Ext4,
once transformed the same lower bits are now "holy" and cannot be
overwritten carelessly, at least without dire consequences. The higher
level xlators n
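
To make the bit arithmetic concrete, here is a simplified sketch (in C) of the
low-bit encoding idea described above; it is illustrative only, not the actual
GlusterFS implementation, and it ignores the variant that uses the free high
bits when the backend offset is small:

#include <stdint.h>

static int
bits_for (int nsubvols)
{
        int bits = 0;
        while ((1 << bits) < nsubvols)
                bits++;
        return bits;
}

/* Overwrite the lowest n bits of the backend d_off with the subvolume
 * index. The original low bits are lost; that is exactly the accuracy
 * being sacrificed, and a nested transformation would eat further bits. */
static uint64_t
encode_doff (uint64_t backend_off, int subvol_idx, int nsubvols)
{
        int n = bits_for (nsubvols);
        return (backend_off & ~((1ULL << n) - 1)) | (uint64_t) subvol_idx;
}

static void
decode_doff (uint64_t wire_off, int nsubvols,
             uint64_t *backend_off, int *subvol_idx)
{
        int n = bits_for (nsubvols);
        *subvol_idx  = (int) (wire_off & ((1ULL << n) - 1));
        *backend_off = wire_off & ~((1ULL << n) - 1);
}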

Re: [Gluster-devel] Volume management proposal (4.0)

2014-12-18 Thread Jeff Darcy
> Persisting the volume relationships in the volume info file
> (i.e, /var/lib/glusterd/vols/VOLNAME/info) is a good idea. With this we could
> contain volume (relationship) management within the management plane.
> Do you have ideas on how to persist volume relationships?

It seems simplest to store child-parent relationships (one to one)
instead of parent-child relationships (one to many).  Based on that, I
looked at some info files and saw that we're already using
"parent_volname" for snapshot stuff.  Maybe we need to change
terminology.  Let's say that we use "part-of" in the info file.

* Create a new string-valued glusterd_volinfo_t.part_of field.

* This gets filled in from glusterd_store_update_volinfo along with
  everything else from the info file.

* When a composite volume is created, its component volumes' info files
  are rewritten.

* When a component volume is modified, use the part_of field to find its
  parent.  We then generate the fully-resolved client volfiles before
  and after the change and compare for differences.

* If we find differences in the parent, process the change as though it
  had been made on the parent (triggering graph switches etc.) and then
  use the parent's part_of field to repeat the process one level up.

I don't think we need to do anything for server-side-only changes, since
those will already be handled (e.g. starting new bricks) by the existing
infrastructure.  However, things like NFS and quotad might need to go
through the same process outlined above for clients.
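
A hedged sketch (in C) of the propagation loop described above; the field and
helper names (part_of, find_volinfo_by_name, volfiles_differ,
process_change_on) are illustrative placeholders, not existing glusterd APIs:

struct volinfo {
        char name[256];
        char part_of[256];      /* empty when this is a primary volume */
};

/* Hypothetical helpers. */
extern struct volinfo *find_volinfo_by_name (const char *name);
extern int volfiles_differ (struct volinfo *vol);
extern void process_change_on (struct volinfo *vol);

void
propagate_change (struct volinfo *changed)
{
        struct volinfo *vol = changed;

        /* Walk the child-to-parent links until we reach a primary volume or
         * a parent whose fully-resolved client volfile did not change. */
        while (vol && vol->part_of[0] != '\0') {
                struct volinfo *parent = find_volinfo_by_name (vol->part_of);
                if (!parent || !volfiles_differ (parent))
                        break;
                process_change_on (parent);     /* graph switch etc. */
                vol = parent;
        }
}

Because only child-to-parent links are stored, each modification walks at most
the depth of the volume graph rather than the whole set of volumes.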
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Volume management proposal (4.0)

2014-12-18 Thread Krishnan Parthasarathi
> > IIUC, (E) describes that primary volume file would be generated with all
> > secondary volume references resolved. Wouldn't that preclude the
> > possibility
> > of the respective processes discovering the dependencies?
> 
> That's not entirely clear.  The *volfiles* might not contain the necessary
> information, depending on exactly when we resolve references to other
> volumes - e.g. at creation, volfile-fetch, or even periodically.  However,

I assumed from (E) that the volfiles of the primary volumes were generated at the
time of volume creation, where all references to constituent secondary volumes
would be resolved. Yes, there are other places where this could be done, giving
processes a chance to detect changes deeper in the graph (of volumes).

> the *info* files from which the volfiles are generated would necessarily
> include at least the parent-to-child links.  We could scan all info files
> during any configuration change to find such dependencies.  Even better,
> we could also store the child-to-parent links as well, so when we change
> X we *immediately* know whether that might affect a parent volume.
Persisting the volume relationships in the volume info file
(i.e, /var/lib/glusterd/vols/VOLNAME/info) is a good idea. With this we could
contain volume (relationship) management within the management plane.
Do you have ideas on how to persist volume relationships?

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] AFR conservative merge portability

2014-12-18 Thread Ravishankar N
On 12/18/2014 01:28 PM, Emmanuel Dreyfus wrote:
> On Mon, Dec 15, 2014 at 03:21:24PM -0500, Jeff Darcy wrote:
>> Is there *any* case, not even necessarily involving conservative merge,
>> where it would be harmful to propagate the latest ctime/mtime for any
>> replica of a directory?
> 
> In case of conservative merge, the problem vanishes on its own anyway:
> adding entries updates the parent directory ctime/mtime and the reported
> split brain does not exist anymore.
> 
> Here is a first attempt, please comment:
> http://review.gluster.org/9291
> 

Hi Emmanuel,
So we (AFR team) had a discussion and came up with two things that need to be 
done w.r.t. this issue:

1. First, in metadata heal, if the metadata split-brain is only due to [am]time, 
heal the file, choosing as source the replica having the max of atime/mtime.

2. Currently in entry self-heal, after a conservative merge, the dir's timestamp 
is updated using the time when the self-heal happened and not that of the dirs on 
the bricks. This needs to be changed to use the timestamp of the source having 
the max mtime, similar to what data self-heal does in 
afr_selfheal_data_restore_time().

Point #1 would be addressed by your patch with some modifications (pending 
review); that just leaves #2 to be done.
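
A tiny illustrative sketch (in C) of the source-selection rule in point 1;
the structures are hypothetical and AFR's real code, which inspects the
replies' iatt data, differs in detail:

#include <time.h>

struct replica_reply {
        int    valid;           /* reply usable (brick up, op succeeded) */
        time_t mtime;
};

/* Return the index of the replica with the greatest mtime, or -1. */
int
pick_mtime_source (struct replica_reply *replies, int nchildren)
{
        int    source = -1;
        time_t max_mtime = 0;

        for (int i = 0; i < nchildren; i++) {
                if (!replies[i].valid)
                        continue;
                if (source == -1 || replies[i].mtime > max_mtime) {
                        max_mtime = replies[i].mtime;
                        source = i;
                }
        }
        return source;
}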

Thanks,
Ravi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel