Re: [Gluster-devel] distributed files/directories and [cm]time updates

2016-01-26 Thread Venky Shankar
On Tue, Jan 26, 2016 at 08:51:37AM +0100, Xavier Hernandez wrote:
> Hi Pranith,
> 
> On 26/01/16 03:47, Pranith Kumar Karampuri wrote:
> >Hi,
> >   Traditionally, gluster has used the ctime/mtime of the files/dirs
> >on the bricks as stat output. The problem we are seeing with this
> >approach is that software which depends on these times gets confused
> >when they differ between bricks. Tar especially gives "file changed as
> >we read it" whenever it detects ctime differences when stat is served
> >from different bricks. The way we have been trying to solve it is to
> >serve the stat structures from the same brick in afr, and the max time
> >in dht. But that doesn't avoid the problem completely. Because there is
> >no way to change ctime at the moment (lutimes() only allows setting
> >mtime and atime), there is little we can do to make sure ctimes match
> >after self-heals/xattr updates/rebalance. I am wondering if any of you
> >have solved these problems before; if yes, how did you go about doing
> >it? It seems like applications which depend on this for backups get
> >confused the same way. The only way out I see is to move ctime into an
> >xattr, but that will need more iops, and gluster has to keep updating
> >it on quite a few fops.
> 
> I did think about this when I was writing ec at the beginning. The idea was
> that the point in time at which each fop is executed would be controlled by
> the client by adding a special xattr to each regular fop. Of course this would
> require support inside the storage/posix xlator. At that time, adding the
> needed support to other xlators seemed too complex for me, so I decided to
> do something similar to afr.
> 
> Anyway, the idea was like this: for example, when a write fop needs to be
> sent, dht/afr/ec sets the current time in a special xattr, for example
> 'glusterfs.time'. It can be done in a way that if the time is already set by
> a higher xlator, it's not modified. This way DHT could set the time in fops
> involving multiple afr subvolumes. For other fops, it would be afr that sets
> the time. It could also be set directly by the topmost xlator (fuse), but that
> time could be incorrect because lower xlators could delay the fop execution
> and reorder it. This would need more thinking.
> 
> That xattr will be received by storage/posix. This xlator will determine
> what times need to be modified and will change them. In the case of a write,
> it can decide to modify mtime and, maybe, atime. For a mkdir or create, it
> will set the times of the new file/directory and also the mtime of the
> parent directory. It depends on the specific fop being processed.
> 
> mtime, atime and ctime (or even others) could be saved in a special posix
> xattr instead of relying on the file system attributes that cannot be
> modified (at least for ctime).
> 
> This solution doesn't require extra fops, so it seems quite clean to me. The
> additional I/O needed in posix could be minimized by implementing a metadata
> cache in storage/posix that would read all metadata on lookup and update it
> on disk only at regular intervals and/or on invalidation. All fops would
> read/write into the cache. This would even reduce the number of I/Os we are
> currently doing for each fop.
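As a rough illustration of the proposal quoted above (Python for brevity, not
gluster code): the cluster xlators stamp a fop with a time only if no higher
xlator has done so, and storage/posix then applies that single value on every
brick. Only the xattr name 'glusterfs.time' comes from the proposal; the
function names, the xdata dict and the per-fop table are assumptions made up
for this sketch.

```python
import time

TIME_KEY = "glusterfs.time"  # xattr name suggested in the proposal

# Hypothetical table: which times of the target object each fop updates.
FOP_TIMES = {
    "write": ("mtime", "ctime"),
    "setxattr": ("ctime",),
    "create": ("atime", "mtime", "ctime"),
    "mkdir": ("atime", "mtime", "ctime"),
}
# Fops that also change the parent directory's times.
PARENT_FOPS = {"create", "mkdir"}


def stamp_time(xdata, now=None):
    """Cluster xlator side: set the fop time unless a higher xlator did."""
    if TIME_KEY not in xdata:
        xdata[TIME_KEY] = time.time() if now is None else now
    return xdata


def apply_times(fop, xdata, target, parent=None):
    """storage/posix side: update the stored times from the stamped value."""
    stamp = xdata[TIME_KEY]
    for field in FOP_TIMES[fop]:
        target[field] = stamp
    if parent is not None and fop in PARENT_FOPS:
        parent["mtime"] = stamp
        parent["ctime"] = stamp
    return target


# dht stamps first; afr sees the existing value and leaves it alone.
xdata = stamp_time({}, now=100.0)
stamp_time(xdata, now=105.0)
# Two bricks apply the same fop and end up with identical times.
brick_a = apply_times("write", xdata, {})
brick_b = apply_times("write", xdata, {})
```

Because both bricks take the time from the fop rather than from their own
clock, the stored values match regardless of which brick later serves the stat.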

That's exactly the route taken by DHTv2, although it includes other metadata
such as size, type, etc. For the current gluster model, the above approach is
clean.

However, with DHTv2, metadata is stored separately from data, in the MDS (or
rather MDC: metadata cluster). The tricky part with the MDS/DS split is
maintaining filesystem metadata consistency, especially object size, after a
hard reboot of node(s).

Interested folks (in DHTv2 or generally) may take a look at this document (just
an initial draft):

   https://review.gerrithub.io/#/c/253517/

> 
> Xavi
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] distributed files/directories and [cm]time updates

2016-01-26 Thread Xavier Hernandez

Hi Joseph,

On 26/01/16 10:42, Joseph Fernandes wrote:

Hi Xavi,

Answer inline:

- Original Message -
From: "Xavier Hernandez" <xhernan...@datalab.es>
To: "Joseph Fernandes" <josfe...@redhat.com>
Cc: "Pranith Kumar Karampuri" <pkara...@redhat.com>, "Gluster Devel" 
<gluster-devel@gluster.org>
Sent: Tuesday, January 26, 2016 2:09:43 PM
Subject: Re: [Gluster-devel] distributed files/directories and [cm]time updates

Hi Joseph,

If we want to have everything in physical storage at all times, gluster
will be slow. We only need to be posix compliant, and posix allows some
degree of "inconsistency" here, i.e. we are not forced to write to
physical storage until the user application sends a flush or similar
request. Note that there are xlators that currently take advantage of
this, for example write-behind and md-cache.

Almost all file systems (if not all) rely on this to improve
performance; otherwise they would be really slow.

JOE: Agree


Of course this could cause a temporary inconsistency between bricks, but
since all cluster xlators (dht, afr and ec) use special xattrs to track
consistency, 

Re: [Gluster-devel] distributed files/directories and [cm]time updates

2016-01-26 Thread Joseph Fernandes
Hi Xavi,

Answer inline:

- Original Message -
From: "Xavier Hernandez" <xhernan...@datalab.es>
To: "Joseph Fernandes" <josfe...@redhat.com>
Cc: "Pranith Kumar Karampuri" <pkara...@redhat.com>, "Gluster Devel" 
<gluster-devel@gluster.org>
Sent: Tuesday, January 26, 2016 2:09:43 PM
Subject: Re: [Gluster-devel] distributed files/directories and [cm]time updates

Hi Joseph,

If we want to have all in physical storage at all times, gluster will be 
slow. We only need to be posix compliant, and posix allows some degree 
of "inconsistency" here. i.e. we are not forced to write to physical 
storage until the user appli

Re: [Gluster-devel] distributed files/directories and [cm]time updates

2016-01-26 Thread Joe Julian
If the time is set on a file by the client, this extends the set of
machines whose clocks are critical to include the clients: before, it was
only critical to have the servers' time synced; now the clients must be
synced as well.

Just spitballing here, but what if the time was converted at the posix
layer into the difference between the current time and the file time, and
converted back somewhere in the client graph? Each server's file time
would differ from its current time by the same amount [1], so it should be
a consistent value between servers.

[1] depending on drift, but if the admin can't manage clocks, there's 
not much gluster could or should do about that.
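
A small worked example of the conversion suggested above, as a sketch only. The
function names and the numbers are made up for illustration, and the time a
reply spends in transit is ignored, which a real implementation would have to
account for.

```python
def to_age(file_time, server_now):
    """Brick side: express a timestamp as seconds before the brick's 'now'."""
    return server_now - file_time


def from_age(age, client_now):
    """Client side: rebuild an absolute timestamp on the client's clock."""
    return client_now - age


# Two bricks whose clocks disagree by 30 seconds. Both applied the same
# write at the same real instant, 40 seconds before the stat is served.
brick_a_now, brick_b_now = 1000.0, 1030.0
age_a = to_age(brick_a_now - 40.0, brick_a_now)
age_b = to_age(brick_b_now - 40.0, brick_b_now)

# The client converts either answer back using its own clock.
client_now = 5000.0
mtime_from_a = from_age(age_a, client_now)
mtime_from_b = from_age(age_b, client_now)
```

The absolute file times on the two bricks differ by 30 seconds, but the ages
are equal, so the client reconstructs the same value from either brick.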




Re: [Gluster-devel] distributed files/directories and [cm]time updates

2016-01-26 Thread Joseph Fernandes
Answer inline:


- Original Message -
From: "Xavier Hernandez" <xhernan...@datalab.es>
To: "Pranith Kumar Karampuri" <pkara...@redhat.com>, "Gluster Devel" 
<gluster-devel@gluster.org>
Sent: Tuesday, January 26, 2016 1:21:37 PM
Subject: Re: [Gluster-devel] distributed files/directories and [cm]time updates

Hi Pranith,

This solution doesn't require extra fops, so it seems quite clean to me. 
The additional I/O needed in posix could be minimized by implementing a 
metadata cache in storage/posix that would read all metadata on lookup 
and update it on disk only at regular intervals and/or on invalidation. 
All fops would read/write into the cache. This would even reduce the 
number of I/Os we are currently doing for each fop.

>>>>>>>>> JOE: The idea of a metadata cache is cool for read workloads, but
for writes we would end up doing double writes to the disk, i.e. one for the
actual write and one for the setxattr that updates the times.
IMHO we cannot keep it in a write-back cache (periodic flush to disk), because
ctime/mtime/atime data loss or inconsistency would be a problem. Your thoughts?




Re: [Gluster-devel] distributed files/directories and [cm]time updates

2016-01-26 Thread Xavier Hernandez

Hi Joseph,

On 26/01/16 09:07, Joseph Fernandes wrote:


JOE: The idea of a metadata cache is cool for read workloads, but for writes
we would end up doing double writes to the disk, i.e. one for the actual write
and one for the setxattr that updates the times.
IMHO we cannot keep it in a write-back cache (periodic flush to disk), because
ctime/mtime/atime data loss or inconsistency would be a problem. Your thoughts?


If we want to have everything in physical storage at all times, gluster
will be slow. We only need to be posix compliant, and posix allows some
degree of "inconsistency" here, i.e. we are not forced to write to
physical storage until the user application sends a flush or similar
request. Note that there are xlators that currently take advantage of
this, for example write-behind and md-cache.


Almost all file systems (if not all) rely on this to improve 
performance; otherwise they would be really slow.


Of course this could cause a temporary inconsistency between bricks, but
since all cluster xlators (dht, afr and ec) use special xattrs to track
consistency, a crash before flushing the metadata could be detected and
repaired (with additional care, even a crash while flushing metadata
could be detected).
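
To make the trade-off concrete, here is a minimal sketch of a write-back cache
for times. It is an illustration under assumptions, not the posix xlator: the
class, its methods and the dict standing in for on-disk xattrs are all
hypothetical.

```python
class WriteBackTimes:
    """Keep times in memory and persist them only on flush (sketch)."""

    def __init__(self, disk):
        self.disk = disk        # dict standing in for on-disk xattrs
        self.cache = {}         # gfid -> times held in memory
        self.dirty = set()      # gfids whose cached times are not on disk
        self.disk_writes = 0    # counts metadata writes that reached disk

    def update(self, gfid, **times):
        """Record new times in memory only; nothing is written to disk."""
        entry = self.cache.setdefault(gfid, dict(self.disk.get(gfid, {})))
        entry.update(times)
        self.dirty.add(gfid)

    def flush(self, gfid):
        """Called on an application flush/fsync or from a periodic timer."""
        if gfid in self.dirty:
            self.disk[gfid] = dict(self.cache[gfid])
            self.disk_writes += 1
            self.dirty.discard(gfid)

    def lost_on_crash(self):
        """Objects whose latest times would be lost by a crash right now."""
        return set(self.dirty)


disk = {}
wb = WriteBackTimes(disk)
for t in (1.0, 2.0, 3.0):           # three writes to the same file
    wb.update("gfid-1", mtime=t, ctime=t)
pending = wb.lost_on_crash()        # what a crash here would lose
wb.flush("gfid-1")                  # one metadata write covers all three
```

Three data writes cost a single metadata write, at the price of a window in
which a crash loses the newest times; that window is what the consistency
xattrs of the cluster xlators would have to detect.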


Xavi


Re: [Gluster-devel] distributed files/directories and [cm]time updates

2016-01-26 Thread Joseph Fernandes
Answers inline:

- Original Message -
From: "Joe Julian" <j...@julianfamily.org>
To: gluster-devel@gluster.org
Sent: Tuesday, January 26, 2016 1:45:36 PM
Subject: Re: [Gluster-devel] distributed files/directories and [cm]time updates

If the time is set on a file by the client, this extends the set of
machines whose clocks are critical to include the clients: before, it was
only critical to have the servers' time synced; now the clients must be
synced as well.

>>>>> JOE: If the time on the file is set from the client, then it becomes
difficult in the compliance case (WORM-Retention), where we refer to the
server time to decide how long we retain a file. This feature is not yet in
Gluster, but we are looking into it.

Just spitballing here, but what if the time was converted at the posix
layer into the difference between the current time and the file time, and
converted back somewhere in the client graph? Each server's file time
would differ from its current time by the same amount [1], so it should be
a consistent value between servers.


[1] depending on drift, but if the admin can't manage clocks, there's 
not much gluster could or should do about that.


Re: [Gluster-devel] distributed files/directories and [cm]time updates

2016-01-26 Thread Raghavendra Bhat
Hi Xavier,

There is a patch out for review that implements a metadata cache in the
posix layer. The changes work like this:

Whenever there is a fresh lookup on an object (file/directory/symlink),
the posix xlator saves the stat attributes of that object in its cache.
As of now, whenever there is a fop on an object, posix tries to build the
HANDLE of the object by looking into the gfid-based backend (i.e. the
.glusterfs directory) and doing a stat to check whether the gfid exists.
The patch changes posix to check its own cache first and return the
attributes if it finds them there. If not, it falls back to the actual
gfid backend.

But as of now, there is no cache invalidation. Whenever there is a
setattr() fop to change the attributes of an object, the new stat info is
saved in the cache once the fop has succeeded on disk.
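
A minimal sketch of the behaviour described above, not the code in the
patch: the class name, the gfid-keyed dictionary and the backend_stat
callback are all made up for illustration, and stat results are plain
dictionaries.

```python
class StatCache:
    """Hypothetical in-memory cache of stat attributes, keyed by gfid."""

    def __init__(self, backend_stat):
        # backend_stat(gfid) stands in for the stat on the .glusterfs handle.
        self._backend_stat = backend_stat
        self._entries = {}
        self.backend_calls = 0

    def on_lookup(self, gfid, stbuf):
        # A fresh lookup saves the object's stat attributes in the cache.
        self._entries[gfid] = stbuf

    def get(self, gfid):
        # Check the cache first; go to the gfid backend only on a miss.
        if gfid in self._entries:
            return self._entries[gfid]
        self.backend_calls += 1
        return self._backend_stat(gfid)

    def on_setattr_success(self, gfid, new_stbuf):
        # No invalidation: the entry is overwritten once setattr has
        # succeeded on disk.
        self._entries[gfid] = new_stbuf
```

With this shape, a fop on an object that has already been looked up never
reaches the backend; only objects missing from the cache pay for the stat.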

The patch can be found here: http://review.gluster.org/#/c/12157/

Regards,
Raghavendra

On Tue, Jan 26, 2016 at 2:51 AM, Xavier Hernandez wrote:

> Hi Pranith,
>
> On 26/01/16 03:47, Pranith Kumar Karampuri wrote:
>
>> hi,
>>    Traditionally gluster has used the ctime/mtime of the files/dirs
>> on the bricks as stat output. The problem we are seeing with this
>> approach is that software which depends on these times gets confused
>> when they differ between bricks. Tar especially gives "file changed as
>> we read it" whenever it detects ctime differences when stat is served
>> from different bricks. The way we have been trying to solve it is to
>> serve the stat structures from the same brick in afr, and the max time
>> in dht. But that doesn't avoid the problem completely. Because there is
>> no way to change ctime at the moment (lutimes() only allows mtime and
>> atime), there is little we can do to make sure ctimes match after
>> self-heals/xattr updates/rebalance. I am wondering if any of you have
>> solved these problems before; if yes, how did you go about it? It seems
>> like applications which depend on this for backups get confused the
>> same way. The only way out I see is to bring ctime into an xattr, but
>> that will need more iops and gluster has to keep updating it on quite a
>> few fops.
>>
>
> I did think about this when I was first writing ec. The idea was that
> the point in time at which each fop is executed would be controlled by
> the client, by adding a special xattr to each regular fop. Of course this
> would require support inside the storage/posix xlator. At that time, adding
> the needed support to other xlators seemed too complex to me, so I decided
> to do something similar to afr.
>
> Anyway, the idea was like this: when a write fop needs to be sent,
> dht/afr/ec sets the current time in a special xattr, for example
> 'glusterfs.time'. It can be done in such a way that if the time is already
> set by a higher xlator, it's not modified. This way DHT could set the time
> in fops involving multiple afr subvolumes. For other fops, it would be afr
> that sets the time. It could also be set directly by the topmost xlator
> (fuse), but that time could be incorrect because lower xlators could delay
> the fop execution and reorder it. This would need more thinking.
>
> That xattr will be received by storage/posix. This xlator will determine
> what times need to be modified and will change them. In the case of a
> write, it can decide to modify mtime and, maybe, atime. For a mkdir or
> create, it will set the times of the new file/directory and also the mtime
> of the parent directory. It depends on the specific fop being processed.
>
> mtime, atime and ctime (or even others) could be saved in a special posix
> xattr instead of relying on the file system attributes that cannot be
> modified (at least for ctime).
>
> This solution doesn't require extra fops, so it seems quite clean to me.
> The additional I/O needed in posix could be minimized by implementing a
> metadata cache in storage/posix that would read all metadata on lookup and
> update it on disk only at regular intervals and/or on invalidation. All
> fops would read/write into the cache. This would even reduce the number of
> I/O operations we are currently doing for each fop.
>
> Xavi
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>

Re: [Gluster-devel] distributed files/directories and [cm]time updates

2016-01-26 Thread Jeff Darcy
> If the time is set on a file by the client, this increases the critical
> complexity to include the clients: before, it was only critical to have
> the servers' time synced; now the clients should be synced as well.

With any kind of server-side replication, the times could be generated by
the first server (instead of the client) and propagated to the others.
That leaves only the striping/sharding case, which I see Shyam has already
touched on.
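
A rough sketch of that server-side variant, under the assumption that the
first replica reads its clock once and the others simply record the value
they are handed. The Replica class and the replicated_write function are
hypothetical names, not gluster code.

```python
import time


class Replica:
    """Hypothetical brick that records the time of the last write per path."""

    def __init__(self, name):
        self.name = name
        self.times = {}

    def apply_write(self, path, stamp):
        # A replica stores the stamp it is given and never reads its own clock.
        self.times[path] = stamp


def replicated_write(replicas, path, clock=time.time):
    """Generate the time once, on the first server, and propagate it."""
    stamp = clock()  # stands in for the first server's clock
    for replica in replicas:
        replica.apply_write(path, stamp)
    return stamp
```

Since only one clock is read per fop, client clocks no longer matter and
the replicas cannot disagree with each other on that fop's time.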


Re: [Gluster-devel] distributed files/directories and [cm]time updates

2016-01-25 Thread Xavier Hernandez

Hi Pranith,

On 26/01/16 03:47, Pranith Kumar Karampuri wrote:

hi,
   Traditionally gluster has used the ctime/mtime of the files/dirs on
the bricks as stat output. The problem we are seeing with this approach
is that software which depends on these times gets confused when they
differ between bricks. Tar especially gives "file changed as we read it"
whenever it detects ctime differences when stat is served from different
bricks. The way we have been trying to solve it is to serve the stat
structures from the same brick in afr, and the max time in dht. But that
doesn't avoid the problem completely. Because there is no way to change
ctime at the moment (lutimes() only allows mtime and atime), there is
little we can do to make sure ctimes match after self-heals/xattr
updates/rebalance. I am wondering if any of you have solved these
problems before; if yes, how did you go about it? It seems like
applications which depend on this for backups get confused the same way.
The only way out I see is to bring ctime into an xattr, but that will
need more iops and gluster has to keep updating it on quite a few fops.
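
The two workarounds mentioned above (same brick in afr, max time in dht)
can be pictured roughly as follows. This is only an illustration with
made-up function names and stat results as plain dictionaries; it is not
how the xlators are actually written.

```python
TIME_FIELDS = ("atime", "mtime", "ctime")


def afr_stat(replica_stats, read_child=0):
    """AFR-style: always answer from the same replica."""
    return replica_stats[read_child]


def dht_dir_stat(subvol_stats):
    """DHT-style: merge stats from all subvolumes, keeping the latest time."""
    return {f: max(s[f] for s in subvol_stats) for f in TIME_FIELDS}
```

Neither helps once the brick that answers changes (for example after a
self-heal), because ctime on the new brick cannot be set to match.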


I did think about this when I was first writing ec. The idea was that
the point in time at which each fop is executed would be controlled by
the client, by adding a special xattr to each regular fop. Of course
this would require support inside the storage/posix xlator. At that
time, adding the needed support to other xlators seemed too complex to
me, so I decided to do something similar to afr.


Anyway, the idea was like this: when a write fop needs to be sent,
dht/afr/ec sets the current time in a special xattr, for example
'glusterfs.time'. It can be done in such a way that if the time is
already set by a higher xlator, it's not modified. This way DHT could
set the time in fops involving multiple afr subvolumes. For other fops,
it would be afr that sets the time. It could also be set directly by the
topmost xlator (fuse), but that time could be incorrect because lower
xlators could delay the fop execution and reorder it. This would need
more thinking.
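
A small sketch of the "set only if not already set" rule, assuming the
time travels in a per-fop dictionary. Only the key name 'glusterfs.time'
comes from the text above; the function names and the fan-out counts are
invented for the example.

```python
import time

TIME_KEY = "glusterfs.time"  # example key name from the message above


def stamp_fop(xdata, clock=time.time):
    """Set the fop time unless a higher xlator has already set it."""
    if TIME_KEY not in xdata:
        xdata[TIME_KEY] = clock()
    return xdata


def afr_fop(xdata, clock=time.time, replicas=2):
    # afr stamps the fop (a no-op if dht already did) and sends the same
    # xdata to every replica.
    stamp_fop(xdata, clock)
    return [dict(xdata) for _ in range(replicas)]


def dht_fop_all_subvols(xdata, clock=time.time, subvols=2, replicas=2):
    # For fops that go to several afr subvolumes, dht stamps first so that
    # every subvolume sees one and the same time.
    stamp_fop(xdata, clock)
    return [afr_fop(dict(xdata), clock, replicas) for _ in range(subvols)]
```

The clock is read exactly once per fop, at the highest xlator that fans
the fop out, so all bricks involved receive the same value.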


That xattr will be received by storage/posix. This xlator will determine 
what times need to be modified and will change them. In the case of a 
write, it can decide to modify mtime and, maybe, atime. For a mkdir or 
create, it will set the times of the new file/directory and also the 
mtime of the parent directory. It depends on the specific fop being 
processed.
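
One way to picture the per-fop decision is a small table. The table below
is hypothetical: the message only says mtime (and maybe atime) for a
write, and the new object's times plus the parent's mtime for
mkdir/create. The ctime entries are an assumption added here, following
the usual POSIX behaviour.

```python
# Hypothetical rules: which times a fop sets on its target and on the parent.
FOP_TIMES = {
    "write":  {"target": ("mtime", "ctime"), "parent": ()},
    "create": {"target": ("atime", "mtime", "ctime"),
               "parent": ("mtime", "ctime")},
    "mkdir":  {"target": ("atime", "mtime", "ctime"),
               "parent": ("mtime", "ctime")},
}


def apply_fop_time(inodes, fop, path, parent, stamp):
    """Apply the time received with the fop to the stored metadata."""
    rule = FOP_TIMES[fop]
    for field in rule["target"]:
        inodes.setdefault(path, {})[field] = stamp
    for field in rule["parent"]:
        inodes.setdefault(parent, {})[field] = stamp
```

Here inodes is just a dictionary from path to its stored times, and stamp
is the value that arrived in the xattr.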


mtime, atime and ctime (or even others) could be saved in a special 
posix xattr instead of relying on the file system attributes that cannot 
be modified (at least for ctime).
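
As an illustration of what such an xattr value could look like, here is a
sketch that packs the three times into one fixed-size blob. The layout
(three pairs of seconds and nanoseconds, network byte order) is an
assumption made for the example, not an existing gluster format.

```python
import struct

# Hypothetical layout: (sec, nsec) for atime, mtime and ctime, big-endian.
_TIMES_FMT = "!6q"


def pack_times(atime, mtime, ctime):
    """Encode three (sec, nsec) pairs into a single xattr value."""
    return struct.pack(_TIMES_FMT, *atime, *mtime, *ctime)


def unpack_times(blob):
    """Decode a value produced by pack_times()."""
    v = struct.unpack(_TIMES_FMT, blob)
    return {"atime": v[0:2], "mtime": v[2:4], "ctime": v[4:6]}
```

Keeping all three times in one value means a single xattr read or write
per object, and the stored ctime can be set to any value during
self-heal or rebalance.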


This solution doesn't require extra fops, so it seems quite clean to me.
The additional I/O needed in posix could be minimized by implementing a
metadata cache in storage/posix that would read all metadata on lookup
and update it on disk only at regular intervals and/or on invalidation.
All fops would read/write into the cache. This would even reduce the
number of I/O operations we are currently doing for each fop.
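
A sketch of such a write-back cache, under simplifying assumptions: load
and store are hypothetical callbacks standing in for the on-disk
metadata, and the caller passes the current time in, so no timer thread
is needed.

```python
class MetadataCache:
    """Hypothetical write-back cache: read on lookup, flush periodically."""

    def __init__(self, load, store, flush_interval):
        self._load = load            # load(gfid) -> dict read from disk
        self._store = store          # store(gfid, dict) writes to disk
        self._interval = flush_interval
        self._entries = {}
        self._dirty = set()
        self._last_flush = 0.0

    def lookup(self, gfid):
        # Read all metadata once, on the first lookup.
        if gfid not in self._entries:
            self._entries[gfid] = self._load(gfid)
        return self._entries[gfid]

    def update(self, gfid, now, **times):
        # Fops only touch the cache; disk is written at regular intervals.
        self.lookup(gfid).update(times)
        self._dirty.add(gfid)
        if now - self._last_flush >= self._interval:
            self.flush(now)

    def flush(self, now):
        for gfid in sorted(self._dirty):
            self._store(gfid, dict(self._entries[gfid]))
        self._dirty.clear()
        self._last_flush = now

    def invalidate(self, gfid):
        # Write back pending changes before dropping the entry.
        if gfid in self._dirty:
            self._store(gfid, dict(self._entries[gfid]))
            self._dirty.discard(gfid)
        self._entries.pop(gfid, None)
```

Several updates to the same object inside one interval collapse into a
single write, which is where the I/O saving would come from. The price is
that times updated since the last flush are lost if the brick crashes.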


Xavi