Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-04 Thread Ira Cooper
Pranith Kumar Karampuri  writes:

> On 02/03/2016 07:54 PM, Jeff Darcy wrote:
>>> Problem is with workloads which know the files that need to be read
>>> without readdir, like hyperlinks (webserver), swift objects etc. These
>>> are two I know of which will have this problem, which can't be improved
>>> because we don't have metadata, data co-located. I have been trying to
>>> think of a solution for past few days. Nothing good is coming up :-/
>> In those cases, caching (at the MDS) would certainly help a lot.  Some
>> variation of the compounding infrastructure under development for Samba
>> etc. might also apply, since this really is a compound operation.
> Even with compound fops it will still require two sequential network
> operations from dht2: one to the MDC and one to the DC. So I don't think it helps.

You can do better.

You control the MDC.

The MDC goes ahead and forwards the request, under the client's GUID.

That'll cut half an RTT.
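
To make the saving concrete, here is a rough latency sketch (illustrative Python only; the half-RTT-per-hop model is an assumption, as is the DC replying straight to the client rather than back through the MDC):

# Rough latency model (illustrative only): each one-way hop between two
# nodes costs half a round trip (0.5 RTT).

def path_cost(path):
    """Cost of a message travelling along `path`, in RTT units."""
    return 0.5 * (len(path) - 1)

# dht2 as described: the client does a lookup to the MDC, waits, then a
# read to the DC -- two full round trips.
sequential = (path_cost(["client", "MDC", "client"])
              + path_cost(["client", "DC", "client"]))       # 2.0 RTT

# Ira's suggestion: the MDC forwards the read to the DC under the client's
# GUID, and (assumed here) the DC replies straight to the client.
forwarded = path_cost(["client", "MDC", "DC", "client"])      # 1.5 RTT

print(sequential - forwarded)                                 # 0.5 RTT saved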

Cheers,

-Ira


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-04 Thread Jeff Darcy
> Even with compound fops it will still require two sequential network
> operations from dht2: one to the MDC and one to the DC. So I don't think it helps.

There are still two hops, but making it a compound op keeps the
server-to-server communication in the compounding translator (which
should already be able to handle that case) instead of having to put it
in the MDS.  It's not so much a matter of improving performance -
turning a client/server hop into server/server or server/cache should do
that - but of implementing that improvement in a clean way.


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-04 Thread Venky Shankar
On Thu, Feb 04, 2016 at 11:34:04AM +0530, Shyam wrote:
> On 02/04/2016 09:38 AM, Vijay Bellur wrote:
> >On 02/03/2016 11:34 AM, Venky Shankar wrote:
> >>On Wed, Feb 03, 2016 at 09:24:06AM -0500, Jeff Darcy wrote:
> Problem is with workloads which know the files that need to be read
> without readdir, like hyperlinks (webserver), swift objects etc. These
> are two I know of which will have this problem, which can't be improved
> because we don't have metadata, data co-located. I have been trying to
> think of a solution for past few days. Nothing good is coming up :-/
> >>>
> >>>In those cases, caching (at the MDS) would certainly help a lot.  Some
> >>>variation of the compounding infrastructure under development for Samba
> >>>etc. might also apply, since this really is a compound operation.
> 
> Compounding in this case can help, but still, without the cache, the read has
> to go to the DS, and with such a compounding the MDS would reach out to the
> DS for the information rather than the client. Another possibility exists based
> on what we decide as the cache mechanism.
> 
> >>
> >>When a client is done modifying a file, MDS would refresh its size,
> >>mtime
> >>attributes by fetching it from the DS. As part of this refresh, DS could
> >>additionally send back the content if the file size falls in range, with
> >>MDS persisting it, sending it back for subsequent lookup calls as it does
> >>now. The content (on MDS) can be zapped once the file size crosses the
> >>defined limit.
> 
> Venky, when you say persisting, I assume on disk, is that right?

Definitely on-disk.

> 
> If so, then the MDS storage size requirements would increase (based on the
> amount of file data that needs to be stored). As of now it is only inodes,
> and as we move to a DB, a record. In this case we may have *fatter* MDS
> partitions. Any comments/thoughts on that?

The MDS storage requirement does go up considerably, since there would
normally be far fewer MDS nodes than DS nodes. So, yes, the MDS does become
fat, but it's important to have data inline with its inode to boost small
file performance (at least when the file is not under modification).

> 
> As with memory I would assume some form of eviction of data from MDS, to
> control the space utilization here as a possibility.

Maybe. Using a TTL in a key-value store might be an option, but, IIRC, TTLs
can be set only for an entire record and not for parts of a record. We'd need
to think more about this anyway.

> 
> >>
> >
> >I like the idea. However the memory implications of maintaining content
> >in MDS are something to watch out for. quick-read is interested in files
> >of size 64k by default and with a reasonable number of files in that
> >range, we might end up consuming significant memory with this scheme.
> 
> Vijay, I think what Venky states is to stash the file on the local storage
> and not in memory. If it was in memory then brick process restarts would
> nuke the cache, and either we need mechanisms to rebuild/warm the cache or
> just start caching afresh.
> 
> If we were caching in memory, then yes the concern is valid, and one
> possibility is  some form of LRU for the same, to keep memory consumption in
> check.

As stated earlier, it's a persistent cache which may or may not have a layer
of in-memory cache itself. I would leave all that to the key-value DB (when
we use one) as it most probably would be doing that.

> 
> Overall I would steer away from memory for this use case, and use the disk,
> as we do not know which files to cache (well in either case, but disk offers
> us more space to possibly punt on that issue). For files where the cache is
> missing and the file is small enough, either perform async read from the
> client (gaining some overlap time with the app) or just let it be, as we
> would get the open/read anyway, but would slow things down.

Yes: async reads for files which are missing inline data with the inode but
still satisfy the size-range requirement.

> 
> >
> >-Vijay


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-04 Thread Pranith Kumar Karampuri



On 02/03/2016 07:54 PM, Jeff Darcy wrote:

Problem is with workloads which know the files that need to be read
without readdir, like hyperlinks (webserver), swift objects etc. These
are two I know of which will have this problem, which can't be improved
because we don't have metadata, data co-located. I have been trying to
think of a solution for past few days. Nothing good is coming up :-/

In those cases, caching (at the MDS) would certainly help a lot.  Some
variation of the compounding infrastructure under development for Samba
etc. might also apply, since this really is a compound operation.
Even with compound fops it will still require two sequential network
operations from dht2: one to the MDC and one to the DC. So I don't think it helps.


Pranith


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-03 Thread Shyam

On 02/04/2016 09:38 AM, Vijay Bellur wrote:

On 02/03/2016 11:34 AM, Venky Shankar wrote:

On Wed, Feb 03, 2016 at 09:24:06AM -0500, Jeff Darcy wrote:

Problem is with workloads which know the files that need to be read
without readdir, like hyperlinks (webserver), swift objects etc. These
are two I know of which will have this problem, which can't be improved
because we don't have metadata, data co-located. I have been trying to
think of a solution for past few days. Nothing good is coming up :-/


In those cases, caching (at the MDS) would certainly help a lot.  Some
variation of the compounding infrastructure under development for Samba
etc. might also apply, since this really is a compound operation.


Compounding in this case can help, but still, without the cache, the read 
has to go to the DS, and with such a compounding the MDS would reach out 
to the DS for the information rather than the client. Another possibility 
exists based on what we decide as the cache mechanism.




When a client is done modifying a file, MDS would refresh its size,
mtime
attributes by fetching it from the DS. As part of this refresh, DS could
additionally send back the content if the file size falls in range, with
MDS persisting it, sending it back for subsequent lookup calls as it does
now. The content (on MDS) can be zapped once the file size crosses the
defined limit.


Venky, when you say persisting, I assume on disk, is that right?

If so, then the MDS storage size requirements would increase (based on 
the amount of file data that needs to be stored). As of now it is only 
inodes, and as we move to a DB, a record. In this case we may have 
*fatter* MDS partitions. Any comments/thoughts on that?


As with memory I would assume some form of eviction of data from MDS, to 
control the space utilization here as a possibility.






I like the idea. However the memory implications of maintaining content
in MDS are something to watch out for. quick-read is interested in files
of size 64k by default and with a reasonable number of files in that
range, we might end up consuming significant memory with this scheme.


Vijay, I think what Venky states is to stash the file on the local 
storage and not in memory. If it was in memory then brick process 
restarts would nuke the cache, and either we need mechanisms to 
rebuild/warm the cache or just start caching afresh.


If we were caching in memory, then yes the concern is valid, and one 
possibility is  some form of LRU for the same, to keep memory 
consumption in check.


Overall I would steer away from memory for this use case, and use the 
disk, as we do not know which files to cache (well in either case, but 
disk offers us more space to possibly punt on that issue). For files 
where the cache is missing and the file is small enough, either perform 
async read from the client (gaining some overlap time with the app) or 
just let it be, as we would get the open/read anyway, but would slow 
things down.




-Vijay


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-03 Thread Vijay Bellur

On 02/03/2016 11:34 AM, Venky Shankar wrote:

On Wed, Feb 03, 2016 at 09:24:06AM -0500, Jeff Darcy wrote:

Problem is with workloads which know the files that need to be read
without readdir, like hyperlinks (webserver), swift objects etc. These
are two I know of which will have this problem, which can't be improved
because we don't have metadata, data co-located. I have been trying to
think of a solution for past few days. Nothing good is coming up :-/


In those cases, caching (at the MDS) would certainly help a lot.  Some
variation of the compounding infrastructure under development for Samba
etc. might also apply, since this really is a compound operation.


When a client is done modifying a file, MDS would refresh its size, mtime
attributes by fetching it from the DS. As part of this refresh, DS could
additionally send back the content if the file size falls in range, with
MDS persisting it, sending it back for subsequent lookup calls as it does
now. The content (on MDS) can be zapped once the file size crosses the
defined limit.



I like the idea. However the memory implications of maintaining content 
in MDS are something to watch out for. quick-read is interested in files 
of size 64k by default and with a reasonable number of files in that 
range, we might end up consuming significant memory with this scheme.


-Vijay


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-03 Thread Venky Shankar
On Wed, Feb 03, 2016 at 09:24:06AM -0500, Jeff Darcy wrote:
> > Problem is with workloads which know the files that need to be read
> > without readdir, like hyperlinks (webserver), swift objects etc. These
> > are two I know of which will have this problem, which can't be improved
> > because we don't have metadata, data co-located. I have been trying to
> > think of a solution for past few days. Nothing good is coming up :-/
> 
> In those cases, caching (at the MDS) would certainly help a lot.  Some
> variation of the compounding infrastructure under development for Samba
> etc. might also apply, since this really is a compound operation.

When a client is done modifying a file, MDS would refresh its size, mtime
attributes by fetching it from the DS. As part of this refresh, DS could
additionally send back the content if the file size falls in range, with
MDS persisting it, sending it back for subsequent lookup calls as it does
now. The content (on MDS) can be zapped once the file size crosses the
defined limit.

But when there are open file descriptors on an inode (O_RDWR || O_WRONLY
on a file), the size cannot be trusted (as the MDS only knows the updated
size after the last close), which would be the degraded case.
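
A minimal sketch of the scheme (plain Python with a dict standing in for the MDS record store; none of the names below are GlusterFS APIs), including the degraded open-writers case:

INLINE_LIMIT = 64 * 1024      # illustrative; quick-read's default interest is 64k

mds_records = {}              # gfid -> {"size": ..., "mtime": ..., "inline": ...}

def refresh_from_ds(gfid, size, mtime, content=None, open_writers=0):
    """Hypothetical MDS-side refresh, run when the DS reports attributes
    (normally after the last writer closes the file)."""
    rec = mds_records.setdefault(gfid, {})
    rec["size"], rec["mtime"] = size, mtime
    if open_writers:
        # Degraded case: size/content can't be trusted until the last close.
        rec.pop("inline", None)
    elif content is not None and size <= INLINE_LIMIT:
        rec["inline"] = content            # persist the data with the inode
    else:
        rec.pop("inline", None)            # zap once it crosses the limit

def lookup(gfid):
    """Lookup serves the inline data when present, so no READ hits the DS."""
    rec = mds_records.get(gfid, {})
    return rec.get("size"), rec.get("mtime"), rec.get("inline")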

Thanks,

Venky


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-03 Thread Shyam

On 02/03/2016 07:54 PM, Jeff Darcy wrote:

Problem is with workloads which know the files that need to be read
without readdir, like hyperlinks (webserver), swift objects etc. These
are two I know of which will have this problem, which can't be improved
because we don't have metadata, data co-located. I have been trying to
think of a solution for past few days. Nothing good is coming up :-/


In those cases, caching (at the MDS) would certainly help a lot.  Some
variation of the compounding infrastructure under development for Samba
etc. might also apply, since this really is a compound operation.



The above is certainly an option; I need to process it a bit more to 
respond sanely.


Another option is to generate the GFID for a file with parGFID+basename as 
input (which is something Pranith brought up a few mails back in this 
chain). There was concern that we would have GFID clashes, but further 
reasoning suggests that we would not. An example follows:


Good cases:
- /D1/File is created, with the top 2 bytes of the file's GFID as the bucket 
(same as the D1 bucket), and the rest of the GFID as some UUID generated 
from pGFID (GFID of D1) + basename
- When this file is looked up by name, its GFID can be generated at the 
client side as a hint, and the same fan-out of a lookup to the MDS and a 
read to the DS can be initiated
* The READ data is valid only when the lookup agrees on the same GFID for 
the file


Bad cases:
- On a rename, the GFID of the file does not change, and so if /D1/File 
was renamed to /D2/File1, then a subsequent lookup could fail to 
prefetch the read, as the GFID hint generated is now based on the GFID of 
D2 and the new name File1
- If, post rename, /D1/File is created again, the GFID 
generated/requested by the client for this file would clash with the 
already generated GFID, hence the DHT server would decide to return a 
new GFID that has no relation to the one generated by the hint, again 
resulting in the hint failing


So with the above scheme, as long as files are not renamed, the hint 
serves its purpose of prefetching even with just the name and parGFID.
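
As a concrete illustration of the hint (a minimal Python sketch following the 2-byte-bucket plus derived-remainder layout described above; uuid.uuid5 is the RFC 4122 name-based generation, and send_lookup/send_read are hypothetical stand-ins for the MDS/DS calls, not real APIs):

import uuid

def gfid_hint(pgfid: uuid.UUID, name: str) -> uuid.UUID:
    """Client-side hint: keep the parent's 2-byte bucket and derive the
    remaining bytes from a name-based UUID of pGFID + basename."""
    derived = uuid.uuid5(pgfid, name)                  # RFC 4122, section 4.3
    return uuid.UUID(bytes=pgfid.bytes[:2] + derived.bytes[2:])

def hinted_lookup(pgfid, name, send_lookup, send_read):
    """Fan out the lookup (to the MDS) and a speculative read (to the DS);
    the read is valid only when the lookup agrees on the hinted GFID."""
    hint = gfid_hint(pgfid, name)
    data = send_read(hint)                   # speculative, may be wasted
    real_gfid, stat = send_lookup(pgfid, name)
    if real_gfid != hint:                    # renamed or recreated file
        data = None                          # discard the prefetched read
    return stat, data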


One gotcha is that I see a pattern with applications that create a tmp 
file and then rename it to the real file name (sort of a swap file that 
is renamed to the real file as needed). For all such applications the 
hints above would fail.


I believe Swift also uses a similar trick on the FS, renaming an object 
once it is considered fully written. Another case would be the compile 
workload. So overall the above scheme could work to alleviate the problem 
somewhat, but may cause harm in other cases (where the GFID hint is 
incorrect and we end up sending a read for no reason).


The above could easily be prototyped with DHT2 to see its benefits, so 
we will try that out at some point in the future.


Shyam


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-03 Thread Jeff Darcy
> Problem is with workloads which know the files that need to be read
> without readdir, like hyperlinks (webserver), swift objects etc. These
> are two I know of which will have this problem, which can't be improved
> because we don't have metadata, data co-located. I have been trying to
> think of a solution for past few days. Nothing good is coming up :-/

In those cases, caching (at the MDS) would certainly help a lot.  Some
variation of the compounding infrastructure under development for Samba
etc. might also apply, since this really is a compound operation.


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-03 Thread Pranith Kumar Karampuri



The file data would be located based on its GFID, so before the *first*
lookup/stat for a file, there is no way to know its GFID.
NOTE: Instead of a name hash the GFID hash is used, to get immunity
against renames and the like, as a name hash could change the location
information for the file (among other reasons).


Another manner of achieving the same when the GFID of the file is 
known (from a readdir) is to wind the lookup and read of size to the 
respective MDS and DS, where the lookup would be responded to once the 
MDS responds, and the DS response is cached for the subsequent 
open+read case. So on the wire we would have a fan out of 2 FOPs, but 
still satisfy the quick read requirements.


Tar kind of workload doesn't have a problem because we know the gfid 
after readdirp.




I would assume the above resolves the problem posted, are there cases 
where we do not know the GFID of the file? i.e no readdir performed 
and client knows the file name that it wants to operate on? Do we have 
traces of the webserver workload to see if it generates names on the 
fly or does a readdir prior to that?


The problem is with workloads which know the files that need to be read 
without a readdir, like hyperlinks (webserver), swift objects etc. These 
are the two I know of which will have this problem, and they can't be 
improved because we don't have metadata and data co-located. I have been 
trying to think of a solution for the past few days. Nothing good is coming up :-/


Pranith


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-03 Thread Shyam

On 02/03/2016 09:20 AM, Shyam wrote:

On 02/02/2016 06:22 PM, Jeff Darcy wrote:

   Background: Quick-read + open-behind xlators are developed to
help
in small file workload reads like apache webserver, tar etc to get the
data of the file in lookup FOP itself. What happens is, when a lookup
FOP is executed, GF_CONTENT_KEY is added in xdata with max-length and
posix xlator reads the file and fills the data in xdata response if this
key is present as long as the file-size is less than max-length given in
the xdata. So when we do a tar of something like a kernel tree with
small files, if we look at profile of the bricks all we see are lookups.
OPEN + READ fops will not be sent at all over the network.

   With dht2, because data is present on a different cluster, we can't
get the data in lookup. Shyam was telling me that opens are also sent to
metadata cluster. That will make perf in this usecase back to where it
was before introducing these two features i.e. 1/3 of current perf
(Lookup vs lookup+open+read)


This is interesting thanks for the heads up.



Is "1/3 of current perf" based on actual measurements?  My understanding
was that the translators in question exist to send requests *in parallel*
with the original lookup stream.  That means it might be 3x the messages,
but it will only be 1/3 the performance if the network is saturated.
Also, the lookup is not guaranteed to be only one message.  It might be
as many as N (the number of bricks), so by the reasoning above the
performance would only drop to N/(N+2).  I think the real situation is a
bit more complicated - and less dire - than you suggest.


I suggest that we send some fop at the
time of open to data cluster and change quick-read to cache this data on
open (if not already) then we can reduce the perf hit to 1/2 of current
perf, i.e. lookup+open.


At first glance, it seems pretty simple to do something like this, and
pretty obvious that we should.  The tricky question is: where should we
send that other op, before lookup has told us where the partition
containing that file is?  If there's some reasonable guess we can make,
then sending an open+read in parallel with the lookup will be helpful.
If not, then it will probably be a waste of time and network resources.
Shyam, is enough of this information being cached *on the clients* to
make this effective?



The file data would be located based on its GFID, so before the *first*
lookup/stat for a file, there is no way to know its GFID.
NOTE: Instead of a name hash the GFID hash is used, to get immunity
against renames and the like, as a name hash could change the location
information for the file (among other reasons).


Another manner of achieving the same when the GFID of the file is known 
(from a readdir) is to wind the lookup and read of size to the 
respective MDS and DS, where the lookup would be responded to once the 
MDS responds, and the DS response is cached for the subsequent open+read 
case. So on the wire we would have a fan out of 2 FOPs, but still 
satisfy the quick read requirements.


I would assume the above resolves the problem posted. Are there cases 
where we do not know the GFID of the file, i.e. no readdir performed and 
the client knows the file name that it wants to operate on? Do we have 
traces of the webserver workload to see if it generates names on the fly 
or does a readdir prior to that?




The open+read can be done as a single FOP,
   - open for a read only case can do access checking on the client to
allow the FOP to proceed to the DS without hitting the MDS for an open
token

The client side cache is important from this and other such
perspectives. It should also leverage upcall infra to keep the cache
loosely coherent.

One thing to note here would be, for the client to do a lookup (where
the file name should be known beforehand), either a readdir/(p) has to
have happened, or the client knows the name already (say application
generated names). For the former (readdir case), there is enough
information on the client to not need a lookup, but rather just do the
open+read on the DS. For the latter the first lookup cannot be avoided,
degrading this to a lookup+(open+read).

Some further tricks can be done to do readdir prefetching on such
workloads, as the MDS runs on a DB (eventually), piggybacking more
entries than requested on a lookup. I would possibly leave that for
later, based on performance numbers in the small file area.

Shyam



Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-02 Thread Pranith Kumar Karampuri



On 02/03/2016 11:49 AM, Pranith Kumar Karampuri wrote:



On 02/03/2016 09:20 AM, Shyam wrote:

On 02/02/2016 06:22 PM, Jeff Darcy wrote:
   Background: Quick-read + open-behind xlators are developed 
to help

in small file workload reads like apache webserver, tar etc to get the
data of the file in lookup FOP itself. What happens is, when a lookup
FOP is executed, GF_CONTENT_KEY is added in xdata with max-length and
posix xlator reads the file and fills the data in xdata response if 
this
key is present as long as the file-size is less than max-length 
given in

the xdata. So when we do a tar of something like a kernel tree with
small files, if we look at profile of the bricks all we see are 
lookups.

OPEN + READ fops will not be sent at all over the network.

   With dht2, because data is present on a different cluster, we can't
get the data in lookup. Shyam was telling me that opens are also sent to

metadata cluster. That will make perf in this usecase back to where it
was before introducing these two features i.e. 1/3 of current perf
(Lookup vs lookup+open+read)


This is interesting thanks for the heads up.



Is "1/3 of current perf" based on actual measurements?  My 
understanding
was that the translators in question exist to send requests *in 
parallel*
with the original lookup stream.  That means it might be 3x the 
messages,

but it will only be 1/3 the performance if the network is saturated.
Also, the lookup is not guaranteed to be only one message.  It might be
as many as N (the number of bricks), so by the reasoning above the
performance would only drop to N/(N+2).  I think the real situation is a
bit more complicated - and less dire - than you suggest.


I suggest that we send some fop at the
time of open to data cluster and change quick-read to cache this 
data on
open (if not already) then we can reduce the perf hit to 1/2 of 
current

perf, i.e. lookup+open.


At first glance, it seems pretty simple to do something like this, and
pretty obvious that we should.  The tricky question is: where should we
send that other op, before lookup has told us where the partition
containing that file is?  If there's some reasonable guess we can make,
then sending an open+read in parallel with the lookup will be helpful.
If not, then it will probably be a waste of time and network resources.
Shyam, is enough of this information being cached *on the clients* to
make this effective?



The file data would be located based on its GFID, so before the 
*first* lookup/stat for a file, there is no way to know its GFID.
NOTE: Instead of a name hash the GFID hash is used, to get immunity 
against renames and the like, as a name hash could change the 
location information for the file (among other reasons).


The open+read can be done as a single FOP,
  - open for a read only case can do access checking on the client to 
allow the FOP to proceed to the DS without hitting the MDS for an 
open token


The client side cache is important from this and other such 
perspectives. It should also leverage upcall infra to keep the cache 
loosely coherent.


One thing to note here would be, for the client to do a lookup (where 
the file name should be known beforehand), either a readdir/(p) has 
to have happened, or the client knows the name already (say 
application generated names). For the former (readdir case), there is 
enough information on the client to not need a lookup, but rather 
just do the open+read on the DS. For the latter the first lookup 
cannot be avoided, degrading this to a lookup+(open+read).


Some further tricks can be done to do readdir prefetching on such 
workloads, as the MDS runs on a DB (eventually), piggybacking more 
entries than requested on a lookup. I would possibly leave that for 
later, based on performance numbers in the small file area.


I strongly suggest that we don't postpone this to later as I think 
this is a solved problem. http://www.ietf.org/rfc/rfc4122.txt section 
4.3 may be of help here. i.e. create UUID based on string, namespace. 
So we can use pgfid as namespace and filename as string. I understand 
that we will get into 2 hops if the file is renamed, but it is the 
best we can do right now. We can take help from crypto team in Redhat 
to make sure we do the right thing. If we get this implementation in 
dht2 after the code is released all the files created with old 
gfid-generation will work with half the possible perf.

Gah! ignore, it will lead to gfid collisions :-/

Pranith


Pranith


Shyam




Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-02 Thread Pranith Kumar Karampuri



On 02/03/2016 09:20 AM, Shyam wrote:

On 02/02/2016 06:22 PM, Jeff Darcy wrote:
   Background: Quick-read + open-behind xlators are developed to 
help

in small file workload reads like apache webserver, tar etc to get the
data of the file in lookup FOP itself. What happens is, when a lookup
FOP is executed, GF_CONTENT_KEY is added in xdata with max-length and
posix xlator reads the file and fills the data in xdata response if 
this
key is present as long as the file-size is less than max-length 
given in

the xdata. So when we do a tar of something like a kernel tree with
small files, if we look at profile of the bricks all we see are 
lookups.

OPEN + READ fops will not be sent at all over the network.

   With dht2, because data is present on a different cluster, we can't
get the data in lookup. Shyam was telling me that opens are also sent to

metadata cluster. That will make perf in this usecase back to where it
was before introducing these two features i.e. 1/3 of current perf
(Lookup vs lookup+open+read)


This is interesting thanks for the heads up.



Is "1/3 of current perf" based on actual measurements?  My understanding
was that the translators in question exist to send requests *in 
parallel*
with the original lookup stream.  That means it might be 3x the 
messages,

but it will only be 1/3 the performance if the network is saturated.
Also, the lookup is not guaranteed to be only one message.  It might be
as many as N (the number of bricks), so by the reasoning above the
performance would only drop to N/(N+2).  I think the real situation is a
bit more complicated - and less dire - than you suggest.


I suggest that we send some fop at the
time of open to data cluster and change quick-read to cache this 
data on

open (if not already) then we can reduce the perf hit to 1/2 of current
perf, i.e. lookup+open.


At first glance, it seems pretty simple to do something like this, and
pretty obvious that we should.  The tricky question is: where should we
send that other op, before lookup has told us where the partition
containing that file is?  If there's some reasonable guess we can make,
then sending an open+read in parallel with the lookup will be helpful.
If not, then it will probably be a waste of time and network resources.
Shyam, is enough of this information being cached *on the clients* to
make this effective?



The file data would be located based on its GFID, so before the 
*first* lookup/stat for a file, there is no way to know its GFID.
NOTE: Instead of a name hash the GFID hash is used, to get immunity 
against renames and the like, as a name hash could change the location 
information for the file (among other reasons).


The open+read can be done as a single FOP,
  - open for a read only case can do access checking on the client to 
allow the FOP to proceed to the DS without hitting the MDS for an open 
token


The client side cache is important from this and other such 
perspectives. It should also leverage upcall infra to keep the cache 
loosely coherent.


One thing to note here would be, for the client to do a lookup (where 
the file name should be known beforehand), either a readdir/(p) has 
to have happened, or the client knows the name already (say 
application generated names). For the former (readdir case), there is 
enough information on the client to not need a lookup, but rather just 
do the open+read on the DS. For the latter the first lookup cannot be 
avoided, degrading this to a lookup+(open+read).


Some further tricks can be done to do readdir prefetching on such 
workloads, as the MDS runs on a DB (eventually), piggybacking more 
entries than requested on a lookup. I would possibly leave that for 
later, based on performance numbers in the small file area.


I strongly suggest that we don't postpone this, as I think this is a 
solved problem. http://www.ietf.org/rfc/rfc4122.txt section 4.3 may 
be of help here, i.e. create a UUID based on a string and a namespace. So 
we can use pgfid as the namespace and the filename as the string. I 
understand that we will get into 2 hops if the file is renamed, but it is 
the best we can do right now. We can take help from the crypto team in 
Red Hat to make sure we do the right thing. If we get this implementation 
into dht2 after the code is released, all the files created with the old 
gfid-generation will work with half the possible perf.
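
For reference, the RFC 4122 section 4.3 name-based generation mentioned above is what Python exposes as uuid.uuid5; a tiny, purely illustrative example (the pgfid value is made up) with pgfid as the namespace and the filename as the string:

import uuid

# Made-up parent GFID used as the RFC 4122 namespace; the filename is the
# "name" (string) input.
pgfid = uuid.UUID("9e7a6c1e-3b2f-4d58-9f0c-2a4e6b8d1c3a")
gfid = uuid.uuid5(pgfid, "index.html")

# Deterministic: the same (pgfid, filename) pair always yields the same
# UUID, which is what would let a client guess a GFID without a lookup ...
assert gfid == uuid.uuid5(pgfid, "index.html")

# ... and is also why recreating /D1/File after the old file was renamed
# away would ask for a GFID that already exists -- the collision concern
# raised in the follow-up message.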


Pranith


Shyam




Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-02 Thread Raghavendra Gowdappa


- Original Message -
> From: "Pranith Kumar Karampuri" 
> To: "Jeff Darcy" 
> Cc: "Gluster Devel" 
> Sent: Tuesday, February 2, 2016 7:52:25 PM
> Subject: Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with
> small files
> 
> 
> 
> On 02/02/2016 06:22 PM, Jeff Darcy wrote:
> >>Background: Quick-read + open-behind xlators are developed to help
> >> in small file workload reads like apache webserver, tar etc to get the
> >> data of the file in lookup FOP itself. What happens is, when a lookup
> >> FOP is executed, GF_CONTENT_KEY is added in xdata with max-length and
> >> posix xlator reads the file and fills the data in xdata response if this
> >> key is present as long as the file-size is less than max-length given in
> >> the xdata. So when we do a tar of something like a kernel tree with
> >> small files, if we look at profile of the bricks all we see are lookups.
> >> OPEN + READ fops will not be sent at all over the network.
> >>
> >>With dht2, because data is present on a different cluster, we can't
> >> get the data in lookup. Shyam was telling me that opens are also sent to
> >> metadata cluster. That will make perf in this usecase back to where it
> >> was before introducing these two features i.e. 1/3 of current perf
> >> (Lookup vs lookup+open+read)
> > Is "1/3 of current perf" based on actual measurements?  My understanding
> > was that the translators in question exist to send requests *in parallel*
> > with the original lookup stream.  That means it might be 3x the messages,
> > but it will only be 1/3 the performance if the network is saturated.
> > Also, the lookup is not guaranteed to be only one message.  It might be
> > as many as N (the number of bricks), so by the reasoning above the
> > performance would only drop to N/(N+2).  I think the real situation is a
> > bit more complicated - and less dire - than you suggest.
> 
> As per what I heard, when quick read (Now divided as open-behind and
> quick-read) was introduced webserver use case users reported 300% to
> 400% perf improvement.

I second that. I've also heard of similar improvements for webserver use cases 
(quick-read was first written with apache as the use case). I tried looking for 
any previous data on this, but unfortunately couldn't find any. Nevertheless, 
we can do some performance benchmarking ourselves.

> We should definitely test it once we have enough code to do so. I am
> just giving a heads up.
> 
> Having said that, for 'tar' I think we can most probably do a better job
> in dht2 because even after readdirp a nameless lookup comes. If it has
> GF_CONTENT_KEY we should send it to data cluster directly. For webserver
> usecase I don't have any ideas.
> 
> At least on my laptop this is what I saw, on a setup with different
> client, server machines, situation could be worse. This is distribute
> volume with one brick.
> 
> root@localhost - /mnt/d1
> 19:42:52 :) ⚡ time tar cf a.tgz a
> 
> real    0m6.987s
> user    0m0.089s
> sys     0m0.481s
> 
> root@localhost - /mnt/d1
> 19:43:22 :) ⚡ cd
> 
> root@localhost - ~
> 19:43:25 :) ⚡ umount /mnt/d1
> 
> root@localhost - ~
> 19:43:27 :) ⚡ gluster volume set d1 open-behind off
> volume set: success
> 
> root@localhost - ~
> 19:43:47 :) ⚡ gluster volume set d1 quick-read off
> volume set: success
> 
> root@localhost - ~
> 19:44:03 :( ⚡ gluster volume stop d1
> Stopping volume will make its data inaccessible. Do you want to
> continue? (y/n) y
> volume stop: d1: success
> 
> root@localhost - ~
> 19:44:09 :) ⚡ gluster volume start d1
> volume start: d1: success
> 
> root@localhost - ~
> 19:44:13 :) ⚡ mount -t glusterfs localhost.localdomain:/d1 /mnt/d1
> 
> root@localhost - ~
> 19:44:29 :) ⚡ cd /mnt/d1
> 
> root@localhost - /mnt/d1
> 19:44:30 :) ⚡ time tar cf b.tgz a
> 
> real    0m12.176s
> user    0m0.098s
> sys     0m0.582s
> 
> Pranith
> >
> >> I suggest that we send some fop at the
> >> time of open to data cluster and change quick-read to cache this data on
> >> open (if not already) then we can reduce the perf hit to 1/2 of current
> >> perf, i.e. lookup+open.
> > At first glance, it seems pretty simple to do something like this, and
> > pretty obvious that we should.  The tricky question is: where should we
> > send that other op, before lookup has told us where the partition
> > containing that file is?  If there's some reasonable guess we can make,
> > then sending an open+read in parallel with the lookup will be helpful.
> > If not, then it will probably be a waste of time and network resources.
> > Shyam, is enough of this information being cached *on the clients* to
> > make this effective?
> Pranith

Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-02 Thread Shyam

On 02/02/2016 06:22 PM, Jeff Darcy wrote:

   Background: Quick-read + open-behind xlators are developed to help
in small file workload reads like apache webserver, tar etc to get the
data of the file in lookup FOP itself. What happens is, when a lookup
FOP is executed, GF_CONTENT_KEY is added in xdata with max-length and
posix xlator reads the file and fills the data in xdata response if this
key is present as long as the file-size is less than max-length given in
the xdata. So when we do a tar of something like a kernel tree with
small files, if we look at profile of the bricks all we see are lookups.
OPEN + READ fops will not be sent at all over the network.

   With dht2, because data is present on a different cluster, we can't
get the data in lookup. Shyam was telling me that opens are also sent to
metadata cluster. That will make perf in this usecase back to where it
was before introducing these two features i.e. 1/3 of current perf
(Lookup vs lookup+open+read)


This is interesting thanks for the heads up.



Is "1/3 of current perf" based on actual measurements?  My understanding
was that the translators in question exist to send requests *in parallel*
with the original lookup stream.  That means it might be 3x the messages,
but it will only be 1/3 the performance if the network is saturated.
Also, the lookup is not guaranteed to be only one message.  It might be
as many as N (the number of bricks), so by the reasoning above the
performance would only drop to N/(N+2).  I think the real situation is a
bit more complicated - and less dire - than you suggest.


I suggest that we send some fop at the
time of open to data cluster and change quick-read to cache this data on
open (if not already) then we can reduce the perf hit to 1/2 of current
perf, i.e. lookup+open.


At first glance, it seems pretty simple to do something like this, and
pretty obvious that we should.  The tricky question is: where should we
send that other op, before lookup has told us where the partition
containing that file is?  If there's some reasonable guess we can make,
then sending an open+read in parallel with the lookup will be helpful.
If not, then it will probably be a waste of time and network resources.
Shyam, is enough of this information being cached *on the clients* to
make this effective?



The file data would be located based on its GFID, so before the *first* 
lookup/stat for a file, there is no way to know its GFID.
NOTE: Instead of a name hash the GFID hash is used, to get immunity 
against renames and the like, as a name hash could change the location 
information for the file (among other reasons).


The open+read can be done as a single FOP,
  - open for a read only case can do access checking on the client to 
allow the FOP to proceed to the DS without hitting the MDS for an open token


The client side cache is important from this and other such 
perspectives. It should also leverage upcall infra to keep the cache 
loosely coherent.


One thing to note here would be, for the client to do a lookup (where 
the file name should be known beforehand), either a readdir/(p) has to 
have happened, or the client knows the name already (say application 
generated names). For the former (readdir case), there is enough 
information on the client to not need a lookup, but rather just do the 
open+read on the DS. For the latter the first lookup cannot be avoided, 
degrading this to a lookup+(open+read).
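
A small sketch of that client-side decision (illustrative Python; mds and ds are stand-ins for the metadata and data clusters, and the dict-shaped entry is an assumption, not a real GlusterFS structure):

def small_file_read(entry, mds, ds):
    """Illustrative client flow; `entry` carries a gfid when it came from a
    readdir/(p), and `mds`/`ds` stand in for the two clusters."""
    if entry.get("gfid"):
        # readdir case: enough information is cached to skip the lookup and
        # send a fused open+read straight to the DS (one network operation).
        return ds.open_read(entry["gfid"])
    # Application-generated name: the first lookup cannot be avoided, which
    # degrades this to lookup + (open+read).
    gfid = mds.lookup(entry["pgfid"], entry["name"])
    return ds.open_read(gfid)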


Some further tricks can be done to do readdir prefetching on such 
workloads, as the MDS runs on a DB (eventually), piggybacking more 
entries than requested on a lookup. I would possibly leave that for 
later, based on performance numbers in the small file area.
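
If that prefetching is picked up later, the piggybacking could look roughly like this (purely hypothetical sketch; db, db.get and db.scan are imagined stand-ins for the DB-backed MDS store):

def lookup_piggyback(db, pgfid, name, extra=16):
    """Imagined DB-backed MDS lookup that returns the requested entry plus
    up to `extra` sibling entries of the same directory, so a tar-like scan
    gets several lookups' worth of metadata in one response."""
    entry = db.get((pgfid, name))
    siblings = [rec for key, rec in db.scan(prefix=pgfid)
                if key != (pgfid, name)][:extra]
    return {"entry": entry, "prefetched": siblings}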


Shyam


Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-02 Thread Pranith Kumar Karampuri



On 02/02/2016 06:22 PM, Jeff Darcy wrote:

   Background: Quick-read + open-behind xlators are developed to help
in small file workload reads like apache webserver, tar etc to get the
data of the file in lookup FOP itself. What happens is, when a lookup
FOP is executed, GF_CONTENT_KEY is added in xdata with max-length and
posix xlator reads the file and fills the data in xdata response if this
key is present as long as the file-size is less than max-length given in
the xdata. So when we do a tar of something like a kernel tree with
small files, if we look at profile of the bricks all we see are lookups.
OPEN + READ fops will not be sent at all over the network.

   With dht2, because data is present on a different cluster, we can't
get the data in lookup. Shyam was telling me that opens are also sent to
metadata cluster. That will make perf in this usecase back to where it
was before introducing these two features i.e. 1/3 of current perf
(Lookup vs lookup+open+read)

Is "1/3 of current perf" based on actual measurements?  My understanding
was that the translators in question exist to send requests *in parallel*
with the original lookup stream.  That means it might be 3x the messages,
but it will only be 1/3 the performance if the network is saturated.
Also, the lookup is not guaranteed to be only one message.  It might be
as many as N (the number of bricks), so by the reasoning above the
performance would only drop to N/(N+2).  I think the real situation is a
bit more complicated - and less dire - than you suggest.


As per what I heard, when quick-read (now divided into open-behind and 
quick-read) was introduced, webserver use-case users reported 300% to 
400% perf improvements.
We should definitely test it once we have enough code to do so. I am 
just giving a heads up.


Having said that, for 'tar' I think we can most probably do a better job 
in dht2, because even after readdirp a nameless lookup comes. If it has 
GF_CONTENT_KEY we should send it to the data cluster directly. For the 
webserver use case I don't have any ideas.
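
For the tar case, that routing decision could be as small as the sketch below (illustrative Python stand-in for a dht2 lookup path; the GF_CONTENT_KEY string value and the mds/ds objects are assumptions, not real APIs):

GF_CONTENT_KEY = "glusterfs.content"    # key string assumed for illustration

def route_lookup(loc, xdata, mds, ds):
    """Illustrative dht2 routing: a nameless (GFID-only) lookup that also
    asks for inline content can go straight to the data cluster."""
    nameless = loc.get("name") is None and loc.get("gfid") is not None
    if nameless and GF_CONTENT_KEY in xdata:
        return ds.lookup(loc["gfid"], xdata)    # tar-after-readdirp case
    return mds.lookup(loc, xdata)               # default: metadata cluster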


At least on my laptop this is what I saw; on a setup with separate client 
and server machines the situation could be worse. This is a distribute 
volume with one brick.


root@localhost - /mnt/d1
19:42:52 :) ⚡ time tar cf a.tgz a

real    0m6.987s
user    0m0.089s
sys     0m0.481s

root@localhost - /mnt/d1
19:43:22 :) ⚡ cd

root@localhost - ~
19:43:25 :) ⚡ umount /mnt/d1

root@localhost - ~
19:43:27 :) ⚡ gluster volume set d1 open-behind off
volume set: success

root@localhost - ~
19:43:47 :) ⚡ gluster volume set d1 quick-read off
volume set: success

root@localhost - ~
19:44:03 :( ⚡ gluster volume stop d1
Stopping volume will make its data inaccessible. Do you want to 
continue? (y/n) y

volume stop: d1: success

root@localhost - ~
19:44:09 :) ⚡ gluster volume start d1
volume start: d1: success

root@localhost - ~
19:44:13 :) ⚡ mount -t glusterfs localhost.localdomain:/d1 /mnt/d1

root@localhost - ~
19:44:29 :) ⚡ cd /mnt/d1

root@localhost - /mnt/d1
19:44:30 :) ⚡ time tar cf b.tgz a

real    0m12.176s
user    0m0.098s
sys     0m0.582s

Pranith



I suggest that we send some fop at the
time of open to data cluster and change quick-read to cache this data on
open (if not already) then we can reduce the perf hit to 1/2 of current
perf, i.e. lookup+open.

At first glance, it seems pretty simple to do something like this, and
pretty obvious that we should.  The tricky question is: where should we
send that other op, before lookup has told us where the partition
containing that file is?  If there's some reasonable guess we can make,
then sending an open+read in parallel with the lookup will be helpful.
If not, then it will probably be a waste of time and network resources.
Shyam, is enough of this information being cached *on the clients* to
make this effective?

Pranith

Re: [Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-02 Thread Jeff Darcy
>   Background: Quick-read + open-behind xlators are developed to help
> in small file workload reads like apache webserver, tar etc to get the
> data of the file in lookup FOP itself. What happens is, when a lookup
> FOP is executed, GF_CONTENT_KEY is added in xdata with max-length and
> posix xlator reads the file and fills the data in xdata response if this
> key is present as long as the file-size is less than max-length given in
> the xdata. So when we do a tar of something like a kernel tree with
> small files, if we look at profile of the bricks all we see are lookups.
> OPEN + READ fops will not be sent at all over the network.
> 
>   With dht2, because data is present on a different cluster, we can't
> get the data in lookup. Shyam was telling me that opens are also sent to
> metadata cluster. That will make perf in this usecase back to where it
> was before introducing these two features i.e. 1/3 of current perf
> (Lookup vs lookup+open+read)

Is "1/3 of current perf" based on actual measurements?  My understanding
was that the translators in question exist to send requests *in parallel*
with the original lookup stream.  That means it might be 3x the messages,
but it will only be 1/3 the performance if the network is saturated.
Also, the lookup is not guaranteed to be only one message.  It might be
as many as N (the number of bricks), so by the reasoning above the
> performance would only drop to N/(N+2).  I think the real situation is a
bit more complicated - and less dire - than you suggest.

> I suggest that we send some fop at the
> time of open to data cluster and change quick-read to cache this data on
> open (if not already) then we can reduce the perf hit to 1/2 of current
> perf, i.e. lookup+open.

At first glance, it seems pretty simple to do something like this, and
pretty obvious that we should.  The tricky question is: where should we
send that other op, before lookup has told us where the partition
containing that file is?  If there's some reasonable guess we can make,
> then sending an open+read in parallel with the lookup will be helpful.
If not, then it will probably be a waste of time and network resources.
Shyam, is enough of this information being cached *on the clients* to
make this effective?


[Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

2016-02-01 Thread Pranith Kumar Karampuri

hi,
 Background: Quick-read + open-behind xlators were developed to help 
small-file-workload reads (apache webserver, tar, etc.) get the data of 
the file in the lookup FOP itself. What happens is: when a lookup FOP is 
executed, GF_CONTENT_KEY is added in xdata with a max-length, and the 
posix xlator reads the file and fills the data into the xdata response if 
this key is present, as long as the file size is less than the max-length 
given in the xdata. So when we do a tar of something like a kernel tree 
with small files, if we look at the profile of the bricks all we see are 
lookups; OPEN + READ fops will not be sent over the network at all.
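
As a rough stand-in for the brick-side behaviour described above (Python for illustration, not the actual C xlator code; the key string and the dict-based xdata are assumptions):

import os

GF_CONTENT_KEY = "glusterfs.content"    # key string assumed for illustration

def posix_lookup(path, xdata_req):
    """Stand-in for the brick-side lookup: stat the file and, when the
    caller asked for content and the file is small enough, return the whole
    file inline so no OPEN/READ is needed later."""
    st = os.stat(path)
    xdata_rsp = {}
    max_len = xdata_req.get(GF_CONTENT_KEY)
    if max_len is not None and st.st_size <= max_len:
        with open(path, "rb") as f:
            xdata_rsp[GF_CONTENT_KEY] = f.read()
    return st, xdata_rsp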


 With dht2, because data is present on a different cluster, we can't 
get the data in lookup. Shyam was telling me that opens are also sent to 
the metadata cluster. That will take perf in this use case back to where 
it was before introducing these two features, i.e. 1/3 of current perf 
(lookup vs. lookup+open+read). I suggest that we send some fop at open 
time to the data cluster and change quick-read to cache this data on 
open (if it does not already); then we can reduce the perf hit to 1/2 of 
current perf, i.e. lookup+open.


 Sorry if this was already discussed and I didn't pay attention.

Pranith