inline...

----- Original Message -----
> From: "Rudra Siva" <rudrasiv...@gmail.com>
> To: gluster-devel@gluster.org
> Sent: Saturday, November 1, 2014 10:20:41 AM
> Subject: [Gluster-devel] Feature help
> 
> Hi,
> 
> I'm very interested in helping with this feature by way of development
> help, testing and/or benchmarking.
> 

I have a parallel-libgfapi benchmark that could be modified to fit the new API
and used to test its performance.

https://github.com/bengland2/parallel-libgfapi

> Features/Feature Smallfile Perf
> 
> One of the things I was looking into was the possibility of adding a few
> API calls to libgfapi to help allow reading and writing multiple small
> files as objects - just as librados does for ceph - cutting out FUSE
> and other semantics that tend to be overheads for really small files.
> I don't know what else I will have to add for libgfapi to support
> this.
> 

libgfapi is a good place to prototype: it's easy to extend by adding to the
existing calls.  But that alone won't help performance as much as you might
want unless the Gluster protocol can also change to allow combining several
separate FOPs, such as LOOKUP, OPEN, READ and RELEASE on the read path and
LOOKUP, CREATE, WRITE and RELEASE on the create path.  That's the hard part
IMHO.  I suggest using Wireshark to watch Gluster small-file creates and then
trying to understand what each FOP is doing and why it is there.
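
For example, something like this captures the traffic for later inspection
(the interface name and brick port here are assumptions; check "gluster
volume status" for the real brick port on your setup):

    # Capture Gluster RPC traffic to one brick; interface and port are
    # illustrative, not defaults you can rely on.
    tshark -i eth0 -f "tcp port 49152" -w smallfile-create.pcap

Recent Wireshark versions include a GlusterFS dissector, so the capture shows
the per-file LOOKUP, CREATE, WRITE and FLUSH/RELEASE round trips directly.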

Suggestions for protocol enhancement:

Can we allow CREATE to piggyback write data if it's under 128 KB or whatever
the RPC size limit is, and optionally do a RELEASE after the WRITE?  Or just
create a new FOP that does all of that?  Can we also specify xattrs that the
application might want to set at create time?  For example, SMB
security-related xattrs or Swift object metadata.
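
To make that concrete, here is a purely hypothetical sketch (not an existing
Gluster structure; the field names are made up) of what such a compound
create request might carry:

    /* Hypothetical sketch only -- no such structure exists in Gluster
     * today.  A compound "create + setxattr + write [+ release]"
     * request would need roughly these fields. */
    struct compound_create_req {
            const char *name;        /* basename within the parent dir     */
            int         flags;       /* open(2)-style flags                */
            mode_t      mode;        /* create mode                        */
            dict_t     *xattrs;      /* xattrs to set at create time       */
            const void *data;        /* initial write payload              */
            size_t      size;        /* payload length, <= RPC size limit  */
            int         release_fd;  /* nonzero: release fd after write    */
    };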

Can we do something like we did for sequential writes with eager-lock, and
allow the Gluster client to hang on to the directory lock for a little while
so that we don't have to continually reacquire it if we are going to keep
creating files in that directory?

Second, if we already hold a write lock on the directory, we shouldn't have
to do a LOOKUP followed by a CREATE; we can just do the CREATE directly.

Finally, Swift and other apps use the hack of calling rename() after close()
so that they can create a file atomically.  If we had an API for creating
files atomically, these apps would not be forced into the expensive rename
operation.
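
For reference, the pattern those apps use today looks roughly like this with
libgfapi (a minimal sketch, assuming an initialized glfs_t *fs handle and
omitting error handling; the temporary file name is illustrative):

    /* Write-to-temp-then-rename: the "atomic publish" hack referred to
     * above.  The extra RENAME round trip is what a native
     * atomic-create API could eliminate. */
    glfs_fd_t *fd = glfs_creat(fs, ".obj.tmp", O_WRONLY | O_EXCL, 0644);
    glfs_write(fd, buf, len, 0);
    glfs_close(fd);
    glfs_rename(fs, ".obj.tmp", "obj");  /* readers never see a partial file */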

Can we do these things in an incremental way, so that we can steadily improve
performance over time without massive disruption to the code base?

Perhaps the Gluster FUSE mount could learn to do something like this as well,
with a special mount option that defers the actual create at the server until
any one of these three conditions occurs:

- 100 msec have passed, or
- the file has been closed, or
- at least N KB of data has been written (i.e. an RPC's worth)

This is a bit like Nagle's algorithm in TCP, which lets TCP aggregate more
data into segments before it actually transmits them.  It technically violates
POSIX and creates some semantic issues (how do you tell the user that the file
already exists, for example?), but frankly the POSIX filesystem interface is an
anachronism; we need to bend it a little to get what we need, and NFS already
does.  This might not be appropriate for all apps, but there are quite a few
cases, such as initial data ingest, where it would be a very reasonable thing
to do.
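
In client code the deferral test itself would be tiny.  A hypothetical sketch
(the names are made up; the thresholds are just the three conditions above):

    /* Hypothetical: per-file state kept on the client for a deferred create. */
    struct deferred_create {
            uint64_t create_time_ms;  /* when the app called create()  */
            size_t   buffered_bytes;  /* data written so far           */
            int      closed;          /* app has already closed the fd */
    };

    /* Actually send the create (plus buffered data) when any of the
     * three conditions listed above holds. */
    static int
    should_flush(const struct deferred_create *dc, uint64_t now_ms,
                 size_t rpc_limit)
    {
            return (now_ms - dc->create_time_ms >= 100) ||
                   dc->closed ||
                   (dc->buffered_bytes >= rpc_limit);
    }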


> The following is what I was thinking - please feel free to correct me
> or guide me if someone has already done some ground work on this.
> 
> For read, multiple objects can be provided and they should be
> separated for read from appropriate brick based on the DHT flag - this
> will help avoid multiple lookups from all servers. In the absence of
> DHT they would be sent to all but only the ones that contain the
> object respond (it's more like a multiple file lookup request).
> 

I think batching creates for multiple files is very ambitious, and it greatly
complicates the API.  Let's just get to the point where we can create a
Gluster file and write its data in the same libgfapi call, and have that work
efficiently in the Gluster RPC interface -- that alone would be a huge win.
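
At the libgfapi level that could be a single new entry point.  A hypothetical
signature (the name is made up; no such call exists today):

    /* Hypothetical: create a file, set optional xattrs, write its full
     * contents and close it in one call, so the client can emit one
     * compound request per small file. */
    int glfs_creat_write(glfs_t *fs, const char *path, mode_t mode,
                         dict_t *xattrs, const void *buf, size_t len);

The parallel-libgfapi benchmark above could then be pointed at this call to
measure the difference against the existing create/write/close sequence.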

> For write, same as the case of read, complete object writes (no
> partial updates, file offsets etc.)
> 
> For delete, most of the lookup and batching logic remains the same.
>

Delete is not the highest-priority thing here.  Creates are the worst
performers, so we should probably focus on creates.  Someday it would be nice
to be able to express to the file system the thought "delete this directory
tree" or "delete all files within this directory", since Gluster could then
make that a parallel operation, hence scalable.

> I can help with testing, documentation or benchmarks if someone has
> already done some work.
> 
> -Siva
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> 
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
