responses inline ...

On Tue, Nov 4, 2014 at 8:00 AM, Ben England <[email protected]> wrote:
> inline...
>
>>
>
> I have a parallel-libgfapi benchmark that could be modified to fit the new 
> API, and could test performance of it.
>
> https://github.com/bengland2/parallel-libgfapi
>

I'm not familiar with it, but I will look into it and play with it to
see how to use it.

>>
>
> libgfapi is a good place to prototype; it's easy to change libgfapi by adding 
> to the existing calls. But this won't help performance as much as you might 
> want unless the Gluster protocol can somehow change to allow combining 
> several separate FOPs, such as LOOKUP, OPEN, READ and RELEASE, or LOOKUP, 
> CREATE, WRITE and RELEASE.  That's the hard part IMHO.  I suggest using 
> wireshark to watch Gluster small-file creates, and then trying to understand 
> what each FOP is doing and why it is there.
>

I tried some wireshark captures with a few bricks and looked at the
calls. I tried to work out an alternate interface while staying away
from the standard calls; my feeling was that for an atomic read or
write, the overhead of several separate calls could be collapsed into
one.

> suggestions for protocol enhancement:
>
> Can we allow CREATE to piggyback write data if it's under 128 KB or whatever 
> the RPC size limit is, and optionally do a RELEASE after the WRITE?  Or just 
> create a new FOP that does that?  Can we also specify xattrs that the 
> application might want to set at create time?  For example, SMB 
> security-related xattrs, or Swift metadata.
>
Does it make sense to take existing FOPs and batch them for
submission then? Is that even possible?

I was thinking of a new FOP that could pack multiple atomic
operations (primarily because the existing operations take a
well-established path in the code, and adding complexity to that path
may disturb it for no good reason).
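
For illustration only, the entry point I have in mind might look
something like this; glfs_creat_write() is hypothetical, not an
existing libgfapi call:

    #include <sys/types.h>
    #include <glusterfs/api/glfs.h>

    /* HYPOTHETICAL -- not an existing libgfapi call.  One call, ideally
     * one FOP / round trip: create the file, set an optional xattr,
     * write the whole payload, and release the handle server-side.
     * Returns 0 on success, -1 with errno set on failure. */
    int glfs_creat_write (glfs_t *fs, const char *path, mode_t mode,
                          const void *buf, size_t count,
                          const char *xattr_name,   /* may be NULL */
                          const void *xattr_val, size_t xattr_len);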

> Can we do something like we did for sequential writes with eager-lock, and 
> allow the Gluster client to hang on to a directory lock for a little while so 
> that we don't have to continually reacquire the lock if we are going to keep 
> creating files in it?
>
> Second, if we already have a write lock on the directory, we shouldn't have 
> to do LOOKUP then CREATE; just do CREATE directly.
>

If the request is a single atomic operation, does this still hold?

> Finally, Swift and other apps use the hack of calling rename() after close() 
> so that they can create a file atomically.  If we had an API for creating 
> files atomically, then these apps would not be forced into using the 
> expensive rename operation.
>
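The pattern being described looks like this in plain POSIX calls (a
sketch of the Swift-style temp-file-plus-rename idiom, not Swift's
actual code):

    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Write to a temp name, then rename() so the final name appears
     * fully written or not at all.  The rename is the expensive extra
     * step that an atomic-create API would eliminate. */
    int atomic_create (const char *path, const void *buf, size_t len)
    {
        char tmp[PATH_MAX];
        snprintf (tmp, sizeof (tmp), "%s.tmp", path);

        int fd = open (tmp, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (fd < 0)
            return -1;
        if (write (fd, buf, len) != (ssize_t) len || fsync (fd) != 0) {
            close (fd);
            unlink (tmp);
            return -1;
        }
        if (close (fd) != 0) {
            unlink (tmp);
            return -1;
        }
        return rename (tmp, path);   /* the costly extra operation */
    }
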
> Can we do these things in an incremental way so that we can steadily improve 
> performance over time without massive disruption to code base?
>
> Perhaps the GlusterFS FUSE mount could learn to do something like that as 
> well, with a special mount option that would allow the actual create at the 
> server to be deferred until any one of these 3 conditions occurred:
>
> - 100 msec had passed, or
> - the file was closed, or
> - at least N KB of data was written (i.e. an RPC's worth)
>
>
> This is a bit like Nagle's algorithm in TCP, which allows TCP to aggregate 
> more data into segments before it actually transmits them.  It technically 
> violates POSIX and creates some semantic issues (how do you tell the user 
> that the file already exists, for example?), but frankly the fs interface in 
> POSIX is an anachronism; we need to bend it a little to get what we need, and 
> NFS already does.  This might not be appropriate for all apps, but there 
> might be quite a few cases, like initial data ingest, where this would be a 
> very reasonable thing to do.
>
>
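If it helps to pin the deferral idea down, the flush test could be as
simple as this (a sketch of the three conditions above; nothing like
this exists in the FUSE client today, and the constants are made up):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define DEFER_MAX_USEC  (100 * 1000)    /* 100 msec */
    #define DEFER_MAX_BYTES (128 * 1024)    /* roughly one RPC's worth */

    struct pending_create {
        uint64_t opened_usec;   /* when the create was buffered */
        size_t   buffered;      /* bytes written so far */
        bool     closed;        /* application already called close() */
    };

    /* Flush the buffered create+write once any one condition is met. */
    static bool should_flush (const struct pending_create *p, uint64_t now_usec)
    {
        return p->closed
            || p->buffered >= DEFER_MAX_BYTES
            || (now_usec - p->opened_usec) >= DEFER_MAX_USEC;
    }
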
>> The following is what I was thinking - please feel free to correct me
>> or guide me if someone has already done some ground work on this.
>>
>> For read, multiple objects could be provided in one request and routed
>> to the appropriate brick based on DHT; this would help avoid lookups on
>> all servers. In the absence of DHT the request would be sent to all
>> bricks, but only the ones that contain the object would respond (it's
>> more like a multiple-file lookup request).
>>
>
> I think it is very ambitious to batch creates for multiple files, and this 
> greatly complicates the API.   Let's just get to a point where we can create 
> a Gluster file and write the data for it in the same libgfapi call and have 
> that work efficiently in the Gluster RPC interface -- this would be a huge 
> win.
>

Agreed.
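
With something like the hypothetical glfs_creat_write() sketched
earlier, a small-file write would then be one call (and ideally one
round trip), e.g.:

    /* Hypothetical usage -- glfs_creat_write() does not exist today;
     * fs is an initialized glfs_t * as in the earlier example. */
    const char *payload = "small file body";
    if (glfs_creat_write (fs, "/ingest/file1", 0644,
                          payload, strlen (payload),
                          NULL, NULL, 0) != 0)
        fprintf (stderr, "create+write failed\n");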

>> For write, same as the read case: complete object writes only (no
>> partial updates, no file offsets, etc.)
>>
>> For delete, most of the lookup and batching logic remains the same.
>>
>
> Delete is not the highest priority thing here.  Creates are the worst 
> performers, so we probably should focus on creates.  Someday it would be nice 
> to be able to express to the file system the thought "delete this directory 
> tree" or "delete all files within this directory", since Gluster could then 
> make that a parallel operation, hence scalable.
>
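If that ever happens, the libgfapi surface for it could be as small as
a single call (hypothetical, just to record the shape of the idea):

    /* HYPOTHETICAL -- not an existing call.  Ask the servers to remove
     * an entire directory tree, fanning the unlinks out across bricks
     * in parallel instead of the client unlinking one file at a time. */
    int glfs_rmtree (glfs_t *fs, const char *path);
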
-Siva
_______________________________________________
Gluster-devel mailing list
[email protected]
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
