responses inline ...

On Tue, Nov 4, 2014 at 8:00 AM, Ben England <[email protected]> wrote:
> inline...
>
> I have a parallel-libgfapi benchmark that could be modified to fit the new
> API, and could test performance of it.
>
> https://github.com/bengland2/parallel-libgfapi
>

I am not familiar with it, but I will look into it and play with it to see
how to use it.
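For reference, the per-file sequence an application goes through with libgfapi
today looks roughly like the sketch below (volume name, server name and error
handling are placeholders only). Each call maps to one or more FOPs on the
wire, which is the round-trip cost being discussed in this thread:

    #include <fcntl.h>
    #include <glusterfs/api/glfs.h>   /* libgfapi */

    int main(void)
    {
        /* "testvol" and "gluster-server" are placeholder names. */
        glfs_t *fs = glfs_new("testvol");
        if (!fs)
            return 1;
        glfs_set_volfile_server(fs, "tcp", "gluster-server", 24007);
        if (glfs_init(fs) != 0)
            return 1;

        const char buf[] = "small payload";

        /* Create, write and close are three separate libgfapi calls, and
         * on the wire they expand into LOOKUP, CREATE, WRITE and
         * FLUSH/RELEASE round trips. */
        glfs_fd_t *fd = glfs_creat(fs, "/smallfile", O_WRONLY, 0644);
        if (fd) {
            glfs_write(fd, buf, sizeof(buf) - 1, 0);
            glfs_close(fd);
        }

        glfs_fini(fs);
        return 0;
    }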
> libgfapi is a good place to prototype; it's easy to change libgfapi by adding
> to the existing calls, but this won't help performance as much as you might
> want unless the Gluster protocol can somehow change to allow combination of
> several separate FOPS such as LOOKUP, OPEN, READ and RELEASE, and LOOKUP,
> CREATE, WRITE and RELEASE. That's the hard part IMHO. I suggest using
> wireshark to watch Gluster small-file creates, and then try to understand
> what each FOP is doing and why it is there.
>

I did some wireshark captures with a few bricks and looked at the calls. I
tried to work out the alternate interface while staying away from the standard
calls; my feeling was that, for an atomic read or write, the overhead of the
multiple calls could be combined into one.

> suggestions for protocol enhancement:
>
> Can we allow CREATE to piggyback write data if it's under 128 KB or whatever
> the RPC size limit is, and optionally do a RELEASE after the WRITE? Or just
> create a new FOP that does that? Can we also specify xattrs that the
> application might want to set at create time? Example: SMB security-related
> xattrs, Swift metadata.
>

Does it make sense to take existing FOPs and batch them for submission, then?
Is that even possible? I was thinking of a new FOP that could pack multiple
atomic operations (primarily because the existing operations take a
well-established path in the code, and adding complexity there may disturb it
for no good reason).

> Can we do something like we did for sequential writes with eager-lock, and
> allow the Gluster client to hang on to a directory lock for a little while
> so that we don't have to continually reacquire the lock if we are going to
> keep creating files in it?
>
> Second, if we already have a write lock on the directory, we shouldn't have
> to do LOOKUP then CREATE, just do CREATE directly.
>

If the request is a single atomic operation, does this still hold?

> Finally, Swift and other apps use the hack of a rename() call after close()
> so that they can create a file atomically; if we had an API for creating
> files atomically then these apps would not be forced into using the
> expensive rename operation.
>
> Can we do these things in an incremental way so that we can steadily improve
> performance over time without massive disruption to the code base?
>
> Perhaps the Glusterfs FUSE mount could learn to do something like that as
> well, with a special mount option that would allow the actual create at the
> server to be deferred until any one of these 3 conditions occurred:
>
> - 100 msec had passed, or
> - the file was closed, or
> - at least N KB of data was written (i.e. an RPC's worth)
>
> This is a bit like Nagle's algorithm in TCP, which allows TCP to aggregate
> more data into segments before it actually transmits them. It technically
> violates POSIX and creates some semantic issues (how do you tell the user
> that the file already exists, for example?), but frankly the fs interface in
> POSIX is an anachronism; we need to bend it a little to get what we need,
> and NFS already does. This might not be appropriate for all apps, but there
> might be quite a few cases, like initial data ingest, where this would be a
> very reasonable thing to do.
>
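If I follow the deferred-create idea, the client would buffer the create plus
any early writes and only send them once one of the three conditions above is
hit. A very rough sketch of that check (every name below is made up for
illustration; nothing like this exists in the code today):

    #include <stddef.h>
    #include <time.h>

    /* Hypothetical client-side state for a CREATE that has not yet been
     * sent to the bricks. */
    struct deferred_create {
        char           *path;
        char           *buf;         /* data written so far */
        size_t          buf_len;
        int             closed;      /* application already closed the fd? */
        struct timespec started_at;  /* when the deferral began */
    };

    #define DEFER_MAX_MSEC   100           /* condition 1: 100 msec passed */
    #define DEFER_MAX_BYTES  (128 * 1024)  /* condition 3: one RPC's worth */

    /* Returns non-zero when the buffered CREATE (plus data) should be
     * flushed to the server as one combined request. */
    static int should_flush(const struct deferred_create *dc, long elapsed_msec)
    {
        return elapsed_msec >= DEFER_MAX_MSEC ||  /* too old            */
               dc->closed                     ||  /* condition 2: close */
               dc->buf_len >= DEFER_MAX_BYTES;    /* enough data queued */
    }

The EEXIST question raised above is where this gets awkward: with the create
deferred, "file already exists" can only be reported after the fact.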
>> The following is what I was thinking - please feel free to correct me
>> or guide me if someone has already done some ground work on this.
>>
>> For read, multiple objects can be provided and they should be
>> separated for read from the appropriate brick based on the DHT flag - this
>> will help avoid multiple lookups from all servers. In the absence of
>> DHT they would be sent to all, but only the ones that contain the
>> object respond (it's more like a multiple-file lookup request).
>
> I think it is very ambitious to batch creates for multiple files, and this
> greatly complicates the API. Let's just get to a point where we can create
> a Gluster file and write the data for it in the same libgfapi call and have
> that work efficiently in the Gluster RPC interface -- this would be a huge
> win.
>

Agreed.

>> For write, same as the case of read: complete object writes (no
>> partial updates, file offsets, etc.)
>>
>> For delete, most of the lookup and batching logic remains the same.
>
> Delete is not the highest priority thing here. Creates are the worst
> performers, so we probably should focus on creates. Someday it would be nice
> to be able to express the thought to the file system "delete this directory
> tree" or "delete all files within this directory", since Gluster could then
> make that a parallel operation, hence scalable.
>
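To make the "create and write the data in the same libgfapi call" idea
concrete, this is the kind of entry point I have in mind (purely illustrative:
the name, signature and xattr arguments are invented, and nothing like this
exists in libgfapi today):

    #include <sys/types.h>
    #include <glusterfs/api/glfs.h>

    /* Hypothetical one-shot call: LOOKUP + CREATE + WRITE + RELEASE
     * collapsed into a single request, with optional xattrs (SMB security
     * descriptors, Swift metadata, ...) set at create time.
     * Returns 0 on success, -1 with errno set on failure. */
    int glfs_creat_write(glfs_t *fs,
                         const char *path,
                         mode_t mode,
                         const void *data, size_t len,  /* <= one RPC's worth */
                         const char *const *xattr_names,
                         const void *const *xattr_vals,
                         const size_t *xattr_lens,
                         int nxattrs);

    /* Usage (hypothetical):
     *   glfs_creat_write(fs, "/objects/abc", 0644, buf, buf_len,
     *                    NULL, NULL, NULL, 0);
     * One round trip instead of the LOOKUP/CREATE/WRITE/RELEASE sequence
     * shown earlier in the thread. */

Because the whole operation reaches the server as one request, it is also
naturally atomic, which would remove the need for the rename()-after-close()
hack mentioned above.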
-Siva

_______________________________________________
Gluster-devel mailing list
[email protected]
http://supercolony.gluster.org/mailman/listinfo/gluster-devel