----- Original Message -----
> From: "Vijay Bellur" <[email protected]>
> To: "Krutika Dhananjay" <[email protected]>
> Cc: "Gluster Devel" <[email protected]>
> Sent: Tuesday, February 24, 2015 4:13:13 PM
> Subject: Re: [Gluster-devel] Sharding - Inode write fops - recoverability from failures - design
> On 02/24/2015 01:53 PM, Krutika Dhananjay wrote:
> >
> > ------------------------------------------------------------------------
> >
> > *From: *"Vijay Bellur" <[email protected]>
> > *To: *"Krutika Dhananjay" <[email protected]>
> > *Cc: *"Gluster Devel" <[email protected]>
> > *Sent: *Tuesday, February 24, 2015 12:26:58 PM
> > *Subject: *Re: [Gluster-devel] Sharding - Inode write fops -
> > recoverability from failures - design
> >
> > On 02/24/2015 12:19 PM, Krutika Dhananjay wrote:
> > >
> > > ------------------------------------------------------------------------
> > >
> > > *From: *"Vijay Bellur" <[email protected]>
> > > *To: *"Krutika Dhananjay" <[email protected]>
> > > *Cc: *"Gluster Devel" <[email protected]>
> > > *Sent: *Tuesday, February 24, 2015 11:35:28 AM
> > > *Subject: *Re: [Gluster-devel] Sharding - Inode write fops -
> > > recoverability from failures - design
> > >
> > > On 02/24/2015 10:36 AM, Krutika Dhananjay wrote:
> > > >
> > > > ------------------------------------------------------------------------
> > > >
> > > > *From: *"Vijay Bellur" <[email protected]>
> > > > *To: *"Krutika Dhananjay" <[email protected]>, "Gluster Devel"
> > > > <[email protected]>
> > > > *Sent: *Monday, February 23, 2015 5:25:57 PM
> > > > *Subject: *Re: [Gluster-devel] Sharding - Inode write fops -
> > > > recoverability from failures - design
> > > >
> > > > On 02/22/2015 06:08 PM, Krutika Dhananjay wrote:
> > > > > Hi,
> > > > >
> > > > > Please find the design doc for one of the problems in sharding
> > > > > which Pranith and I are trying to solve and its solution @
> > > > > http://review.gluster.org/#/c/9723/1.
> > > > > Reviews and feedback are much appreciated.
> > > >
> > > > Can this feature be made optional? I think there are use cases like
> > > > virtual machine image storage, HDFS, etc. where the number of
> > > > metadata queries might not be very high. It would be an acceptable
> > > > tradeoff in such cases to not be very efficient at answering metadata
> > > > queries but be very efficient for data operations.
> > > >
> > > > IOW, can we have two possible modes of operation for the sharding
> > > > translator to answer metadata queries?
> > > >
> > > > 1. One that behaves like a regular filesystem, where we expect a mix
> > > > of data and metadata operations. Your document seems to cover that
> > > > part well. We can look at optimizing behavior for multi-threaded
> > > > single-writer use cases after an initial implementation is in place.
> > > > Techniques like eager locking can be applied here.
> > > >
> > > > 2. Another mode where we do not expect a lot of metadata queries. In
> > > > this mode, we can visit all nodes where we have shards to answer
> > > > these queries.
> > > >
> > > > But for the sharding translator to be able to visit all shards, it
> > > > needs to know the last shard number. Without this, it will never know
> > > > when to stop looking up the different shards. For this to happen, we
> > > > still need to maintain the size attribute for each file.
> > >
> > > Wouldn't maintaining the total number of shards in the metadata shard
> > > be sufficient?
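
(To illustrate, before my earlier reply below: size and shard count
determine each other, so either attribute tells the translator when to
stop looking up shards in mode 2. A rough standalone sketch of the
arithmetic in C; block 0 is assumed here to be the original file, and
all of the names are made up for illustration, not actual translator
code.)

#include <stdint.h>

/* Index of the last shard of a file, given its size in bytes and the
 * configured shard block size. An empty file still occupies shard 0. */
static uint64_t
last_shard_index (uint64_t file_size, uint64_t shard_block_size)
{
        if (file_size == 0)
                return 0;
        return (file_size - 1) / shard_block_size;
}

/* The inverse direction, as discussed below: if N is the total number
 * of shards, file size = (N-1)*shard_block_size + size of last shard.
 * Note that this direction needs the size of the last shard, which is
 * exactly the extra lookup that persisting size avoids. */
static uint64_t
size_from_shard_count (uint64_t shard_count, uint64_t shard_block_size,
                       uint64_t last_shard_size)
{
        return (shard_count - 1) * shard_block_size + last_shard_size;
}
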
> > >
> > > Maintaining the correctness of "total number of shards" would again
> > > incur the same cost as maintaining size or any other metadata
> > > attribute if a client/brick crashes in the middle of a write fop
> > > before the attribute is committed to disk.
> > > In other words, we will again need to maintain a "dirty" and
> > > "committed" copy of the shard_count to ensure its correctness.
> >
> > I think the cost of maintaining "total number of shards" is not as
> > expensive as maintaining size or any other metadata attribute. The
> > shard count needs to be updated only when an extending operation
> > results in the creation of a new shard or when a truncate operation
> > results in the removal of a shard. Maintaining other metadata
> > attributes would need a 5-phase transaction for every write operation.
> > Isn't that the case?
> >
> > Even the size attribute changes only in the case of extending writes
> > and truncates. In fact, Pranith and I had initially chosen to persist
> > shard count as opposed to size in the first design for inode write
> > fops. But the reason we decided to go with size in the end is to
> > prevent an extra lookup on the last shard to find the total size of the
> > file (i.e., if N is the total number of shards,
> > file size = (N-1)*shard_block_size + sizeof(last shard)).
>
> I am probably confused about the definition of size.

By size, I mean the total size of the file in bytes.

> For maintaining accurate size, wouldn't we need to account for truncates
> and writes that happen within the scope of one shard?

Correct. This particular increase/decrease in size can be deduced from the
change in ia_size between postbuf and prebuf in the respective callback.

-Krutika

> -Vijay
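
P.S. To make the prebuf/postbuf accounting concrete: the write callback
carries the shard's attributes from before (prebuf) and after (postbuf)
the operation, so the size contribution of that one fop is just the
difference of the two ia_size values. A minimal sketch with simplified
stand-in types; the real translator would use struct iatt and the actual
fop callback, and the rest of the names here are made up:

#include <stdint.h>

/* Simplified stand-in for the ia_size member of GlusterFS's struct iatt. */
struct iatt_sketch {
        uint64_t ia_size;
};

/* Change in the shard's size caused by one write or truncate on it:
 * positive for an extending write, negative for a shrinking truncate,
 * and zero for an overwrite of already-allocated bytes. */
static int64_t
shard_size_delta (const struct iatt_sketch *prebuf,
                  const struct iatt_sketch *postbuf)
{
        return (int64_t) postbuf->ia_size - (int64_t) prebuf->ia_size;
}

/* In the callback, fold the delta into the cached total file size before
 * persisting it as the size attribute. */
static void
account_inode_write (uint64_t *total_file_size,
                     const struct iatt_sketch *prebuf,
                     const struct iatt_sketch *postbuf)
{
        *total_file_size += (uint64_t) shard_size_delta (prebuf, postbuf);
}

A zero delta corresponds to the non-extending writes discussed above,
where the size attribute does not change.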
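
P.P.S. On the "dirty" and "committed" copies: the point is about update
ordering, so that a crash between persisting the attribute and writing
the shard is detectable afterwards. The sketch below shows only that
general idea; it is not the on-disk xattr layout, nor the exact phases
of the transaction described in the design doc:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative per-file metadata; in practice this would live in
 * extended attributes on the metadata shard. */
struct shard_meta_sketch {
        uint64_t committed_size; /* last value known to be correct   */
        bool     dirty;          /* set while an update is in flight */
};

/* Hypothetical persistence hook; a stand-in for a synchronous setxattr
 * on the brick. */
static void
persist (const struct shard_meta_sketch *m)
{
        (void) m;
}

/* Crash-detectable ordering:
 *   1. persist dirty = true
 *   2. perform the write fop on the shard(s)
 *   3. persist the new committed value
 *   4. persist dirty = false
 * A crash anywhere between steps 1 and 4 leaves dirty set, telling
 * recovery that committed_size may be stale and must be recomputed,
 * e.g. by looking up the shards. */
static void
update_size (struct shard_meta_sketch *m, uint64_t new_size)
{
        m->dirty = true;
        persist (m);             /* step 1 */

        /* step 2: the actual write on the shard happens here */

        m->committed_size = new_size;
        persist (m);             /* step 3 */

        m->dirty = false;
        persist (m);             /* step 4 */
}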
_______________________________________________
Gluster-devel mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-devel
