All,
As pointed out in [1], parallel writes can result in incorrect quota enforcement.
[2] was an (unsuccessful) attempt to solve the issue. Some points about [2]:
in_progress_writes is updated _after_ we fetch the size. Because of this, two
writes can see the same size, so the issue is not solved. What we should do
instead is update in_progress_writes _before_ we fetch the size. If we do
this, it is guaranteed that at least one write sees the other's size
accounted in in_progress_writes. This approach has two issues:
1. Since we have already added the current write's size to in_progress_writes,
the current write is already accounted for in the size of the directory. This
is a minor issue and can be solved by subtracting the size of the current
write from the resulting cluster-wide in-progress size of the directory.
2. We might prematurely fail writes even though there is some space
available. Assume there is 5MB of free space. If two 5MB writes are issued in
parallel, both might fail, as each might see the other's size already
accounted, even though neither of them has succeeded. To solve this issue, I
am proposing the following algorithm:
* We assign each write an identity that is unique across the cluster -
say, a uuid.
* Among all the in-progress writes we pick one. The policy used can be an
arbitrary criterion, like the smallest of all the uuids. So, each brick
selects a candidate from its own in-progress writes _AND_ the incoming
candidate (see the pseudocode of get_dir_size below for more clarity). It
sends back this candidate along with the size of the directory. The brick
also remembers the last candidate it approved. Clustering translators like
dht pick one write among these replies, using the same logic the bricks used.
Now, along with the size, we also get a candidate chosen from the in-progress
writes. However, a new write might arrive on the brick in the time window in
which we fetch the size, and that write could become the candidate. We should
therefore compare the resulting cluster-wide candidate with the per-brick
candidate. So, the enforcement logic will be as below:
/* Both enforcer and get_dir_size are executed in the brick process. I've
   left out the logic of get_dir_size in cluster translators like dht. */

enforcer ()
{
        /* Note that this logic is executed independently for each directory
           on which a quota limit is set. All the in-progress writes, sizes
           and candidates are valid in the context of that directory. */

        my_delta = iov_length (input_iovec, input_count);
        my_id = getuuid ();

        add_my_delta_to_in_progress_size ();

        get_dir_size (my_id, &size, &in_progress_size, &cluster_candidate);

        /* our own write was added to in_progress_size above; don't count it
           twice */
        in_progress_size -= my_delta;

        if (((size + my_delta) < quota_limit) &&
            ((size + in_progress_size + my_delta) > quota_limit)) {
                /* we have to choose among the in-progress writes */
                brick_candidate = least_of_uuids (directory->in_progress_write_list,
                                                  directory->last_winning_candidate);

                if ((my_id == cluster_candidate) && (my_id == brick_candidate)) {
                        /* 1. subtract my_delta from per-brick in-progress writes
                           2. add my_delta to per-brick sizes of all parents
                           3. allow write
                           Getting brick_candidate above, 1 and 2 should be
                           done atomically. */
                } else {
                        /* 1. subtract my_delta from per-brick in-progress writes
                           2. fail write */
                }
        } else if ((size + my_delta) < quota_limit) {
                /* 1. subtract my_delta from per-brick in-progress writes
                   2. add my_delta to per-brick sizes of all parents
                   3. allow write
                   1 and 2 should be done atomically. */
        } else {
                fail_write ();
        }
}
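To make the "smallest of all the uuids" policy concrete, here is a minimal
sketch in C using libuuid's uuid_compare(). The list type and field names
(in_progress_write, next, id) are made up for illustration and are not
GlusterFS internals; it only illustrates what least_of_uuids()/least_uuid()
are expected to compute.

#include <uuid/uuid.h>
#include <stddef.h>

/* hypothetical per-directory list entry for an in-progress write */
struct in_progress_write {
        uuid_t                    id;    /* cluster-wide unique write id */
        struct in_progress_write *next;
};

/* Pick the least uuid among every in-progress write on this brick and the
 * previously approved candidate (last_winner), writing it into `winner'. */
static void
least_of_uuids (struct in_progress_write *list, const uuid_t last_winner,
                uuid_t winner /* out */)
{
        uuid_copy (winner, last_winner);

        for (; list != NULL; list = list->next) {
                if (uuid_compare (list->id, winner) < 0)
                        uuid_copy (winner, list->id);
        }
}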
get_dir_size (IN incoming_candidate_id, IN directory,
              OUT *winning_candidate, ...)
{
        directory->last_winning_candidate = winning_candidate
                = least_uuid (directory->in_progress_write_list,
                              incoming_candidate_id);
        ....
}
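For completeness, a hypothetical sketch of the aggregation a cluster
translator like dht would do over the per-brick replies (the part left out of
the pseudocode above). The reply structure and field names are assumptions
made purely for illustration; the point is that sizes are summed while the
candidate is picked with the same smallest-uuid policy the bricks use.

#include <uuid/uuid.h>
#include <stdint.h>

/* hypothetical shape of one brick's reply to get_dir_size */
struct brick_reply {
        uint64_t size;              /* directory size on this brick          */
        uint64_t in_progress_size;  /* in-progress writes on this brick      */
        uuid_t   candidate;         /* candidate returned by this brick      */
};

/* Combine per-brick replies into the cluster-wide view the enforcer sees. */
static void
aggregate_replies (struct brick_reply *replies, int count,
                   uint64_t *size, uint64_t *in_progress_size,
                   uuid_t cluster_candidate /* out */)
{
        int i;

        *size = 0;
        *in_progress_size = 0;
        uuid_copy (cluster_candidate, replies[0].candidate);

        for (i = 0; i < count; i++) {
                /* sizes and in-progress sizes are summed across bricks */
                *size += replies[i].size;
                *in_progress_size += replies[i].in_progress_size;

                /* same policy as the bricks: the smallest uuid wins */
                if (uuid_compare (replies[i].candidate, cluster_candidate) < 0)
                        uuid_copy (cluster_candidate, replies[i].candidate);
        }
}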
Comments?
[1] http://www.gluster.org/pipermail/gluster-devel/2015-May/045194.html
[2] http://review.gluster.org/#/c/6220/
regards,
Raghavendra.