Re: [Gluster-devel] Rebalance data migration and corruption

2016-02-10 Thread Raghavendra Gowdappa
> >
> > hmm.. I would prefer an infinite timeout. The only scenario where the brick
> > process can forcefully flush leases should be loss of the connection to the
> > rebalance process. The more scenarios in which the brick can flush leases
> > without the knowledge of the rebalance process, the more race windows we
> > open up for this bug to occur.
> >
> > In fact, at least in theory, to be correct the rebalance process should
> > replay all the transactions that happened under a lease which the brick
> > flushed out (after re-acquiring that lease). So, we would like to avoid
> > any such scenarios.
> >
> > Btw, what is the necessity of timeouts? Is it insurance against rogue
> > clients who won't respond to lease recalls?
> Yes. It is to protect against rogue clients and prevent starvation of other
> clients.
> 
> In the current design, every lease is associated with a lease-id (like the
> lock-owner in case of locks) and all further fops (I/Os) have to be done
> using this lease-id. So if any fop reaches the brick process with the
> lease-id of a lease which the brick has already flushed, we can send back a
> special error and the rebalance process can then replay all those fops.
> Will that be sufficient?

How do I pass the lease-id in a fop like readv? Should I pass it in xdata? This
is sufficient for the rebalance process. It can follow the algorithm below:

1. Acquire a read-lease on the entire file on src.
2. Note the offset at which this transaction has started. Initially it'll be
zero. But if leases were recalled, the offset will be a continuation from where
the last transaction left off.
3. Do multiple (read, src) and (write, dst) operations.
4. If (read, src) returns an error (because the lease was flushed), go to step
1 and restart the transaction from the offset remembered in step 2. Note that
we don't update the offset here; we replay the failed transaction again. We
update the offset only on a successful unlock.

On receiving a lease-recall notification from the brick, the rebalance process
does:
1. Note the offset up to which it has successfully copied the file from src to
dst.
2. Make sure at least one (read, src) and (write, dst) has been done since we
last acquired the lease (on a best-effort basis). This ensures that the
rebalance process won't get stuck in an infinite loop.
3. Issue an unlock. If the unlock is successful, the next transaction continues
from the offset noted in 1. Otherwise, this transaction is considered a failure
and the rebalance process behaves exactly as in the read-failure case above
(lease expiry).

In this algorithm, to avoid the rebalance process getting stuck in an infinite
loop, we should make sure unlocks succeed (to the extent they can be made to
succeed). We can also add a maximum number of retries for a transaction on the
same region of the file and fail the migration once that limit is exceeded.
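
To make sure we are talking about the same loop, here is a rough,
self-contained sketch (Python pseudocode). The Brick class, helper names and
chunk size are all made up for illustration and are not gluster APIs; the
lease-id would presumably travel in the xdata of every fop, as asked above.

# Sketch of the transaction loop above. The offset is committed only after a
# successful unlock; a flushed lease causes the whole transaction to be
# replayed from the remembered offset.

import os

CHUNK = 4096
MAX_RETRIES = 5                          # optional cap discussed above

class LeaseFlushed(Exception):
    """Brick saw a fop carrying the lease-id of a lease it already flushed."""

class Brick:
    """Toy src brick: hands out read-leases and may flush them underneath us."""
    def __init__(self, data):
        self.data = bytearray(data)
        self.current_lease = 0
    def acquire_read_lease(self):
        self.current_lease += 1
        return self.current_lease
    def read(self, lease_id, off, size):
        if lease_id != self.current_lease:
            raise LeaseFlushed()         # special error -> replay transaction
        return bytes(self.data[off:off + size])
    def unlock(self, lease_id):
        return lease_id == self.current_lease

def migrate(src, dst, size):
    offset, retries = 0, 0               # offset advances only on good unlock
    while offset < size:
        lease_id = src.acquire_read_lease()            # step 1
        start = offset                                 # step 2
        try:
            while start < size:
                data = src.read(lease_id, start, CHUNK)     # (read, src)
                dst[start:start + len(data)] = data          # (write, dst)
                start += len(data)
            if src.unlock(lease_id):     # commit progress only on good unlock
                offset, retries = start, 0
                continue
            retries += 1                 # unlock failed: replay from `offset`
        except LeaseFlushed:
            retries += 1                 # step 4: replay from remembered offset
        if retries > MAX_RETRIES:
            raise RuntimeError("too many retries on the same region")

src = Brick(os.urandom(64 * 1024))
dst = bytearray(64 * 1024)
migrate(src, dst, len(src.data))
assert bytes(dst) == bytes(src.data)

The lease-recall path would follow the same shape: finish the in-flight
(read, src)/(write, dst) pair, unlock, and fall back into the outer loop so
the next transaction reacquires the lease.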

> 
> CCin Poornima who has been implementing it.
> 
> 
> Thanks,
> Soumya
> 


Re: [Gluster-devel] Rebalance data migration and corruption

2016-02-09 Thread Soumya Koduri



On 02/09/2016 12:30 PM, Raghavendra G wrote:

    Right. But if there is simultaneous access to the same file from
    any other client and the rebalance process, delegations shall not be
    granted (or shall be revoked if already granted) even though they are
    operating at different offsets. So if you rely only on delegations,
    migration may not proceed if an application has held a lock or is
    doing any I/O.

Does the brick process wait for the response of the delegation holder
(the rebalance process here) before it wipes out the delegation/locks? If
that's the case, the rebalance process can complete one transaction of
(read, src) and (write, dst) before responding to a delegation recall.
That way there is no starvation for either applications or the rebalance
process (though this makes both of them slower, but that cannot be helped,
I think).


Yes. The brick process should wait for a certain period before forcefully
revoking the delegations if they are not returned by the client. Also, if
required (as NFS servers do), we can choose to increase this timeout value
at run time if the client is diligently flushing the data.


hmm.. I would prefer an infinite timeout. The only scenario where the brick
process can forcefully flush leases should be loss of the connection to the
rebalance process. The more scenarios in which the brick can flush leases
without the knowledge of the rebalance process, the more race windows we
open up for this bug to occur.

In fact, at least in theory, to be correct the rebalance process should
replay all the transactions that happened under a lease which the brick
flushed out (after re-acquiring that lease). So, we would like to avoid
any such scenarios.

Btw, what is the necessity of timeouts? Is it insurance against rogue
clients who won't respond to lease recalls?
Yes. It is to protect against rogue clients and prevent starvation of other
clients.


In the current design, every lease is associated with a lease-id (like the
lock-owner in case of locks) and all further fops (I/Os) have to be done
using this lease-id. So if any fop reaches the brick process with the
lease-id of a lease which the brick has already flushed, we can send back a
special error and the rebalance process can then replay all those fops.
Will that be sufficient?


CCin Poornima who has been implementing it.


Thanks,
Soumya


Re: [Gluster-devel] Rebalance data migration and corruption

2016-02-08 Thread Raghavendra Gowdappa


- Original Message -
> From: "Joe Julian" <j...@julianfamily.org>
> To: gluster-devel@gluster.org
> Sent: Monday, February 8, 2016 12:20:27 PM
> Subject: Re: [Gluster-devel] Rebalance data migration and corruption
> 
> Is this in current release versions?

Yes. This bug is present in currently released versions. However, it can happen
only if application writes land on a file while it is being migrated. So,
roughly speaking, the probability is low.

> 
> On 02/07/2016 07:43 PM, Shyam wrote:
> > On 02/06/2016 06:36 PM, Raghavendra Gowdappa wrote:
> >>
> >>
> >> - Original Message -
> >>> From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> >>> To: "Sakshi Bansal" <saban...@redhat.com>, "Susant Palai"
> >>> <spa...@redhat.com>
> >>> Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Nithya
> >>> Balachandran" <nbala...@redhat.com>, "Shyamsundar
> >>> Ranganathan" <srang...@redhat.com>
> >>> Sent: Friday, February 5, 2016 4:32:40 PM
> >>> Subject: Re: Rebalance data migration and corruption
> >>>
> >>> +gluster-devel
> >>>
> >>>>
> >>>> Hi Sakshi/Susant,
> >>>>
> >>>> - There is a data corruption issue in migration code. Rebalance
> >>>> process,
> >>>>1. Reads data from src
> >>>>2. Writes (say w1) it to dst
> >>>>
> >>>>However, 1 and 2 are not atomic, so another write (say w2) to
> >>>> same region
> >>>>can happen between 1. But these two writes can reach dst in the
> >>>> order
> >>>>(w2,
> >>>>w1) resulting in a subtle corruption. This issue is not fixed
> >>>> yet and can
> >>>>cause subtle data corruptions. The fix is simple and involves
> >>>> rebalance
> >>>>process acquiring a mandatory lock to make 1 and 2 atomic.
> >>>
> >>> We can make use of compound fop framework to make sure we don't
> >>> suffer a
> >>> significant performance hit. Following will be the sequence of
> >>> operations
> >>> done by rebalance process:
> >>>
> >>> 1. issues a compound (mandatory lock, read) operation on src.
> >>> 2. writes this data to dst.
> >>> 3. issues unlock of lock acquired in 1.
> >>>
> >>> Please co-ordinate with Anuradha for implementation of this compound
> >>> fop.
> >>>
> >>> Following are the issues I see with this approach:
> >>> 1. features/locks provides mandatory lock functionality only for
> >>> posix-locks
> >>> (flock and fcntl based locks). So, mandatory locks will be
> >>> posix-locks which
> >>> will conflict with locks held by application. So, if an application
> >>> has held
> >>> an fcntl/flock, migration cannot proceed.
> >>
> >> We can implement a "special" domain for mandatory internal locks.
> >> These locks will behave similar to posix mandatory locks in that
> >> conflicting fops (like write, read) are blocked/failed if they are
> >> done while a lock is held.
> >>
> >>> 2. data migration will be less efficient because of an extra unlock
> >>> (with
> >>> compound lock + read) or extra lock and unlock (for non-compound fop
> >>> based
> >>> implementation) for every read it does from src.
> >>
> >> Can we use delegations here? Rebalance process can acquire a
> >> mandatory-write-delegation (an exclusive lock with a functionality
> >> that delegation is recalled when a write operation happens). In that
> >> case rebalance process, can do something like:
> >>
> >> 1. Acquire a read delegation for entire file.
> >> 2. Migrate the entire file.
> >> 3. Remove/unlock/give-back the delegation it has acquired.
> >>
> >> If a recall is issued from brick (when a write happens from mount),
> >> it completes the current write to dst (or throws away the read from
> >> src) to maintain atomicity. Before doing next set of (read, src) and
> >> (write, dst) tries to reacquire lock.
> >
> > With delegations this simplifies the normal path, when a file is
> > exclusively handled by rebalance. It also improves the case where a
> > client and rebalance are conflicting on a file, to degrade to
> > mandatory locks by either parties.
> >
> > I would prefer we take the delegation route for such needs in the future.
> >
> >>
> >> @Soumyak, can something like this be done with delegations?
> >>
> >> @Pranith,
> >> Afr does transactions for writing to its subvols. Can you suggest any
> >> optimizations here so that rebalance process can have a transaction
> >> for (read, src) and (write, dst) with minimal performance overhead?
> >>
> >> regards,
> >> Raghavendra.
> >>
> >>>
> >>> Comments?
> >>>
> >>>>
> >>>> regards,
> >>>> Raghavendra.
> >>>


Re: [Gluster-devel] Rebalance data migration and corruption

2016-02-08 Thread Soumya Koduri



On 02/08/2016 09:13 AM, Shyam wrote:

On 02/06/2016 06:36 PM, Raghavendra Gowdappa wrote:



- Original Message -

From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
To: "Sakshi Bansal" <saban...@redhat.com>, "Susant Palai"
<spa...@redhat.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Nithya
Balachandran" <nbala...@redhat.com>, "Shyamsundar
Ranganathan" <srang...@redhat.com>
Sent: Friday, February 5, 2016 4:32:40 PM
Subject: Re: Rebalance data migration and corruption

+gluster-devel



Hi Sakshi/Susant,

- There is a data corruption issue in migration code. Rebalance
process,
   1. Reads data from src
   2. Writes (say w1) it to dst

   However, 1 and 2 are not atomic, so another write (say w2) to
same region
   can happen between 1. But these two writes can reach dst in the
order
   (w2,
   w1) resulting in a subtle corruption. This issue is not fixed yet
and can
   cause subtle data corruptions. The fix is simple and involves
rebalance
   process acquiring a mandatory lock to make 1 and 2 atomic.


We can make use of compound fop framework to make sure we don't suffer a
significant performance hit. Following will be the sequence of
operations
done by rebalance process:

1. issues a compound (mandatory lock, read) operation on src.
2. writes this data to dst.
3. issues unlock of lock acquired in 1.

Please co-ordinate with Anuradha for implementation of this compound
fop.

Following are the issues I see with this approach:
1. features/locks provides mandatory lock functionality only for
posix-locks
(flock and fcntl based locks). So, mandatory locks will be
posix-locks which
will conflict with locks held by application. So, if an application
has held
an fcntl/flock, migration cannot proceed.


What if the file is opened with O_NONBLOCK? Can't the rebalance process skip
the file and continue if mandatory lock acquisition fails?




We can implement a "special" domain for mandatory internal locks.
These locks will behave similar to posix mandatory locks in that
conflicting fops (like write, read) are blocked/failed if they are
done while a lock is held.


So is the only difference between mandatory internal locks and posix
mandatory locks that internal locks shall not conflict with other
application locks (advisory/mandatory)?





2. data migration will be less efficient because of an extra unlock
(with
compound lock + read) or extra lock and unlock (for non-compound fop
based
implementation) for every read it does from src.


Can we use delegations here? Rebalance process can acquire a
mandatory-write-delegation (an exclusive lock with a functionality
that delegation is recalled when a write operation happens). In that
case rebalance process, can do something like:

1. Acquire a read delegation for entire file.
2. Migrate the entire file.
3. Remove/unlock/give-back the delegation it has acquired.

If a recall is issued from brick (when a write happens from mount), it
completes the current write to dst (or throws away the read from src)
to maintain atomicity. Before doing next set of (read, src) and
(write, dst) tries to reacquire lock.


With delegations this simplifies the normal path, when a file is
exclusively handled by rebalance. It also improves the case where a
client and rebalance are conflicting on a file, to degrade to mandatory
locks by either parties.

I would prefer we take the delegation route for such needs in the future.

Right. But if there is simultaneous access to the same file from any
other client and the rebalance process, delegations shall not be granted
(or shall be revoked if already granted) even though they are operating at
different offsets. So if you rely only on delegations, migration may not
proceed if an application has held a lock or is doing any I/O.


Also, ideally the rebalance process has to take a write delegation, as it
would end up writing the data on the destination brick, which shall affect
READ I/Os (though of course we can have special checks/hacks for internally
generated fops).


That said, having delegations shall definitely ensure correctness with 
respect to exclusive file access.


Thanks,
Soumya



@Soumyak, can something like this be done with delegations?

@Pranith,
Afr does transactions for writing to its subvols. Can you suggest any
optimizations here so that rebalance process can have a transaction
for (read, src) and (write, dst) with minimal performance overhead?

regards,
Raghavendra.



Comments?



regards,
Raghavendra.





Re: [Gluster-devel] Rebalance data migration and corruption

2016-02-08 Thread Raghavendra G
On Mon, Feb 8, 2016 at 4:31 PM, Soumya Koduri <skod...@redhat.com> wrote:

>
>
> On 02/08/2016 09:13 AM, Shyam wrote:
>
>> On 02/06/2016 06:36 PM, Raghavendra Gowdappa wrote:
>>
>>>
>>>
>>> - Original Message -
>>>
>>>> From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
>>>> To: "Sakshi Bansal" <saban...@redhat.com>, "Susant Palai"
>>>> <spa...@redhat.com>
>>>> Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Nithya
>>>> Balachandran" <nbala...@redhat.com>, "Shyamsundar
>>>> Ranganathan" <srang...@redhat.com>
>>>> Sent: Friday, February 5, 2016 4:32:40 PM
>>>> Subject: Re: Rebalance data migration and corruption
>>>>
>>>> +gluster-devel
>>>>
>>>>
>>>>> Hi Sakshi/Susant,
>>>>>
>>>>> - There is a data corruption issue in migration code. Rebalance
>>>>> process,
>>>>>1. Reads data from src
>>>>>2. Writes (say w1) it to dst
>>>>>
>>>>>However, 1 and 2 are not atomic, so another write (say w2) to
>>>>> same region
>>>>>can happen between 1. But these two writes can reach dst in the
>>>>> order
>>>>>(w2,
>>>>>w1) resulting in a subtle corruption. This issue is not fixed yet
>>>>> and can
>>>>>cause subtle data corruptions. The fix is simple and involves
>>>>> rebalance
>>>>>process acquiring a mandatory lock to make 1 and 2 atomic.
>>>>>
>>>>
>>>> We can make use of compound fop framework to make sure we don't suffer a
>>>> significant performance hit. Following will be the sequence of
>>>> operations
>>>> done by rebalance process:
>>>>
>>>> 1. issues a compound (mandatory lock, read) operation on src.
>>>> 2. writes this data to dst.
>>>> 3. issues unlock of lock acquired in 1.
>>>>
>>>> Please co-ordinate with Anuradha for implementation of this compound
>>>> fop.
>>>>
>>>> Following are the issues I see with this approach:
>>>> 1. features/locks provides mandatory lock functionality only for
>>>> posix-locks
>>>> (flock and fcntl based locks). So, mandatory locks will be
>>>> posix-locks which
>>>> will conflict with locks held by application. So, if an application
>>>> has held
>>>> an fcntl/flock, migration cannot proceed.
>>>>
>>>
> What if the file is opened with O_NONBLOCK? Cant rebalance process skip
> the file and continue in case if mandatory lock acquisition fails?


Similar functionality can be achieved by acquiring a non-blocking inodelk,
like SETLK (as opposed to SETLKW). However, whether the rebalance process
should block or not depends on the use case. In some use-cases (like
remove-brick) the rebalance process _has_ to migrate all the files. Even in
other scenarios, skipping too many files is not a good idea, as it defeats
the purpose of running rebalance. So one of the design goals is to migrate
as many files as possible without making the design too complex.
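
For illustration, a tiny sketch of that policy on the migrator side (Python
pseudocode; the lock helpers are made-up stand-ins for SETLK-style vs
SETLKW-style inodelk acquisition, not actual APIs):

# Try a non-blocking acquisition first; then either wait (when every file must
# be migrated, e.g. remove-brick) or skip/defer the busy file (plain rebalance).

MUST_MIGRATE = "remove-brick"        # has to move every file
BEST_EFFORT = "rebalance"            # may skip or defer busy files

def lock_for_migration(path, mode, try_lock_nonblocking, lock_blocking):
    """Return True once we hold the lock on `path`, False if we chose to skip."""
    if try_lock_nonblocking(path):            # SETLK-style attempt
        return True
    if mode == MUST_MIGRATE:
        lock_blocking(path)                   # SETLKW-style: wait it out
        return True
    return False                              # busy file: defer to a later pass

# Toy usage: pretend f2 is currently locked by an application.
busy = {"f2"}
migrating = []
for name in ["f1", "f2", "f3"]:
    if lock_for_migration(name, BEST_EFFORT,
                          try_lock_nonblocking=lambda p: p not in busy,
                          lock_blocking=lambda p: None):
        migrating.append(name)
print("migrating now:", migrating)    # ['f1', 'f3']; f2 retried in a later pass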


>
>
>>> We can implement a "special" domain for mandatory internal locks.
>>> These locks will behave similar to posix mandatory locks in that
>>> conflicting fops (like write, read) are blocked/failed if they are
>>> done while a lock is held.
>>>
>>
> So is the only difference between mandatory internal locks and posix
> mandatory locks is that internal locks shall not conflict with other
> application locks(advisory/mandatory)?


Yes. Mandatory internal locks (aka mandatory inodelk for this discussion)
will conflict only within their domain. They also conflict with any fops that
might change the file (primarily write here, but other fops can be added
based on requirements). So in a fop like writev we need to check two lists -
the external lock (posix lock) list _and_ the mandatory inodelk list.
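
A simplified model of that check (Python; the lock tuples and the `owner`
field below are illustrative only and stand in for the lock-owner/lease-id,
not the actual features/locks structures):

# writev-time check described above: the incoming write must not overlap a
# posix lock held by someone else, nor a mandatory inodelk (the rebalance
# process's internal lock) owned by someone else. Locks are (owner, start, len).

def overlaps(a_start, a_len, b_start, b_len):
    return a_start < b_start + b_len and b_start < a_start + a_len

def write_allowed(owner, offset, length, posix_locks, mandatory_inodelks):
    for lk_owner, lk_off, lk_len in list(posix_locks) + list(mandatory_inodelks):
        if lk_owner != owner and overlaps(offset, length, lk_off, lk_len):
            return False              # conflicting fop: block or fail the write
    return True

# Rebalance holds a mandatory inodelk on [0, 1MB): a client write inside that
# range is rejected, a write outside it goes through untouched.
inodelks = [("rebalance-proc", 0, 1 << 20)]
posix_locks = []
assert not write_allowed("client-1", 4096, 4096, posix_locks, inodelks)
assert write_allowed("client-1", 2 << 20, 4096, posix_locks, inodelks)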

The reason (if not clear) for the rebalance process using mandatory locks is
that clients need not be bothered with acquiring a lock (which would
unnecessarily degrade I/O performance when no rebalance is going on). Thanks
to Raghavendra Talur for suggesting this idea (in the different context of
lock migration, but the use-cases are similar).


>
>
>>> 2. data migration will be less efficient because of an extra unlock
>>>> (with
>>>> compound lock 

Re: [Gluster-devel] Rebalance data migration and corruption

2016-02-08 Thread Raghavendra G
>> Right. But if there is simultaneous access to the same file from
>> any other client and the rebalance process, delegations shall not be
>> granted (or shall be revoked if already granted) even though they are
>> operating at different offsets. So if you rely only on delegations,
>> migration may not proceed if an application has held a lock or is
>> doing any I/O.
>>
>>
>> Does the brick process wait for the response of delegation holder
>> (rebalance process here) before it wipes out the delegation/locks? If
>> that's the case, rebalance process can complete one transaction of
>> (read, src) and (write, dst) before responding to a delegation recall.
>> That way there is no starvation for both applications and rebalance
>> process (though this makes both of them slower, but that cannot helped I
>> think).
>>
>
> Yes. The brick process should wait for a certain period before forcefully
> revoking the delegations if they are not returned by the client. Also, if
> required (as NFS servers do), we can choose to increase this timeout value
> at run time if the client is diligently flushing the data.


hmm.. I would prefer an infinite timeout. The only scenario where the brick
process can forcefully flush leases should be loss of the connection to the
rebalance process. The more scenarios in which the brick can flush leases
without the knowledge of the rebalance process, the more race windows we open
up for this bug to occur.

In fact, at least in theory, to be correct the rebalance process should replay
all the transactions that happened under a lease which the brick flushed out
(after re-acquiring that lease). So, we would like to avoid any such scenarios.

Btw, what is the necessity of timeouts? Is it insurance against rogue clients
who won't respond to lease recalls?

Re: [Gluster-devel] Rebalance data migration and corruption

2016-02-08 Thread Raghavendra Gowdappa


- Original Message -
> From: "Joe Julian" <j...@julianfamily.org>
> To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> Cc: gluster-devel@gluster.org
> Sent: Monday, February 8, 2016 9:08:45 PM
> Subject: Re: [Gluster-devel] Rebalance data migration and corruption
> 
> 
> 
> On 02/08/2016 12:18 AM, Raghavendra Gowdappa wrote:
> >
> > - Original Message -
> >> From: "Joe Julian" <j...@julianfamily.org>
> >> To: gluster-devel@gluster.org
> >> Sent: Monday, February 8, 2016 12:20:27 PM
> >> Subject: Re: [Gluster-devel] Rebalance data migration and corruption
> >>
> >> Is this in current release versions?
> > Yes. This bug is present in currently released versions. However, it can
> > happen only if writes from application are happening to a file when it is
> > being migrated. So, vaguely one can say probability is less.
> 
> Probability is quite high when the volume is used for VM images, which
> many are.

The primary requirement for this corruption is that the file should be under
migration. Given that rebalance is done only during add/remove-brick scenarios
(or maybe as routine housekeeping to make lookups faster), I said the
probability is lower. However, this will not be the case with tier, where files
can be under constant promotion/demotion because of access patterns. If there
is constant migration, dht too is susceptible to this bug with similar
probability.

> 
> >
> >> On 02/07/2016 07:43 PM, Shyam wrote:
> >>> On 02/06/2016 06:36 PM, Raghavendra Gowdappa wrote:
> >>>>
> >>>> - Original Message -
> >>>>> From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> >>>>> To: "Sakshi Bansal" <saban...@redhat.com>, "Susant Palai"
> >>>>> <spa...@redhat.com>
> >>>>> Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Nithya
> >>>>> Balachandran" <nbala...@redhat.com>, "Shyamsundar
> >>>>> Ranganathan" <srang...@redhat.com>
> >>>>> Sent: Friday, February 5, 2016 4:32:40 PM
> >>>>> Subject: Re: Rebalance data migration and corruption
> >>>>>
> >>>>> +gluster-devel
> >>>>>
> >>>>>> Hi Sakshi/Susant,
> >>>>>>
> >>>>>> - There is a data corruption issue in migration code. Rebalance
> >>>>>> process,
> >>>>>> 1. Reads data from src
> >>>>>> 2. Writes (say w1) it to dst
> >>>>>>
> >>>>>> However, 1 and 2 are not atomic, so another write (say w2) to
> >>>>>> same region
> >>>>>> can happen between 1. But these two writes can reach dst in the
> >>>>>> order
> >>>>>> (w2,
> >>>>>> w1) resulting in a subtle corruption. This issue is not fixed
> >>>>>> yet and can
> >>>>>> cause subtle data corruptions. The fix is simple and involves
> >>>>>> rebalance
> >>>>>> process acquiring a mandatory lock to make 1 and 2 atomic.
> >>>>> We can make use of compound fop framework to make sure we don't
> >>>>> suffer a
> >>>>> significant performance hit. Following will be the sequence of
> >>>>> operations
> >>>>> done by rebalance process:
> >>>>>
> >>>>> 1. issues a compound (mandatory lock, read) operation on src.
> >>>>> 2. writes this data to dst.
> >>>>> 3. issues unlock of lock acquired in 1.
> >>>>>
> >>>>> Please co-ordinate with Anuradha for implementation of this compound
> >>>>> fop.
> >>>>>
> >>>>> Following are the issues I see with this approach:
> >>>>> 1. features/locks provides mandatory lock functionality only for
> >>>>> posix-locks
> >>>>> (flock and fcntl based locks). So, mandatory locks will be
> >>>>> posix-locks which
> >>>>> will conflict with locks held by application. So, if an application
> >>>>> has held
> >>>>> an fcntl/flock, migration cannot proceed.
> >>>> We can implement a "special" domain for mandatory internal locks.
> >>>> These locks will behave simil

Re: [Gluster-devel] Rebalance data migration and corruption

2016-02-08 Thread Soumya Koduri



On 02/09/2016 10:27 AM, Raghavendra G wrote:



On Mon, Feb 8, 2016 at 4:31 PM, Soumya Koduri <skod...@redhat.com> wrote:



On 02/08/2016 09:13 AM, Shyam wrote:

On 02/06/2016 06:36 PM, Raghavendra Gowdappa wrote:



- Original Message -

From: "Raghavendra Gowdappa" <rgowd...@redhat.com
<mailto:rgowd...@redhat.com>>
To: "Sakshi Bansal" <saban...@redhat.com
<mailto:saban...@redhat.com>>, "Susant Palai"
<spa...@redhat.com <mailto:spa...@redhat.com>>
Cc: "Gluster Devel" <gluster-devel@gluster.org
<mailto:gluster-devel@gluster.org>>, "Nithya
Balachandran" <nbala...@redhat.com
<mailto:nbala...@redhat.com>>, "Shyamsundar
Ranganathan" <srang...@redhat.com
        <mailto:srang...@redhat.com>>
        Sent: Friday, February 5, 2016 4:32:40 PM
Subject: Re: Rebalance data migration and corruption

+gluster-devel


Hi Sakshi/Susant,

- There is a data corruption issue in migration code. Rebalance process,
   1. Reads data from src
   2. Writes (say w1) it to dst

   However, 1 and 2 are not atomic, so another write (say w2) to same region
   can happen between 1. But these two writes can reach dst in the order (w2,
   w1) resulting in a subtle corruption. This issue is not fixed yet and can
   cause subtle data corruptions. The fix is simple and involves rebalance
   process acquiring a mandatory lock to make 1 and 2 atomic.


We can make use of compound fop framework to make sure we don't suffer a
significant performance hit. Following will be the sequence of operations
done by rebalance process:

1. issues a compound (mandatory lock, read) operation on src.
2. writes this data to dst.
3. issues unlock of lock acquired in 1.

Please co-ordinate with Anuradha for implementation of this compound fop.

Following are the issues I see with this approach:
1. features/locks provides mandatory lock functionality only for posix-locks
(flock and fcntl based locks). So, mandatory locks will be posix-locks which
will conflict with locks held by application. So, if an application has held
an fcntl/flock, migration cannot proceed.


What if the file is opened with O_NONBLOCK? Cant rebalance process
skip the file and continue in case if mandatory lock acquisition fails?


Similar functionality can be achieved by acquiring non-blocking inodelk
like SETLK (as opposed to SETLKW). However whether rebalance process
should block or not depends on the use case. In Some use-cases (like
remove-brick) rebalance process _has_ to migrate all the files. Even for
other scenarios skipping too many files is not a good idea as it beats
the purpose of running rebalance. So one of the design goals is to
migrate as many files as possible without making design too complex.




We can implement a "special" domain for mandatory internal
locks.
These locks will behave similar to posix mandatory locks in that
conflicting fops (like write, read) are blocked/failed if
they are
done while a lock is held.


So is the only difference between mandatory internal locks and posix
mandatory locks is that internal locks shall not conflict with other
application locks(advisory/mandatory)?


Yes. Mandatory internal locks (aka Mandatory inodelk for this
discussion) will conflict only in their domain. They also conflict with
any fops that might change the file (primarily write here, but different
fops can be added based on requirement). So in a fop like writev we need
to check in two lists - external lock (posix lock) list _and_ mandatory
inodelk list.

The reason (if not clear) for using mandatory locks by rebalance process
is that clients need not be bothered with acquiring a lock (which will
unnecessarily degrade performance of I/O when there is no rebalance
going on). Th

Re: [Gluster-devel] Rebalance data migration and corruption

2016-02-08 Thread Joe Julian



On 02/08/2016 12:18 AM, Raghavendra Gowdappa wrote:


- Original Message -

From: "Joe Julian" <j...@julianfamily.org>
To: gluster-devel@gluster.org
Sent: Monday, February 8, 2016 12:20:27 PM
Subject: Re: [Gluster-devel] Rebalance data migration and corruption

Is this in current release versions?

Yes. This bug is present in currently released versions. However, it can happen
only if application writes land on a file while it is being migrated. So,
roughly speaking, the probability is low.


Probability is quite high when the volume is used for VM images, which 
many are.





On 02/07/2016 07:43 PM, Shyam wrote:

On 02/06/2016 06:36 PM, Raghavendra Gowdappa wrote:


- Original Message -

From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
To: "Sakshi Bansal" <saban...@redhat.com>, "Susant Palai"
<spa...@redhat.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Nithya
Balachandran" <nbala...@redhat.com>, "Shyamsundar
Ranganathan" <srang...@redhat.com>
Sent: Friday, February 5, 2016 4:32:40 PM
Subject: Re: Rebalance data migration and corruption

+gluster-devel


Hi Sakshi/Susant,

- There is a data corruption issue in migration code. Rebalance
process,
1. Reads data from src
2. Writes (say w1) it to dst

However, 1 and 2 are not atomic, so another write (say w2) to
same region
can happen between 1. But these two writes can reach dst in the
order
(w2,
w1) resulting in a subtle corruption. This issue is not fixed
yet and can
cause subtle data corruptions. The fix is simple and involves
rebalance
process acquiring a mandatory lock to make 1 and 2 atomic.

We can make use of compound fop framework to make sure we don't
suffer a
significant performance hit. Following will be the sequence of
operations
done by rebalance process:

1. issues a compound (mandatory lock, read) operation on src.
2. writes this data to dst.
3. issues unlock of lock acquired in 1.

Please co-ordinate with Anuradha for implementation of this compound
fop.

Following are the issues I see with this approach:
1. features/locks provides mandatory lock functionality only for
posix-locks
(flock and fcntl based locks). So, mandatory locks will be
posix-locks which
will conflict with locks held by application. So, if an application
has held
an fcntl/flock, migration cannot proceed.

We can implement a "special" domain for mandatory internal locks.
These locks will behave similar to posix mandatory locks in that
conflicting fops (like write, read) are blocked/failed if they are
done while a lock is held.


2. data migration will be less efficient because of an extra unlock
(with
compound lock + read) or extra lock and unlock (for non-compound fop
based
implementation) for every read it does from src.

Can we use delegations here? Rebalance process can acquire a
mandatory-write-delegation (an exclusive lock with a functionality
that delegation is recalled when a write operation happens). In that
case rebalance process, can do something like:

1. Acquire a read delegation for entire file.
2. Migrate the entire file.
3. Remove/unlock/give-back the delegation it has acquired.

If a recall is issued from brick (when a write happens from mount),
it completes the current write to dst (or throws away the read from
src) to maintain atomicity. Before doing next set of (read, src) and
(write, dst) tries to reacquire lock.

With delegations this simplifies the normal path, when a file is
exclusively handled by rebalance. It also improves the case where a
client and rebalance are conflicting on a file, to degrade to
mandatory locks by either parties.

I would prefer we take the delegation route for such needs in the future.


@Soumyak, can something like this be done with delegations?

@Pranith,
Afr does transactions for writing to its subvols. Can you suggest any
optimizations here so that rebalance process can have a transaction
for (read, src) and (write, dst) with minimal performance overhead?

regards,
Raghavendra.


Comments?


regards,
Raghavendra.



Re: [Gluster-devel] Rebalance data migration and corruption

2016-02-07 Thread Shyam

On 02/06/2016 06:36 PM, Raghavendra Gowdappa wrote:



- Original Message -

From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
To: "Sakshi Bansal" <saban...@redhat.com>, "Susant Palai" <spa...@redhat.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Nithya Balachandran" 
<nbala...@redhat.com>, "Shyamsundar
Ranganathan" <srang...@redhat.com>
Sent: Friday, February 5, 2016 4:32:40 PM
Subject: Re: Rebalance data migration and corruption

+gluster-devel



Hi Sakshi/Susant,

- There is a data corruption issue in migration code. Rebalance process,
   1. Reads data from src
   2. Writes (say w1) it to dst

   However, 1 and 2 are not atomic, so another write (say w2) to same region
   can happen between 1. But these two writes can reach dst in the order
   (w2,
   w1) resulting in a subtle corruption. This issue is not fixed yet and can
   cause subtle data corruptions. The fix is simple and involves rebalance
   process acquiring a mandatory lock to make 1 and 2 atomic.


We can make use of compound fop framework to make sure we don't suffer a
significant performance hit. Following will be the sequence of operations
done by rebalance process:

1. issues a compound (mandatory lock, read) operation on src.
2. writes this data to dst.
3. issues unlock of lock acquired in 1.

Please co-ordinate with Anuradha for implementation of this compound fop.

Following are the issues I see with this approach:
1. features/locks provides mandatory lock functionality only for posix-locks
(flock and fcntl based locks). So, mandatory locks will be posix-locks which
will conflict with locks held by application. So, if an application has held
an fcntl/flock, migration cannot proceed.


We can implement a "special" domain for mandatory internal locks. These locks 
will behave similar to posix mandatory locks in that conflicting fops (like write, read) 
are blocked/failed if they are done while a lock is held.


2. data migration will be less efficient because of an extra unlock (with
compound lock + read) or extra lock and unlock (for non-compound fop based
implementation) for every read it does from src.


Can we use delegations here? Rebalance process can acquire a 
mandatory-write-delegation (an exclusive lock with a functionality that 
delegation is recalled when a write operation happens). In that case rebalance 
process, can do something like:

1. Acquire a read delegation for entire file.
2. Migrate the entire file.
3. Remove/unlock/give-back the delegation it has acquired.

If a recall is issued from brick (when a write happens from mount), it 
completes the current write to dst (or throws away the read from src) to 
maintain atomicity. Before doing next set of (read, src) and (write, dst) tries 
to reacquire lock.


With delegations this simplifies the normal path, when a file is
exclusively handled by rebalance. It also improves the case where a
client and rebalance are conflicting on a file, by degrading to mandatory
locks for either party.


I would prefer we take the delegation route for such needs in the future.



@Soumyak, can something like this be done with delegations?

@Pranith,
Afr does transactions for writing to its subvols. Can you suggest any 
optimizations here so that rebalance process can have a transaction for (read, 
src) and (write, dst) with minimal performance overhead?

regards,
Raghavendra.



Comments?



regards,
Raghavendra.





Re: [Gluster-devel] Rebalance data migration and corruption

2016-02-07 Thread Joe Julian

Is this in current release versions?

On 02/07/2016 07:43 PM, Shyam wrote:

On 02/06/2016 06:36 PM, Raghavendra Gowdappa wrote:



- Original Message -

From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
To: "Sakshi Bansal" <saban...@redhat.com>, "Susant Palai" 
<spa...@redhat.com>
Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Nithya 
Balachandran" <nbala...@redhat.com>, "Shyamsundar

Ranganathan" <srang...@redhat.com>
Sent: Friday, February 5, 2016 4:32:40 PM
Subject: Re: Rebalance data migration and corruption

+gluster-devel



Hi Sakshi/Susant,

- There is a data corruption issue in migration code. Rebalance 
process,

   1. Reads data from src
   2. Writes (say w1) it to dst

   However, 1 and 2 are not atomic, so another write (say w2) to 
same region
   can happen between 1. But these two writes can reach dst in the 
order

   (w2,
   w1) resulting in a subtle corruption. This issue is not fixed 
yet and can
   cause subtle data corruptions. The fix is simple and involves 
rebalance

   process acquiring a mandatory lock to make 1 and 2 atomic.


We can make use of compound fop framework to make sure we don't 
suffer a
significant performance hit. Following will be the sequence of 
operations

done by rebalance process:

1. issues a compound (mandatory lock, read) operation on src.
2. writes this data to dst.
3. issues unlock of lock acquired in 1.

Please co-ordinate with Anuradha for implementation of this compound 
fop.


Following are the issues I see with this approach:
1. features/locks provides mandatory lock functionality only for 
posix-locks
(flock and fcntl based locks). So, mandatory locks will be 
posix-locks which
will conflict with locks held by application. So, if an application 
has held

an fcntl/flock, migration cannot proceed.


We can implement a "special" domain for mandatory internal locks. 
These locks will behave similar to posix mandatory locks in that 
conflicting fops (like write, read) are blocked/failed if they are 
done while a lock is held.


2. data migration will be less efficient because of an extra unlock 
(with
compound lock + read) or extra lock and unlock (for non-compound fop 
based

implementation) for every read it does from src.


Can we use delegations here? Rebalance process can acquire a 
mandatory-write-delegation (an exclusive lock with a functionality 
that delegation is recalled when a write operation happens). In that 
case rebalance process, can do something like:


1. Acquire a read delegation for entire file.
2. Migrate the entire file.
3. Remove/unlock/give-back the delegation it has acquired.

If a recall is issued from brick (when a write happens from mount), 
it completes the current write to dst (or throws away the read from 
src) to maintain atomicity. Before doing next set of (read, src) and 
(write, dst) tries to reacquire lock.


With delegations this simplifies the normal path, when a file is 
exclusively handled by rebalance. It also improves the case where a 
client and rebalance are conflicting on a file, to degrade to 
mandatory locks by either parties.


I would prefer we take the delegation route for such needs in the future.



@Soumyak, can something like this be done with delegations?

@Pranith,
Afr does transactions for writing to its subvols. Can you suggest any 
optimizations here so that rebalance process can have a transaction 
for (read, src) and (write, dst) with minimal performance overhead?


regards,
Raghavendra.



Comments?



regards,
Raghavendra.





Re: [Gluster-devel] Rebalance data migration and corruption

2016-02-06 Thread Raghavendra Gowdappa


- Original Message -
> From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> To: "Sakshi Bansal" <saban...@redhat.com>, "Susant Palai" <spa...@redhat.com>
> Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Nithya Balachandran" 
> <nbala...@redhat.com>, "Shyamsundar
> Ranganathan" <srang...@redhat.com>
> Sent: Friday, February 5, 2016 4:32:40 PM
> Subject: Re: Rebalance data migration and corruption
> 
> +gluster-devel
> 
> > 
> > Hi Sakshi/Susant,
> > 
> > - There is a data corruption issue in migration code. Rebalance process,
> >   1. Reads data from src
> >   2. Writes (say w1) it to dst
> > 
> >   However, 1 and 2 are not atomic, so another write (say w2) to same region
> >   can happen between 1. But these two writes can reach dst in the order
> >   (w2,
> >   w1) resulting in a subtle corruption. This issue is not fixed yet and can
> >   cause subtle data corruptions. The fix is simple and involves rebalance
> >   process acquiring a mandatory lock to make 1 and 2 atomic.
> 
> We can make use of compound fop framework to make sure we don't suffer a
> significant performance hit. Following will be the sequence of operations
> done by rebalance process:
> 
> 1. issues a compound (mandatory lock, read) operation on src.
> 2. writes this data to dst.
> 3. issues unlock of lock acquired in 1.
> 
> Please co-ordinate with Anuradha for implementation of this compound fop.
> 
> Following are the issues I see with this approach:
> 1. features/locks provides mandatory lock functionality only for posix-locks
> (flock and fcntl based locks). So, mandatory locks will be posix-locks which
> will conflict with locks held by application. So, if an application has held
> an fcntl/flock, migration cannot proceed.

We can implement a "special" domain for mandatory internal locks. These locks 
will behave similar to posix mandatory locks in that conflicting fops (like 
write, read) are blocked/failed if they are done while a lock is held.

> 2. data migration will be less efficient because of an extra unlock (with
> compound lock + read) or extra lock and unlock (for non-compound fop based
> implementation) for every read it does from src.

Can we use delegations here? The rebalance process can acquire a
mandatory-write-delegation (an exclusive lock with the property that the
delegation is recalled when a write operation happens). In that case the
rebalance process can do something like:

1. Acquire a read delegation for entire file.
2. Migrate the entire file.
3. Remove/unlock/give-back the delegation it has acquired.

If a recall is issued from the brick (when a write happens from a mount), it
completes the current write to dst (or throws away the read from src) to
maintain atomicity. Before doing the next set of (read, src) and (write, dst)
it tries to reacquire the lock.
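
A minimal sketch of that flow, assuming hypothetical helpers (Python
pseudocode; the SrcFile/recall machinery below is a toy model, not a gluster
or NFS delegation API):

# Sketch of the delegation-based migration loop described above. A recall can
# arrive at any time; we finish the in-flight (read, src)/(write, dst) pair,
# give the delegation back, then reacquire before continuing.

import threading

class SrcFile:
    """Toy source file that hands out a read delegation and can recall it."""
    def __init__(self, data):
        self.data = data
        self.recalled = threading.Event()
    def acquire_read_delegation(self):
        self.recalled.clear()
    def give_back_delegation(self):
        pass                                   # brick can now grant the writer

def migrate_with_delegation(src, dst, chunk=4096):
    offset = 0
    while offset < len(src.data):
        src.acquire_read_delegation()          # 1. read delegation on the file
        while offset < len(src.data):
            data = src.data[offset:offset + chunk]     # (read, src)
            dst[offset:offset + len(data)] = data      # (write, dst)
            offset += len(data)                # this pair is kept atomic
            if src.recalled.is_set():
                break                          # finish current pair, then yield
        src.give_back_delegation()             # 3. give back / unlock
        # loop around: reacquire before the next (read, src)/(write, dst)

src = SrcFile(b"x" * 20000)
dst = bytearray(len(src.data))
migrate_with_delegation(src, dst)
assert bytes(dst) == src.data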

@Soumyak, can something like this be done with delegations?

@Pranith,
Afr does transactions for writing to its subvols. Can you suggest any 
optimizations here so that rebalance process can have a transaction for (read, 
src) and (write, dst) with minimal performance overhead?

regards,
Raghavendra.

> 
> Comments?
> 
> > 
> > regards,
> > Raghavendra.
> 


Re: [Gluster-devel] Rebalance data migration and corruption

2016-02-05 Thread Raghavendra Gowdappa
+gluster-devel

> 
> Hi Sakshi/Susant,
> 
> - There is a data corruption issue in migration code. Rebalance process,
>   1. Reads data from src
>   2. Writes (say w1) it to dst
> 
>   However, 1 and 2 are not atomic, so another write (say w2) to the same
>   region can happen between 1 and 2. These two writes can reach dst in the
>   order (w2, w1), resulting in subtle corruption. This issue is not fixed yet
>   and can cause subtle data corruptions. The fix is simple and involves the
>   rebalance process acquiring a mandatory lock to make 1 and 2 atomic.
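
To spell the race out, a tiny self-contained illustration of the reordering
(plain Python; the byte arrays are only a model of src/dst, not real bricks):

# Model of the race: the migrator reads a region from src (step 1), then an
# application write w2 to the same region lands on src and dst, and only then
# does the migrator's write w1 of the now-stale data reach dst (step 2).

region = slice(0, 4)
src = bytearray(b"AAAA....")          # file being migrated
dst = bytearray(len(src))             # destination brick

w1 = bytes(src[region])               # step 1: rebalance reads old contents

# application write w2 arrives in between; it is applied to src and to dst
src[region] = b"BBBB"
dst[region] = b"BBBB"

dst[region] = w1                      # step 2: stale w1 reaches dst last

print(bytes(src[region]), bytes(dst[region]))   # b'BBBB' b'AAAA'
assert src[region] != dst[region]     # src and dst now silently disagree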

We can make use of the compound fop framework to make sure we don't suffer a
significant performance hit. The rebalance process will then do the following
sequence of operations:

1. issues a compound (mandatory lock, read) operation on src.
2. writes this data to dst.
3. issues unlock of lock acquired in 1.

Please co-ordinate with Anuradha for implementation of this compound fop.
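
For reference, a minimal sketch of that cycle per chunk (Python pseudocode;
mandatory_lock/read_src/write_dst/unlock are placeholder callables, and in the
real implementation step 1 would be a single compound (mandatory lock, read)
fop rather than two calls):

# Sketch of the per-chunk sequence above: lock the region, read it from src,
# write it to dst, then unlock, so steps 1 and 2 become atomic per chunk.

def migrate_locked(mandatory_lock, read_src, write_dst, unlock, size,
                   chunk=128 * 1024):
    offset = 0
    while offset < size:
        lk = mandatory_lock(offset, chunk)      # 1. (mandatory lock, read) ...
        data = read_src(offset, chunk)          #    ... on src
        write_dst(offset, data)                 # 2. write this data to dst
        unlock(lk)                              # 3. unlock the lock from 1
        offset += len(data)

# Toy usage against in-memory buffers, just to show the call order.
src = bytes(range(256)) * 512
dst = bytearray(len(src))
migrate_locked(
    mandatory_lock=lambda off, ln: (off, ln),
    read_src=lambda off, ln: src[off:off + ln],
    write_dst=lambda off, data: dst.__setitem__(slice(off, off + len(data)), data),
    unlock=lambda lk: None,
    size=len(src))
assert bytes(dst) == src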

Following are the issues I see with this approach:
1. features/locks provides mandatory lock functionality only for posix-locks
(flock and fcntl based locks). So, mandatory locks will be posix-locks, which
will conflict with locks held by the application. So, if an application has
held an fcntl/flock lock, migration cannot proceed.
2. data migration will be less efficient because of an extra unlock (with the
compound lock + read) or an extra lock and unlock (for a non-compound-fop based
implementation) for every read it does from src.

Comments?

> 
> regards,
> Raghavendra.