Re: Usefulness of ensemble change during recovery

2018-08-13 Thread Ivan Kelly
Yup, we had already concluded we need the ensemble change for some
cases. The code didn't turn out as messy as I'd feared, though (I don't
think I've pushed it yet).

-Ivan
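
A minimal sketch of the ensemble-change idea, using hypothetical types and
names (an illustration only, not Ivan's patch or the real BookKeeper client
API): on a write failure during recovery, the failed bookie is swapped out
of the current ensemble so the remaining tail entries can be rewritten to
healthy bookies before the close.

    import java.util.ArrayList;
    import java.util.List;

    class RecoveryEnsembleSketch {
        // Hypothetical stand-in for a bookie address.
        record Bookie(String address) {}

        // Return a copy of the ensemble with the failed bookie swapped out,
        // so recovery can keep rewriting tail entries to a healthy ensemble.
        static List<Bookie> replaceBookie(List<Bookie> ensemble,
                                          Bookie failed,
                                          Bookie replacement) {
            List<Bookie> next = new ArrayList<>(ensemble);
            int idx = next.indexOf(failed);
            if (idx < 0) {
                throw new IllegalArgumentException(failed + " not in ensemble");
            }
            next.set(idx, replacement);
            return next;
        }

        public static void main(String[] args) {
            List<Bookie> ensemble = List.of(
                    new Bookie("bookie1:3181"),
                    new Bookie("bookie2:3181"),
                    new Bookie("bookie3:3181"));
            // bookie2 fails mid-recovery: swap in bookie4 and keep writing.
            System.out.println(replaceBookie(ensemble,
                    new Bookie("bookie2:3181"), new Bookie("bookie4:3181")));
        }
    }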

On Mon, Aug 13, 2018 at 8:29 PM, Sam Just  wrote:
> To flesh out JV's point a bit more, suppose we've got a 5/5/4
> (ensemble/write quorum/ack quorum) ledger which needs to be recovery
> opened.  In such a scenario, suppose the last entries on the 5 bookies
> (no holes) are 10,10,10,10,19.  Any entry in [10,19]
> is valid as the end of the ledger, but the safest answer for the end of the
> ledger is really 10 here -- 11-19 cannot have been ack'd to the client and
> we have 5 copies of 0-10, but only 1 of 11-19.  Currently, a client
> performing a recovery open on this ledger which is able to talk to all 5
> bookies will read and rewrite up to 19 ensuring that at least 4 bookies end
> up with 11-19.  I'd argue that rewriting the entries in that case is
> important if we want to let 19 be the end of the ledger because once we
> permit a client to read 19, losing that single copy would genuinely be data
> loss.  In that case, it happens that we have enough information to mark 10
> as the end of the ledger, but if the client performing recovery open has
> access only to bookies 3 and 4, it would be forced to conclude that 19
> could be the end of the ledger.  In that case, if we want to avoid exposing
> entries which have been written to fewer than aQ bookies, we really
> do have to either
> 1) do an ensemble change and write out the tail entries of the ledger to a
> healthy ensemble
> 2) fail the recovery open
>
> I'd therefore argue that repairing the tail of the ledger -- with an
> ensemble change if necessary -- is actually required to allow readers to
> access the ledger.
> -Sam
>
> On Mon, Aug 6, 2018 at 9:27 AM Venkateswara Rao Jujjuri 
> wrote:
>
>> I don't think it's a good idea to leave the tail to replication.
>> This could lead to the perception of data loss, and it's more evident in
>> the case of larger WQ and disparity with AQ.
>> If we determine the LLAC based on having 'a copy', which was never
>> acknowledged to the client, and if that bookie goes down (or crashes and
>> burns) before the replication worker gets a chance, it gives the illusion
>> of data loss. Moreover, we have no way to distinguish real data loss from
>> this scenario, where we never acknowledged the client.
>>
>>
>> On Mon, Aug 6, 2018 at 12:32 AM, Sijie Guo  wrote:
>>
>> > On Mon, Aug 6, 2018 at 12:08 AM Ivan Kelly  wrote:
>> >
>> > > >> Recovery operates on a few seconds of data (from the last LAC
>> > > >> written to the end of the ledger, call this LLAC).
>> > > >
>> > > > the data during this duration can be very large if the traffic of
>> > > > the ledger is large. That has been observed in Twitter's
>> > > > production. So when we are talking about "a few seconds of data",
>> > > > we can't assume the amount of data is little. That says the
>> > > > recovery can be taking more time than
>> > >
>> > > Yes, it can be large, but still it is only a few seconds' worth of
>> > > data. It is the amount of data that can be transmitted in the period
>> > > of one roundtrip, as the next roundtrip will update the LAC.
>> >
>> >
>> > > I didn't mean to imply the data was small. I was implying that the
>> > > data was small in comparison to the overall size of that ledger.
>> >
>> >
>> > > > what we can expect. So if we don't handle failures during
>> > > > recovery, how are we able to ensure we have enough data copies
>> > > > during recovery?
>> > >
>> > > Consider an e3w3a2 ledger; there are two cases where you can lose a
>> > > bookie during recovery.
>> > >
>> > > Case one: one bookie is lost. You can still recover, as ack=2 is
>> > > still available.
>> > > Case two: two bookies are lost. You can't recover, but the ledger is
>> > > unavailable anyhow, since any entry in the ledger may only have been
>> > > replicated to 2.
>> > >
>> > > However, with e3w3a3 I guess you wouldn't be able to recover at all,
>> > > and we have to handle that case.
>> > >
>> > > > I am not sure "make ledger metadata immutable" == "getting rid of
>> > > > merging ledger metadata", because I don't think these are the same
>> > > > thing. Making ledger metadata immutable will make the code much
>> > > > clearer and simpler, because the ledger metadata is immutable.
>> > > > However, getting rid of merging ledger metadata is a different
>> > > > thing; when you make ledger metadata immutable, it will help make
>> > > > merging ledger metadata on conflicts clearer.
>> > >
>> > > I wouldn't call it merging in this case.
>> >
>> >
>> > That's fine.
>> >
>> >
>> > > Merging implies taking two
>> > > valid pieces of metadata and getting another usable, valid metadata
>> > > from them.
>> > > What happens with immutable metadata is that you are taking one
>> > > valid metadata and applying operations to it. So in the
>> > > failure-during-recovery case, we would have a list of AddEnsemble
>> > > operations which we add when we try to close.
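
A minimal sketch of that immutable-metadata approach, with hypothetical
types (not the actual BookKeeper classes): every change yields a new
metadata value, and the AddEnsemble operations collected during recovery
are applied in one pass when the close is attempted.

    import java.util.ArrayList;
    import java.util.List;

    class ImmutableMetadataSketch {
        // Hypothetical immutable metadata: mutating methods return copies.
        record Metadata(List<List<String>> ensembles, boolean closed) {
            Metadata addEnsemble(List<String> ensemble) {
                List<List<String>> next = new ArrayList<>(ensembles);
                next.add(ensemble);
                return new Metadata(List.copyOf(next), closed);
            }
            Metadata close() {
                return new Metadata(ensembles, true);
            }
        }

        public static void main(String[] args) {
            Metadata base =
                    new Metadata(List.of(List.of("b1", "b2", "b3")), false);

            // Ensemble changes recorded while recovery is running...
            List<List<String>> pendingAddEnsembles =
                    List.of(List.of("b1", "b2", "b4")); // b3 failed

            // ...are applied only when we try to close the ledger.
            Metadata toClose = base;
            for (List<String> e : pendingAddEnsembles) {
                toClose = toClose.addEnsemble(e);
            }
            System.out.println(toClose.close());
        }
    }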

Re: Usefulness of ensemble change during recovery

2018-08-13 Thread Sam Just
To flesh out JV's point a bit more, suppose we've got a 5/5/4
(ensemble/write quorum/ack quorum) ledger which needs to be recovery
opened.  In such a scenario, suppose the last entries on the 5 bookies
(no holes) are 10,10,10,10,19.  Any entry in [10,19]
is valid as the end of the ledger, but the safest answer for the end of the
ledger is really 10 here -- 11-19 cannot have been ack'd to the client and
we have 5 copies of 0-10, but only 1 of 11-19.  Currently, a client
performing a recovery open on this ledger which is able to talk to all 5
bookies will read and rewrite up to 19 ensuring that at least 4 bookies end
up with 11-19.  I'd argue that rewriting the entries in that case is
important if we want to let 19 be the end of the ledger because once we
permit a client to read 19, losing that single copy would genuinely be data
loss.  In that case, it happens that we have enough information to mark 10
as the end of the ledger, but if the client performing recovery open has
access only to bookies 3 and 4, it would be forced to conclude that 19
could be the end of the ledger.  In that case, if we want to avoid exposing
entries which have been written to fewer than aQ bookies, we really
do have to either
1) do an ensemble change and write out the tail entries of the ledger to a
healthy ensemble
2) fail the recovery open

I'd therefore argue that repairing the tail of the ledger -- with an
ensemble change if necessary -- is actually required to allow readers to
access the ledger.
-Sam
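
A worked form of the arithmetic in the example above, as a hypothetical
helper (not BookKeeper's actual recovery code), assuming responses were
received from at least aQ bookies: the highest entry that can safely be
exposed is the aQ-th largest "last entry" id among the responses.

    import java.util.Arrays;

    class SafeEndOfLedgerSketch {
        // Largest entry id held by at least ackQuorum responding bookies;
        // anything above it cannot have been ack'd to the writer.
        static long safeLastEntry(long[] lastEntryPerBookie, int ackQuorum) {
            long[] sorted = lastEntryPerBookie.clone();
            Arrays.sort(sorted); // ascending
            return sorted[sorted.length - ackQuorum]; // aQ-th largest value
        }

        public static void main(String[] args) {
            // The 5/5/4 example: last entries 10,10,10,10,19 -> prints 10.
            System.out.println(
                    safeLastEntry(new long[]{10, 10, 10, 10, 19}, 4));
            // With responses from only bookies 3 and 4 (10 and 19), fewer
            // than aQ, 19 cannot be ruled out, so recovery must repair up
            // to 19 or fail the open, as argued above.
        }
    }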

On Mon, Aug 6, 2018 at 9:27 AM Venkateswara Rao Jujjuri 
wrote:

> I don't think it's a good idea to leave the tail to replication.
> This could lead to the perception of data loss, and it's more evident in
> the case of larger WQ and disparity with AQ.
> If we determine the LLAC based on having 'a copy', which was never
> acknowledged to the client, and if that bookie goes down (or crashes and
> burns) before the replication worker gets a chance, it gives the illusion
> of data loss. Moreover, we have no way to distinguish real data loss from
> this scenario, where we never acknowledged the client.
>
>
> On Mon, Aug 6, 2018 at 12:32 AM, Sijie Guo  wrote:
>
> > On Mon, Aug 6, 2018 at 12:08 AM Ivan Kelly  wrote:
> >
> > > >> Recovery operates on a few seconds of data (from the last LAC
> > > >> written to the end of the ledger, call this LLAC).
> > > >
> > > > the data during this duration can be very large if the traffic of
> > > > the ledger is large. That has been observed in Twitter's
> > > > production. So when we are talking about "a few seconds of data",
> > > > we can't assume the amount of data is little. That says the
> > > > recovery can be taking more time than
> > >
> > > Yes, it can be large, but still it is only a few seconds' worth of
> > > data. It is the amount of data that can be transmitted in the period
> > > of one roundtrip, as the next roundtrip will update the LAC.
> >
> >
> > > I didn't mean to imply the data was small. I was implying that the
> > > data was small in comparison to the overall size of that ledger.
> >
> >
> > > > what we can expect. So if we don't handle failures during
> > > > recovery, how are we able to ensure we have enough data copies
> > > > during recovery?
> > >
> > > Consider an e3w3a2 ledger; there are two cases where you can lose a
> > > bookie during recovery.
> > >
> > > Case one: one bookie is lost. You can still recover, as ack=2 is
> > > still available.
> > > Case two: two bookies are lost. You can't recover, but the ledger is
> > > unavailable anyhow, since any entry in the ledger may only have been
> > > replicated to 2.
> > >
> > > However, with e3w3a3 I guess you wouldn't be able to recover at all,
> > > and we have to handle that case.
> > >
> > > > I am not sure "make ledger metadata immutable" == "getting rid of
> > > > merging ledger metadata", because I don't think these are the same
> > > > thing. Making ledger metadata immutable will make the code much
> > > > clearer and simpler, because the ledger metadata is immutable.
> > > > However, getting rid of merging ledger metadata is a different
> > > > thing; when you make ledger metadata immutable, it will help make
> > > > merging ledger metadata on conflicts clearer.
> > >
> > > I wouldn't call it merging in this case.
> >
> >
> > That's fine.
> >
> >
> > > Merging implies taking two
> > > valid pieces of metadata and getting another usable, valid metadata
> > > from them.
> > > What happens with immutable metadata is that you are taking one
> > > valid metadata and applying operations to it. So in the
> > > failure-during-recovery case, we would have a list of AddEnsemble
> > > operations which we add when we try to close.
> > >
> > > In theory this is perfectly valid and clean. It just can look messy in
> > > the code, due to how the PendingAddOp reaches back into the ledger
> > > handle to get the current ensemble.
> > >
> >
> > That's okay, since it is a reality we have to face anyway. But the
> most
> > 
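
A minimal sketch of the quorum argument in Ivan's e3w3a2 example above, as
a hypothetical helper (a deliberate simplification of the full recovery
protocol): recovery writes still need aQ acks, so recovery can proceed
only while at least aQ bookies of the write ensemble remain available.

    class RecoverabilitySketch {
        // Recovery survives bookie loss while a full ack quorum remains.
        static boolean canRecover(int writeQuorum, int ackQuorum,
                                  int lostBookies) {
            return writeQuorum - lostBookies >= ackQuorum;
        }

        public static void main(String[] args) {
            System.out.println(canRecover(3, 2, 1)); // e3w3a2, 1 lost: true
            System.out.println(canRecover(3, 2, 2)); // e3w3a2, 2 lost: false
            System.out.println(canRecover(3, 3, 1)); // e3w3a3, 1 lost: false
        }
    }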

Re: Dropping 'stream' profile

2018-08-13 Thread Ivan Kelly
+1 for dropping the profiles.

On Mon, Aug 13, 2018 at 12:24 AM, Sijie Guo  wrote:
> I have no problem with this proposal. I am fine with dropping the profiles.
>
> Sijie
>
> On Sun, Aug 12, 2018 at 2:53 AM Enrico Olivelli  wrote:
>
>> Hi,
>> Currently, in order to build the full code you have to add the -Dstream
>> property, which in turn activates the 'stream' profile.
>> Additionally, to run the tests in the 'stream' submodule you also have
>> to add -DstreamTests.
>>
>> This is very annoying, and now that we are going to release the 'stream'
>> storage module as a first-class citizen it does not make much sense.
>>
>> This additional profile complicates project-wide operations like the
>> release procedure.
>> For instance, I broke the master branch yesterday because I did not
>> advance the version in the poms of the stream submodule.
>>
>> It is causing a lot of problems with code coverage as well, because we
>> have a very complex surefire configuration.
>>
>> My proposal is to drop those profiles and let the stream module be built
>> together with the other parts.
>>
>>
>> For those like me who work only on bookkeeper-server, this change won't
>> affect everyday work.
>>
>> I would prefer that Sijie make this change, as he introduced those
>> profiles and knows all the tricks very well.
>>
>> Regards
>> Enrico
>> --
>> Enrico Olivelli
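
For concreteness, the difference in day-to-day commands would look roughly
like this (assuming a standard Maven build from the repository root; the
-Dstream and -DstreamTests flags are the ones described above):

    # today: building and testing the stream module needs extra flags
    mvn clean install -Dstream -DstreamTests

    # after the proposed change: one plain build covers everything
    mvn clean install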