Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Josh Elser Tue, 13 Mar 2018 13:49:14 -0700

To be clear, I didn't mean for this discussion to be some snark directedat you or the serial-replication feature, Duo. You just happened to bethe first person who put up a merge-vote like this into branch-2 and itgot me thinking. I was/am not trying to throw shade at you.

Let me spin this out into a DISCUSS. I think lumping it into your VOTEthread was a bad idea (I was thinking out loud, but obviously theexpectation is that others think about it and weigh in too).


Sorry for the noise.

On 3/13/18 6:44 AM, 张铎(Duo Zhang) wrote:

I've already merged it to branch-2...

And for the procedure based replication peer modification, the feature has
been finished, at least no critical TODOs. I will not say it has no bugs
but I do not think it will block the 2.1 or 3.0 release too much.

And please trust my judgement, I'm not a man who only want to show off. For
example I just reverted the serial replication feature from branch-2 before
we release beta2 and tried to redo it on master because we found some
critical problems. And since the basic idea has not been changed so we
decided to do it on master because it will not spend too much time to
finish. And for HBASE-19064 it is a big feature so we decided to do it on a
feature branch. We will select the branches which we want to merge back at
the time we finish the feature.

For the CP cleanup, I think the problem is we started too late, and in the
past we made mistakes and let the related projects inject too much into the
internal of HBase. And for B&R, it is something like the serial replication
feature. For serial replication feature is that we tested it and found some
critical problems, and for B&R it was not well tested(IIRC) so we were
afraid of there will be critical problems. And for the AMv2, I think the
problem is premature optimization. We even implemented our own AvlTree, and
also a very complicated scheduler which always makes us dead lock...

These are all very good cases for software engineering in the real world...
Should be avoided in the future... That's also my duty...

Thanks.

2018-03-13 17:50 GMT+08:00 Apekshit Sharma <[email protected]>:

I thought the problem with releasing 2.0 was that there were too many open
features dragging along not coming to finish- AMv2  (biggest one), CP
cleanup, B&R, etc.
If it was not the case, it wouldn't have been 1 year between first alpha
and GA, rather just few months.

That said, i don't mind new but hardened features which are already ready
to ship (implementation only or not) going in minor version, but that's my
personal opinion. But going too aggressive on that can indeed lead to the
trap Josh mentioned above.

For this particular change, my +1 was based on following aspects:
-  it's internal
- moving ops procv2 framework (gives failure recovery, locking, etc)
- Duo's reasoning - "....the synchronous guarantee is really a good thing
for our replication related UTs..." and "For the replication peer tracking,
it is the same problem. It is hard to do fencing with zk watcher since it
is asynchronous, so the UTs are always kind of flakey in theoretical."

That said, it's pending review.
@Duo: As you know it's not possible to spend cycles on it right now -
pending 2.0GA - can you please hold it off for few weeks (ideally, until GA
+ 2-3 weeks) which will give community (whoever interested, at least
me..smile) a decent change to review it.
Thanks

-- Appy


On Tue, Mar 13, 2018 at 1:02 AM, Josh Elser <[email protected]> wrote:

(Sorry for something other than just a vote)

I worry seeing a "big" feature branch merge as us falling back into the
1.x trap. Starting to backport features into 2.x will keep us delaying

3.0

as we have less and less incentive to push to release 3.0 in a timely
manner.

That said, I also don't want my worries to bar a feature which appears to
be related to implementation only (based on one high-level read of the
changes). Perhaps we need to re-think what is allowable for a Y release

in

x.y.z...

+1 for master (which already happened, maybe?)
+0 for branch-2 (simply because I haven't looked closely enough at
changes, can read through and try to change to +1 if you need the votes)


On 3/9/18 2:41 AM, 张铎(Duo Zhang) wrote:

Since branch-2.0 has been cut and branch-2 is now 2.1.0-SNAPSHOT, will
merge branch HBASE-19397-branch-2 back to branch-2.

2018-01-10 9:20 GMT+08:00 张铎(Duo Zhang) <[email protected]>:

If branch-2.0 will be out soon then let's target this to 2.1. No

problem.


Thanks.

2018-01-10 1:28 GMT+08:00 Stack <[email protected]>:

On Mon, Jan 8, 2018 at 8:19 PM, 张铎(Duo Zhang) <[email protected]>

wrote:

OK, let me merge it master first. And then create a

HBASE-19397-branch-2

which will keep rebasing with the newest branch-2 to see if it is
stable
enough. Since we can define this as a bug fix/refactoring rather

than a

big

new feature, it is OK to integrate it at any time. If we think it is

stable

enough before cutting branch-2.0 then we can include it in the 2.0.0
release, else let's include it in 2.1(Maybe we can backport it to 2.0
later?).

I need to cut the Appy-suggested branch-2.0. Shout if
HBASE-19397-branch-2
gets to be too much work and I'll do it sooner rather than later. Or,

if

easier on you, just say and I'll make the branch-2.0 now so you can

just

commit to branch-2 (branch-2.0 will become hbase2.0, branch-2 will
become
hbase2.1...).

St.Ack




Thanks all here.


2018-01-09 12:06 GMT+08:00 ashish singhi <[email protected]>:

+1 to merge on master and 2.1.

Great work.

Thanks,
Ashish

-----Original Message-----
From: 张铎(Duo Zhang) [mailto:[email protected]]
Sent: Tuesday, January 09, 2018 6:53 AM
To: [email protected]
Subject: Re: [VOTE] Merge branch HBASE-19397 back to master and

branch-2.

Anyway, if no objections on merging this into master, let's do it?

So

that

we can start working on the follow-on features, such as table based
replication storage, and synchronous replication, etc.

Thanks.

2018-01-09 7:19 GMT+08:00 Stack <[email protected]>:

On Mon, Jan 8, 2018 at 2:50 PM, 张铎(Duo Zhang) <

[email protected]>

wrote:


This 'new' feature only changes DDL part, not the core part of

replication,

i.e, how to read wal entries and how to replicate it to the remote

cluster,

etc. And also there is no pb message/storage layout change, you

can

think of this as a big refactoring. Theoretically we even do not

need to add

new

UTs for this feature, i.e, no extra stability works. The only
visible change to users is that it may require more time on
modifying peers in shell. So in general I think it is less hurt to
include it in the coming release?

And why I think it SHOULD be included in our 2.0 release is that,
the synchronous guarantee is really a good thing for our

replication

related UTs. The correctness of the current Test***Replication

usually depends

on a

flakey condition - we will not do a log rolling between the
modification

on

zk and the actual loading of the modification on RS. And we have
also

done

a hard work to cleanup the lockings in ReplicationSourceManager

and

add a fat comment to say why it should be synchronized in this

way.

In general, the new code is much easier to read, test and debug,

and

also reduce the possibility of flakeyness, which could help us a

lot

when we start to stabilize our build.


Thanks.


You see it as a big bug fix Duo?


Kind of. Just like the AM v1, we can do lots of fix to make it more

stable, but we know that we can not fix all the problems since we

store

state in several places and it is a 'mission impossible' to make all

the

states stay in sync under every situation... So we introduce AM v2.

For the replication peer tracking, it is the same problem. It is

hard

to

do fencing with zk watcher since it is asynchronous, so the UTs are

always

kind of flakey in theoretical. And we depend on replication heavily

at

Xiaomi, it is always a pain for us.

I'm late to review. Will have a look after beta-1 goes out. This

stuff

looks great from outside, especially distributed procedure framework

which we need all over the place.

In general I have no problem w/ this in master and an hbase 2.1

(2.1

could be soon after 2.0). Its late for big stuff in 2.0.... but let

me

take a looksee sir.


Thanks sir. All the concerns here about whether we should merge

this

into

2.0 are reasonable, I know. Although I really want this in 2.0

because I

believe it will help a lot for stabilizing,  I'm OK with merge it to

2.1

only if you guys all think so.

St.Ack





2018-01-09 4:53 GMT+08:00 Apekshit Sharma <[email protected]>:


Same questions as Josh's.

1) We have RCs for beta1 now, which means only commits that can

go

in

are

bug fixes only. This change - although important, needed from

long

time

and

well done (testing, summary, etc) - seems rather very large to

get

into

2.0

now. Needs good justification why it has to be 2.1 instead of

2.0.

-- Appy


On Mon, Jan 8, 2018 at 8:34 AM, Josh Elser <[email protected]>

wrote:

-0 From a general project planning point-of-view (not based on

the technical merit of the code) I am uncomfortable about
pulling in a

brand

new feature after we've already made one beta RC.


Duo -- can you expand on why this feature is so important that
we

should

break our release plan? Are there problems that would make

including

this

in a 2.1/3.0 unnecessarily difficult? Any kind of color you

can

provide

on

"why does this need to go into 2.0?" would be helpful.


On 1/6/18 1:54 AM, Duo Zhang wrote:

https://issues.apache.org/jira/browse/HBASE-19397


We aim to move the peer modification framework from zk

watcher

to procedure

v2 in this issue and the work is done now.

Copy the release note here:

Introduce 5 procedures to do peer modifications:

AddPeerProcedure

RemovePeerProcedure
UpdatePeerConfigProcedure
EnablePeerProcedure
DisablePeerProcedure

The procedures are all executed with the following stage:
1. Call pre CP hook, if an exception is thrown then give up

2.

Check whether the operation is valid, if not then give up 3.

Update peer storage. Notice that if we have entered this
stage,

then

we

can not rollback any more.
4. Schedule sub procedures to refresh the peer config on

every

RS.

5. Do post cleanup if any.

6. Call post CP hook. The exception thrown will be ignored
since we

have

already done the work.


The procedure will hold an exclusive lock on the peer id, so
now

there

is

no concurrent modifications on a single peer.


And now it is guaranteed that once the procedure is done,

the

peer modification has already taken effect on all RSes.


Abstracte a storage layer for replication peer/queue
manangement,

and

refactored the upper layer to remove zk related

naming/code/comment.

Add pre/postExecuteProcedures CP hooks to
RegionServerObserver, and

add

permission check for executeProcedures method which requires

the

caller

to

be system user or super user.

On rolling upgrade: just do not do any replication peer

modifications

during the rolling upgrading. There is no pb/layout changes

on

the peer/queue storage on zk.



And there are other benefits.

First, we have introduced a general procedure framework to

send

tasks

to

RS

and report the report back to Master. It can be used to
implement

other

operations such as ACL change.

Second, zk is used as a external storage now since we do not
depend

on

zk

watcher any more, it will be much easier to implement a

'table

based'

replication peer/queue storage.


Please vote:
[+1] Agree
[-1] Disagree
[0] Neutral

Thanks.


--

-- Appy



--

-- Appy

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Reply via email to