[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-28 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
FYI: The test failure looks like unrelated to this change. 
https://travis-ci.org/apache/storm/jobs/359434070
This build has same failure but the build is triggered before this patch is 
merged in.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-28 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
Late +1.

FYI: The test looks like failing intermittently, not consistently. I've 
seen succeed build in my fork.
https://travis-ci.org/HeartSaVioR/storm/builds/359567639

We may want to file a new issue and investigate the issue. @revans2 What do 
you think?


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-28 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
I am +1, but a little nervous that the tests are failing consistently on 
travis in exactly the same way, but never on my laptop, but I think we can work 
that out later if it is a real issue.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-28 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR
I have applied you suggestions and it worked as desired, thx very much.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-28 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 
You may also want to make sure the squashed commit contains full diff via 
comparing contents of commit with PR diff. Once line count goes similar and UTs 
/ manual tests work it will be safe.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-28 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 
There's a tip regarding supported features on Github: you can get a patch 
file which contains commits or a full diff file. Just add ".patch" for former, 
or add ".diff" for latter, and Github will serve the file.

Since we just want to have a squashed commit, you can follow below steps to 
push effectively squashed commit: 
1. create a new branch based on latest master branch
2. download diff file via `wget 
https://github.com/apache/storm/pull/2433.diff` or curl or whatever
3. apply diff via `git apply 2433.diff` (after 3 you can see modified files 
as well as untracked files)
4. `mvn clean install -Prat,all-tests` to run tests (mandatory)
5. (optional, but recommended) build binary dist (`mvn clean package` in 
storm-dist/binary) and do manual tests with binary dist as well
6. if everything is OK, add files to the stage and commit the change with 
proper commit log message

Please let me know if the tip doesn't work.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-28 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@revans2 @HeartSaVioR 
I have merged master in and fixed the conflicts.

I tried to rebase based on master but there are too much conflicts and it's 
even harder to pick out which code segment is what i want because the patch 
review goes to long time, can we just keep what it is ?


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-27 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 the changes look good, the conflicts are minimal and I think the 
test failures are spurious.  I am +1 for merging this in.  Please rebase/squash 
the commits, resolve the minor conflicts and I will be happy to merge it in.

I really would like to have someone else give a +1 for the patch too as I 
made some of the changes myself.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-27 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 Sorry about how long this has taken.  I am back from vacation 
now. I will take a look at the patch again, and if the conflicts are small 
hopefully we can merge it in today or tomorrow.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-27 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 
I assume @revans2 is on vacation, so you may want to take your time to 
handle other tasks for now, and back to this issue when @revans2 is back. Looks 
like he would like to understand the code more deeper, since this touches 
across many parts of core thing. 

Thanks for your great efforts and patience!


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-22 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@revans2 
Hi, is threre anything i can do now for this patch?


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-13 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@revans2
I have fixed the nits problems and merge your patch in.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-12 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@revans2 
Really thx, i will fix the nits problems as soon as possible, then i will 
merge this patch in.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-12 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 I created https://github.com/danny0405/storm/pull/2 to add in 
authorization support for the supervisor and the new nimbus APIs.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-12 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
OK I'll try to put up another pull request to cover basic auth for the 
supervisor.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-08 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@revans2
Really thx for your contributions and refactoring work, i will follow in 
timely if i have time.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-08 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
I decided against refactoring here.  The changes I want to make go a lot 
deeper than just this patch.  and this patch is following the conventions in 
the code already.  I'll spend some time reviewing this and hopefully we can get 
this patch in soon.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-07 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
I also noticed that the unit tests are taking much much longer to run with 
this patch.  33 mins vs 18 without it.  I am going to spend some time 
investigating that too.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-07 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
I just put up a pull request to make it work with worker tokens.

https://github.com/danny0405/storm/pull/1

I want to spend some more time looking at the code though before merging it 
in.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-06 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@revans2 Sure! Thanks for taking this up.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-06 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR @danny0405 

Sorry about my delay on this.  I got pulled off onto some other things for 
the past week or so.  I will start working on a patch to integrate this with 
the worker token code that we already merged in, but it might take a few days 
to get it done.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-03 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@revans2 
I'm also not familiar with newly added worker token feature, and you said 
we need to apply token auth to supervisor as well.
Could you apply worker token feature to this patch so that we could test 
the final patch and merge this in? Thanks in advance!


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-03-03 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR @revans2 
I have merged in master branch and fix the conflicts, but i'm not very 
familiar with the authorization thing. So could you please help to make some 
contribution ?

Thx very much in advance.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-02-26 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 Yeah it was holiday in South Korea as well but much shorter than 
China. I just would wanted to check you still want to follow up with this patch.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-02-26 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR 
Ok, i will start my work as soon as possible, this PR is delayed because i 
was in holiday of the Chinese Spring Festival, and now i'm back to work.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-02-24 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 Kindly reminder.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-02-20 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 
Updated/rebased my patch for supporting old version topology. Manually 
tested with 1.2.1-rc topology.
(I needed STORM-2965 to run the topology smoothly but the patch can be 
applied separately.)


https://github.com/HeartSaVioR/storm/tree/heartbeats-promotion-v2-support-old-workers

https://github.com/HeartSaVioR/storm/commit/f808e53f91835bab081dd0e02e4ed4ae14f164d1


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-02-08 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 
Now Bobby's patch for STORM-2898 is merged. Could you start working on 
follow-up work based on STORM-2898? Thanks in advance!


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-25 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR @revans2 
I have add support for multiple supervisors for one node.
Now for supervisor change:

1. The supervisor should be started with a specified port.
2. All the workers will be started with the port of its parent supervisor 
as a start arg.
3. Supervisor will report port -info together with SupervisorInfo to ZK.
4. Nimbus will get the supervisor port-info from reported SupervisorInfos.

Really thx for your nice review work!


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-24 Thread erikdw
Github user erikdw commented on the issue:

https://github.com/apache/storm/pull/2433
  
@revans2 : thanks, please just loop us in and we'll carve out time to 
provide feedback.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-24 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR thanks for putting up the reference to my patch for 
STORM-2898.  Sorry I have not been responding for the past few days.  I locked 
myself in a room trying to get it done ASAP, and have been ignoring e-mail.

I plan next to start working on the storm on mesos/yarn/open stack/... API 
changes for 2.x.  the issue that we have been running into is that the APIs 
defined in INimbus and ISupervisor are not really that clear on how they are 
used or how they should be used.  So even though we don't change these APIs we 
change the semantics around how these APIs are used and end up breaking things.

I will file a JIRA soon hopefully with a design on how I want to address 
this, but I am going to need a lot of feedback from @erikdw and @JessicaLHartog 
to make sure the API changes I am proposing will actually be better.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-24 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 @revans2 just posted the patch of STORM-2898: #2531. Please 
participate reviewing when you have free time. Thanks in advance.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-23 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 Thanks for your thoughtful review. Updated the branch with 
commit amended.

https://github.com/HeartSaVioR/storm/commit/9cc8a1540985d350ede4738a38173df47ae99a32


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-23 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR
Really thx for your nice work, i will apply your patch and do some tests.

Also i'm doing add Supervisor heartbeat port to `Worker.java` as start args 
and `SupervisorHeartbeat` to support multi-supervisors for one node.



---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-23 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
Here's patch for supporting old ZK mechanism.


https://github.com/HeartSaVioR/storm/commit/79aa12b78591aa04c1ef8855032859318a83185d

Didn't test manually, just for early review.

@danny0405 Could you take a look at and pull if it's OK? It would be nice 
if you have time to do some manual tests, but I'll try to spend time to get it, 
so don't worry if you don't have time. No need to preserve authorship when 
pulling.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-14 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@erikdw
> For storm-on-mesos, the Supervisor + Workers run in a per-topology 
container on each host. The > Supervisor is the container's init process, so if 
it dies then the Workers die with it. So the
> problematic scenario you outlined doesn't exist for that use-case.

This sounds great, maybe a specified port is a better choice compared to 
range of ports.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-14 Thread erikdw
Github user erikdw commented on the issue:

https://github.com/apache/storm/pull/2433
  
/quote My consideration is the workers started by parent supervisor, how do 
the workers know their parent supervisor's heartbeat/assignment port? If i 
passed it as starting argument, the port will be invalid if its parent 
supervisor collapse or restarts.

For storm-on-mesos, the Supervisor + Workers run in a per-topology 
container on each host.  The Supervisor is the container's init process, so if 
it dies then the Workers die with it.  So the problematic scenario you outlined 
doesn't exist for that use-case.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-14 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR @erikdw 
Yeah, `-c supervisor.thrift.port=` will works great for starting 
supervisor on Mesos.

My consideration is the workers started by parent supervisor, how do the 
workers know it's parent heartbeat port? If i passed it as starting argument, 
the port will be invalid if its parent supervisor collapse or restarts.

So who will maintain the consistency of heartbeats port for workers' parent 
supervisor ?



---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-12 Thread erikdw
Github user erikdw commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR : really @revans2 noticed the change's implications for 
storm-on-mesos, so he should get the credit. :-)  He's rightly suggested that 
we create some tests to codify everything that might break us -- that is 
understandably difficult, but I'm collecting a list of things we'll need to 
test for.

And awesome, `-c supervisor.thrift.port=` will work great for us, 
thanks for confirming!  (I was able to look through the `bin/storm.py` code and 
see that the `-c/--config` handling ends up putting elements into 
`-Dstorm.options=` for any executed storm java class.)


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-12 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@erikdw 

OK, sorry @JessicaLHartog for misunderstanding, and thanks for noticing us 
to not making us unintentionally breaking the thing again.

> Is it possible to specify this setting (supervisor.thrift.port) as a CLI 
parameter to the supervisor as it is launched?

Unless there's no bug on storm binary script, `-c 
supervisor.thrift.port=` should work as same as it is described in 
storm.yaml.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-12 Thread erikdw
Github user erikdw commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR : 

> By the way, I was not aware of the discussion in storm-mesos, so don't 
know which works should be done in Storm side, and how these are coupled with 
"this issue". Maybe better to only pointing out relevant things, and discuss in 
dev. mailing list if necessary for things out of topic.

I think you're misinterpreting [Jessica's 
comments](https://github.com/apache/storm/pull/2433#issuecomment-357140893).  
She wasn't trying to bog down this discussion with an attempt to resolve the 
other breakages of storm-on-mesos, I believe she was just providing some 
context to Bobby's early 
statement](https://github.com/apache/storm/pull/2433#issuecomment-356661822) 
about storm-on-mesos being broken a few times by changes like this in 
storm-core.  Furthermore, I believe she was also clarifying that storm-on-mesos 
is currently *totally* broken (as explained in the context she provided) in the 
1.1+ branches, so this proposed change technically is not *breaking* 
storm-on-mesos, since it's already broken.  Our goal is simply to prevent even 
more breaking changes.

> As far as I understand in your comment, only concern with this issue is 
specifying Supervisor's thrift port, which shouldn't be random in range but 
just using specified port. If I understand correctly, the patch already does 
that (via configuration), and storm-mesos could launch Supervisor instance with 
overriding supervisor thrift port. Makes sense?

That has potential to work -- can you please clarify something though?  Is 
it possible to specify this setting (`supervisor.thrift.port`) as a CLI 
parameter to the supervisor as it is launched?  If that works then awesome, 
that means the requirement Jessica outlined is already satisfied since we can 
simply specify that setting when we launch each supervisor.  However, if the 
option instead must be in the storm configuration yaml file, then it is 
insufficient.  That is because storm-on-mesos *must* be able to have different 
ports for every supervisor, but every supervisor shares the same config file in 
storm-on-mesos.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-12 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 
TODO 2: That's same as my understanding. This patch becomes depending on 
https://issues.apache.org/jira/browse/STORM-2898. If you are familiar with 
security please take a look at the issue and comment as well.

TODO 3: Yes I'm willing to do it, but sadly I have some higher priority and 
even urgent works outside of Storm project, so I can't promise I will start it 
in this month. 
(There's a holiday week in China and Korea in next month btw.) 

Btw, I'm not sure but I guess STORM-2898 would take even couple of months 
in worst case, so I guess it is OK.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-12 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@JessicaLHartog 
Thanks for noticing! By the way, I was not aware of the discussion in 
storm-mesos, so don't know which works should be done in Storm side, and how 
these are coupled with this issue. 

As far as I understand in your comment, only concern with this issue is 
specifying Supervisor's thrift port, which shouldn't be random in range but 
just using specified port. If I understand correctly, the patch already does 
that (via configuration), and storm-mesos could launch Supervisor instance with 
overriding supervisor thrift port. Makes sense?

cc. @danny0405 FYI.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread JessicaLHartog
Github user JessicaLHartog commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 @revans2 Thanks for keeping the Storm on Mesos project in mind. 
We (@erikdw, @srishtyagrawal, and myself) appreciate it!

Right now, Storm on Mesos doesn't work for Storm versions 1.1.x and above 
because we rely on the slot-centric scheduling instead of the now 
supervisor-centric scheduling (for more details see 
[STORM-2126](https://issues.apache.org/jira/browse/STORM-2126)). We talked to 
@revans2 about this in our 
issue[#222](https://github.com/mesos/storm/issues/222#issuecomment-352514556)...
 so that's a bit of context that may be necessary to understand the problem a 
little better.

With respect to the ask to not further break Storm on Mesos we think it 
would be best for us to be able to specify (at runtime) to the supervisor which 
port it needs to listen on for these heartbeat messages. The way we see it, 
while the Supervisor being able to specify a range for the heartbeat port 
sounds like it may work, the Supervisor should _also_ be able to accept an 
assigned heartbeat port. Namely, once we adapt to the changes suggested 
[here](https://github.com/mesos/storm/issues/222#issuecomment-352530608), we 
can start each Supervisor using one of the ports Mesos offers to us.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR @revans2
Based on your comments, i thought there are mainly 3 TODOs for this patch:
1. Supervisor container-blity promotion, like support multiple 
supervisor instances on one machine.
2. Assignments security should be guaranteed.
3. Backwards compatibility for old version storm workers.

For TODO 1 I can make supervisor thrift ports picked in a range for a 
machine node, and nimbus aware the port-info from SupervisorHeartbeats. Also i 
will passed the port as an start up argument so that workers will know its 
parent supervisor port.

For TODO 2 i understood i should wait for @revans2's token authentication 
right?

For TODO 3 i don't know how much worker would be taken, if @HeartSaVioR can 
help to contribute it will very appreciate it.

Also the supervisor local disk worker heartbeats can also be removed 
actually, and it's easy to achieve for the patch, should i also do this?


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR I just posted a high level design for a delegation tokens 
[here](https://issues.apache.org/jira/browse/STORM-2898?focusedCommentId=16323160=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16323160)

Please take a look.  If it looks good I will start throwing together a 
patch based on it.  The we can get into some of the specifics associated with 
it.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@revans2 
Thanks for double checking. I think you've also suggested removing the 
behavior leveraging disk to communicate between worker and supervisor (so that 
the way of communication is consistent), but you don't think it's not 
mandatory. Got it.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR 

Perhaps we should have a call on the Metrics V2 work and what is its status 
for storm 2.0 because I don't totally know myself.

What I think is mandatory for this patch is.

1) wait for delegation token work to go in so we can authenticate 
connections between the worker and nimbus/supervisors.
2) Add in some form of authorization to the newly added APIs. 
   - getSupervisorAssignments should only be allowed from a supervisor
   - sendSupervisorWorkerHeartbeats should only be allowed from a 
supervisor.
   - sendSupervisorWorkerHeartbeat should verify that it has come from the 
owner of the topology the heartbeat is for
   - sendSupervisorAssignments needs to verify that it came from nimbus.
   - getLocalAssignmentForStorm needs to verify that it came from the owner 
of the topology.
   - sendSupervisorWorkerHeartbeat needs to verify that it came from the 
owner of the topology.
3) supervisor needs to pick a port from a configured allowable range of 
ports and get that information to everyone who is going to need it.

If someone wants to drop the old heartbeat mechanisms for talking to nimbus 
and the supervisors that is fine with me.  However, if we do drop it I really 
would love to have a way to maintain backwards compatibility because otherwise 
I will have to add it back in myself.

If we don't drop it now I would like to see a follow on JIRA to remove it, 
but for that to work as part of a rolling upgrade we would need to support both 
mechanisms at the same time.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@revans2 
Thanks again for your detailed explanation and taking up the hard security 
work which makes the patch stuck to go on. 

I see you are thinking about deprecating pacemaker after addressing the 
metrics load in ZK, then you are totally right we may want to avoid making it 
as default. At least I think now you are OK to go with this patch with some 
follow-up works, then metrics load in ZK would be removed when we finish all 
the necessary works and merge this in, which is OK for me.

TBH I'm still unsure the migration plan from Metrics V1 to V2 (including 
built-in) since it also opens the possibility for metrics to be transferred in 
different way (maybe via metrics reporter), but that might be out of topic for 
this patch, and we could discuss that after merging Metrics V2 to both 1.x and 
master.

Please double-check my understanding. If my understanding is right, you're 
suggesting to modify below things:

1. (a little unsure whether it is also mandatory for non-container) modify 
supervisor to pick a free port (range should be configurable) for thrift server 
instead of configured static value
2. change supervisor to include picked port information in heartbeat, and 
change nimbus to leverage the port information instead of configured static 
value for communicating via thrift RPC
3. change worker to report heartbeat to supervisor via RPC, not via local 
state (disk) which might be problematic from storm-mesos or other container 
solutions
4. address heartbeat fail-back mechanism for old version workers (reading 
from ZK)

@danny0405 
Could you check comments from @revans2 and provide inputs? Please also let 
me know if you would not want to deal with supporting for old version workers. 
I'll see I can address it on top of your patch.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
I was also just thinking that it would be really great if in the future we 
could drop even more access from the workers to zookeeper.  If assignments are 
gone and heartbeats go away all that is left is credentials, backpressure, and 
possibly a few other small things.  But backpressure is going to go away too so 
that would just be some small odds and ends.  If we could add in a return 
object to the heartbeats, we could in the future add in support for fetching 
changed credentials, etc. and not have to worry about zookeeper except in 
client side code.

So all I am asking is if for the heartbeat thrift calls if we could make it 
so the result code is an empty object and not void.  That way we have a clean 
way to expand it in the future.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
So I just spoke with management here and did some quick back of the 
envelope planning and I think to unblock this we should support delegation 
token like functionality in storm.  This could potentially make life a lot 
simpler for all kinds of things.  Here and with DRPC, etc.

I am willing to commit my team to make this happen, so I will file some 
JIRAs and try to put together a plan/architecture that hopefully others can 
review.

Once we have delegation tokens working the only real issue is going to be 
containerized supervisors.  I think we can support that by having the 
supervisor pick a free port in a configured range, and then include that port 
in it's heartbeat to nimbus.  We would also need a way to tell the workers what 
port to use to communicate with the supervisor.

For me I really would like to be able to maintain the ability to run 0.10.x 
and 1.x topologies under a 2.x cluster.  I think this would only require still 
checking for heartbeats from zookeeper before scheduling which I don't think 
has been removed yet, so I am hopeful that it will work with the current patch.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR

I didn't really want to make pacemaker the default because not everyone 
needs it, standing up yet another server is a pain for customers, and I was 
hoping it would be temporary.  Once we have the metrics out of the heartbeat, 
which yes I hope to do in conditionally in 2.x and completely remove in 3.x, we 
can leave simple heartbeats in ZK, or move to push the heartbeats directly to 
nimbus, whichever is fine with me.  Then there would be no need for pacemaker 
and it could be deprecated.

Hopefully I answered most of your other questions in my response to 
@danny0405 
[above](https://github.com/apache/storm/pull/2433#issuecomment-356965437)


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-11 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 

1.  The issues we saw with stability were around when a pacemaker nodes 
goes down that it was causing exceptions in the clients that we were not 
handling properly resulting in workers restarting.  We have not seen any issues 
with heartbeat packets being discarded.  If this did not cover the issues you 
were seeing I really would love to have a JIRA so I can fix it.

We are not running with 2.x, or even 1.x, in production so I cannot say if 
there is some oddness happening with what we have pushed back, or perhaps its 
interactions with HA.  We are on 0.10.2, we pulled back a lot from 1.x.  
This is why we really want to get to 2.x so we can we aligned with the 
community again and hopefully not have these kinds of issues.  There may be 
bugs we don't realize right now.

2. With pacemaker HA if you have 2+ pacemaker servers each of the clients 
will randomly select one of the servers to send heartbeats to.  If the one they 
try to write to is down, at the beginning or in the middle, the heartbeats 
should then start going to a different, random, server.  This should hopefully 
keep the load even between the pacemaker servers.  Nimbus is supposed to work 
all of this out by reading from all the servers, and if it finds more than one 
heartbeat for a worker it will pick the one that has the newest timestamp in 
it.  This does not scale well on the nimbus side, and can take more then 2 mins 
to download all of the heartbeats, so we have plans to parallelize the download.

The metric don't go to the supervisor, as it does not need/use them 
currently.  It only cares if the worker is up and still alive, so it knows if 
it needs to restart it.

3. I totally believe you that this can support a large cluster.  Like I 
said this is a much better solution long term, and I would love to go this 
route.  We just need to fix the security issues and find a way to support 
containerized supervisors for me to give it a +1. Both should be doable.

4. There is no security between the workers and pacemaker.  There is 
security between nimbus and pacemaker.  This means that only nimbus can see the 
heartbeats.  The worst you can do with faking heartbeats is confuse someone 
with bad metrics (not ideal) or trick nimbus into thinking a worker is still 
alive when it is not, bad but not horrible.  It is the assignment portion that 
is scary to me, because it says what to run.  If we pull the assignment portion 
out I would be OK with that.  Although it would be best to truly fix it because 
we don't have a way to selectively turn off authorization in thrift so to make 
that work we would need a separate thrift server on nimbus, which I would 
rather not do.

I would love to see the ability to do delegation tokens in storm for 
authentication.  This is no small task.  It would take a lot of work, 
especially with HA, which is why I haven't done it.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-10 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR  @revans2
1. Yeah, we did used pacemaker for a while for a cluster about 200 
topologies, but the workers restart frequently just because the Pacemaker 
heartbeats packets discard.
2. Also the pacemaker is a single point for the cluster, there is even no 
HA for it, when pacemaker restart, it will take a long time to recover 
heartbeats for it[even it has a HA], then most of the workers will time out and 
be reassigned by master. I raise doubts about keeping heartbeats into just one 
single point, and it is hard to scale laterally. @revans2 said that your 
cluster has 900 supervisors but only 120 topologies which push heatbeats to 
pacemaker. So do we have a index/metics between pacemaker and 
workers/topologies while not supervisors ?
3. This patch can really support large cluster, and it is very stable for 
out production, we have about 8000 topologies, and the new patch can support at 
least 2000 topologies at least for now. Also the patch has no single point 
problem compared to Pacemaker, and the heartbeat is very lightweight.
4. If we just have security problem, i can fix it.



---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-10 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@revans2 
First of all, I missed Storm on containers and around security. Thanks for 
the pointer. Much helped even in my side.

Looks like there are some things to sort out.

1.
Please take a look at https://issues.apache.org/jira/browse/STORM-2693 and 
comments. I've already proposed @danny0405 to try out Pacemaker, and he said it 
didn't help in some cases. Looks like operational experience regarding 
Pacemaker between you and @danny0405 are somewhat different. 

https://issues.apache.org/jira/browse/STORM-2693?focusedCommentId=16146530=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16146530

Maybe @danny0405 is using old version of Storm, or there might be some 
internal patches not contributed to Apache side. One thing to sort out, but if 
comment from @danny0405 is valid, that means Pacemaker is not working as 
expected, and that would be a critical issue for Pacemaker. I don't have huge 
scale cluster so I haven't manage cluster with Pacemaker.

2.
We know we utilize ZK in some bad way (write goes heavier whenever 
workers/topologies are being added), and we introduce Pacemaker as a nice 
mitigation.
If we think Pacemaker is now really stable and have all of essential 
functionalities, we may need to consider making it as default, or publicizing 
Pacemaker more via guiding when to consider using Pacemaker.
Without making it as a default or any guide to consider Pacemaker while 
sizing, users would normally struggle ZK issue first (got bad experience 
already), and try out Pacemaker as an alternative.

3.
The bigger concern for me between metrics and assignment is metrics. I 
don't know how much assignment via ZK affects the performance (if it hurts much 
it should be also considered), but it is really clear that worker metrics in ZK 
has been problematic for us.

I believe we will (and should) eventually drop current heartbeat structure 
which includes metrics, and the sooner the better.

What I have been not clear is how and when. From that point I have been 
expecting that Metrics V2 will take up the issue, and unfortunately, based on 
the current patch of Metrics V2, we would probably still use Metrics V1 for 
built-in metrics in Storm 2.0.0 unless we have separate patch for Metrics V2.

We should have a plan to migrate built-in metrics from Metrics V1 to 
Metrics V2, because there would be some more TODOs to make it done, and it 
can't be done partially (especially 2 and 3).

1. Implement worker metrics reporter which reports to Nimbus.
2. Change Nimbus to get metrics from metrics store instead of heartbeat, 
which makes UI leverage the metrics.
3. Migrate built-in metrics from Metrics V1 to Metrics V2: after the patch 
built-in metrics will not be presented to metric consumer (Metrics V1).

The above work would be backward incompatible, but we have no plan for 
Storm 2.0.0, and I'd rather not thinking about 3.0.0 even we don't have Storm 
2.0.0. It is ideal to be done before Storm 2.0.0, and if it's not possible, I 
wouldn't mind introducing it in 2.x with disregarding backward compatibility.

I think it is related to the change opened partially and now internally 
testing in Oath. Could you share a plan for this so that community can 
determine whether community should wait on news or there're some tasks Storm 
community should work on?

4. 
We may want to sort out which should be secure and which are not. Does 
metric need to be secured? If then same security issue will arise when 
implementing worker metrics reporter.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-10 Thread revans2
Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR I think I must have mixed this patch up with another patch so 
yes I am wrong if this goes in backwards compatibility is in trouble.

@danny0405 
I have been looking through the code and this is a rather large change, 
could you explain to me at a high level how the architecture is changing?  I 
see that there is a SupervisorClient now. 

It looks like the worker tries to talk to the supervisor through this 
client sending heartbeats and asking for its current assignment.  Also the 
worker can fall back to talking to nimbus for the heartbeats.

I also see that nimbus will use it to try and send out new assignments to 
the Supervisors as well.

Why do we heartbeat to the supervisor in 2 different ways?  We still go 
through the local file system, but also through thrift with a fallback to 
nimbus if it fails.  The implementation of this thrift heartbeat handler in the 
Supervisor is a noop.  Is the plan to eventually stop going through the file 
system?

My biggest concern is around security.  In the current security model the 
workers have no way to communicate with Nimbus, or for that matter the new 
Supervisor Thrift Server.  Nimbus/Supervisor only supports Kerberos 
authentication and topologies are not guaranteed to have kerberos credentials 
available to them.  For Zookeeper we use a username/password that is randomly 
generated per topology to protect access.  When going through the file system 
to the supervisor we explicitly setup the file system permissions to only allow 
the owner of the topology + the supervisor user to be able to read and write to 
the heartbeat directory.  This does none of those.  The new Thrift APIs do 
nothing to restrict who has access to the them, so anyone can download 
assignments and anyone can heartbeat in for anyone else.  I can think of a few 
malicious things that an attacker could do with this power. Even if we do have 
proper authorization if I turn on Kerberos authentication those communication c
 hannels are now shut down to the worker because it cannot authenticate through 
Kerberos and hence cannot get past SASL to do anything.

I am also concerned about Storm on Mesos with this patch.  We have broken 
them a lot of times because we don't think about them enough, and this is going 
to be another one of those cases.  Each supervisor must be on a given port for 
the cluster to work in this setup.  On Mesos and probably on Yarn too, there is 
the possibility of multiple supervisors running on a single physical host.  
Storm on mesos already jumps through all kinds of hoops to make this work by 
resetting the directories where the supervisors store things to avoid any 
collisions. With a port that cannot change because nimbus does not support 
talking to it on an ephemeral port I don't see a way to make 2 supervisors 
happy on the same box.

Now please don't think I am beating on this patch because I would much 
rather have pacemaker.  I think the overall direction that this patch is going 
in is much much cleaner than what we are doing with pacemaker/zookeeper.  There 
is just potentially a lot of work to really get this working in a secure and 
compatible way.  We would have to support something like hadoop delegation 
tokens in storm so the workers could talk to nimbus and we could guarantee in 
all cases that they would be there an up to date.  We would also have to have a 
way for the supervisors to support something similar.

But I would encourage you to take another look at pacemaker in the 
short-term, at least until we have all of these issues worked out.  We 
currently have clusters running using pacemaker with 900 supervisors and over 
120 topologies in a secure and stable way.  We did run into stability issues 
with pacemaker at the beginning, but we believe that we have fixed all of them. 
 If you are having crashes please let us know and we will do our best to fix 
them.  We also are currently experiencing issues with the time it takes to 
download heartbeats.  We have a plan to parallelize it, but just have not put 
it in place yet.

Another thing to think about is that we are likely going to be able to 
remove the metrics from the ZK heartbeats once we have an alternative path for 
them.  This should let us reduce the load on ZK/pacemaker massively and let us 
scale even better.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-09 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 
Yeah I understand it is not ideal. Let's wait for more inputs. If the new 
behavior (for old workers, when supervisor killed) is acceptable we can just go 
on. If it isn't, the patch may not be accepted without mitigation whether it is 
a part of the patch or follow-up patch, so even I should go ahead and implement 
my suggestion based on your patch.

If you are planning for exploring more based on your patch and have a 
design doc, it would be really nice to share it to Storm dev mailing list. It 
may help persuade your opinion to Storm community.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-09 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR 
Yes, this is the case, but i don't suggest to mix two kinds of heartbeats 
aware logics in Nimbus, it's too heavy for scheduling[with old workers] and 
this is not the intention of this patch.

I don't think it is worthy of making the mix for the little difference of 
workers.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-09 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@revans2 
I'm going through workarounds here. Could you review this as well, and 
provide review comments/suggestions? It would be really appreciated if you 
could test the patch against old workers.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-09 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 
OK. I have been switching the context from multiple projects/issues recent 
days so may miss the detail on this patch, and even confusion with current 
implementation (sad).

Could you verify if I understand your statements correctly?

1. Older Worker is already leaving heartbeats to local state and newer 
Supervisor can leverage them to report to newer Nimbus hence no need to do 
additional work on that.
2. Older worker can't report its heartbeat to newer Nimbus directly, hence 
newer Nimbus can't get older workers' heartbeat if newer supervisor is down. <= 
This would be a major difference between older worker and newer worker for this 
patch.

If my understanding is right, looking into ZK for fail-back mechanism (for 
workers which relevant supervisor is down) might still make sense for old 
workers, which work would be not easy. 
If it is a hard requirement, let's not be smart for old workers. If we can 
identify topology version is under 2.0.0, just ignore heartbeats supervisor is 
reporting and read heartbeats from ZK. This will get rid of headache between 
aggregation between supervisor RPC and ZK.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-09 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 
Sorry I should explain the motivation before. The motivation is well 
explained to the description on 
https://issues.apache.org/jira/browse/STORM-2448.

Explaining background as far as I know, some vendors which leverages Storm 
(via product, or cluster) are always struggling with ensuring users to upgrade 
their clusters (or topologies) smoothly. It should be ideal if all users are 
using most recent version of Storm, but it is sadly true that someone still 
even uses Storm 0.10.x in their production. 

The team which makes one of Storm's biggest contribution is also struggling 
on this. They have put noticeable efforts to RAS in Storm 2.0.0 which really 
helps utilizing huge cluster (I guess it is  centralized), but it is even not 
easy for them to persuade their all of clients to upgrade their topologies. 
(I'm not quite sure it is similar case for them but you may understand the case 
if you are working on company which number of employees goes over 1000~.)

STORM-2448 comes in as a workaround for addressing above issue. We should 
(must, I wish) drop supporting them from Storm 3.0.0, but we may need to live 
with this for now.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-09 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR
But with this patch nimbus can collect both the 2.0./1.x workers 
heartbeats, this is no differences here.

We do not collect heartbeats based on workers now.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-09 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 
We have SUPERVISOR_WORKER_VERSION_CLASSPATH_MAP in Config, which is a map 
representing supported versions in supervisor. Assuming all the nodes have 
consistent configuration, Nimbus could check the value of map while initiating 
and determine whether the cluster supports Storm versions under 2.0.0.


https://github.com/apache/storm/blob/466a7ad74da27c1250eedf412a487db409e42c19/storm-client/src/jvm/org/apache/storm/Config.java#L1529-L1551

And updated topology structure has Storm version information:


https://github.com/apache/storm/blob/466a7ad74da27c1250eedf412a487db409e42c19/storm-client/src/storm.thrift#L265-L283

we could treat `no value` as same as cluster's version.

So if Nimbus could determine topology's Storm version and apply its 
heartbeat behavior based on the version, we could read heartbeat from ZK for 
topology version under Storm 2.0.0, and we could treat heartbeats only from RPC 
for topology version same or higher than Storm 2.0.0. Not sure how much the 
work would be needed, but with this approach we don't even need to concern with 
SUPERVISOR_WORKER_VERSION_CLASSPATH_MAP.




---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-09 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR 
Actually, this patch can work with old workers if we upgrade all the 
daemons: Nimbus/Supervisors, cause supervisors collect local heartbeats and 
report to Nimbus through RPC.

The only actions diff of old/new workers is that: old workers will not 
report tick hb to local Supervisor while new workers will do.  When new 
supervisor daemon die with old workers, nimbus will sense it and fire a 
timed-out reassign. 


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-08 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR I can do this
But what is the motivation to let 2.0.0 daemons interact with Storm 1.x and 
0.10.x workers?
Another issue: is there already an option i can reuse or i need to fire a 
new one?


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2018-01-08 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 
Could you rebase?

And I'm sorry but I'd like to ask a favor of including option for 
compatibility mode for supporting Storm 1.x and 0.10.x. We have applied 
[STORM-2448](https://issues.apache.org/jira/browse/STORM-2448) which guarantees 
Storm 2.0.0 daemons can interact Storm 1.x and 0.10.x workers. I have 
initialized discussion to revisit that issue, but we may still want to have 
STORM-2448, so we need to have a compatibility mode for now.

Could we let Nimbus also reads the ZK heartbeat like before when the option 
is turned on?


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2017-12-28 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@revans2 
I'm feeling that the feature may effectively disable compatibility feature 
- Storm 2.0.0 daemons to work with Storm 1.x/0.10.x workers - since heartbeat 
mechanism is changed. I hope it is OK to all of us, and personally I don't want 
to hesitate introducing backward-incompatible changes in major version to not 
miss out valuable efforts.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2017-12-26 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR Really thx for nice review work, it duration is long and the 
patch is huge, thx for your patience.

I have fixed the comments not addressed.

Nimbus will have a debug log if assignments/heartbeats-report error.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2017-12-26 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
I have done manual tests with multiple nodes (5 nodes) with Nimbus H/A (2 
nodes): switching nimbus leader, killing supervisor, killing workers.

I'm +1 once my remaining review comments are addressed afterwards.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2017-12-26 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
Manual tests with a node works nicely. (killing nimbus, killing supervisor, 
killing workers)
Will work on manual tests with small cluster. (3 or 5 nodes)

I still would want to wait for other reviewers to review the code before 
giving +1.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2017-12-22 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR i have fix the building slow issue, it is because during the 
nimbus-test clojure testing, i start all the workers every time after 
submitting topology, in the old mode, when the cluster attribute 
SUPERVISOR-ENABLE is turned false, no workers will be started in local.

This is also the reason why there are 2 TODOs in nimbus-test with new mode, 
i have fixed it together.

I turned of master forward assignments distribution and just let supervisor 
sync when time simulation is turned on for local mode.

So this is fixed.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2017-12-20 Thread danny0405
Github user danny0405 commented on the issue:

https://github.com/apache/storm/pull/2433
  
@HeartSaVioR i have fixed the checkstyle issue already, yeah the storm-core 
building is slow, i will check the reason.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2017-12-20 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 
Btw, the build in local passed, but it incurs considerable addition in 
build time.

```
[INFO] 

[INFO] Reactor Summary:
[INFO]
[INFO] Storm .. SUCCESS [  
3.699 s]
[INFO] Apache Storm - Checkstyle .. SUCCESS [  
0.708 s]
[INFO] multilang-javascript ... SUCCESS [  
0.126 s]
[INFO] multilang-python ... SUCCESS [  
0.265 s]
[INFO] multilang-ruby . SUCCESS [  
0.093 s]
[INFO] maven-shade-clojure-transformer  SUCCESS [  
2.413 s]
[INFO] storm-maven-plugins  SUCCESS [  
2.638 s]
[INFO] Storm Client ... SUCCESS [ 
44.607 s]
[INFO] storm-server ... SUCCESS [02:27 
min]
[INFO] storm-clojure .. SUCCESS [  
5.257 s]
[INFO] Storm Core . SUCCESS [23:34 
min]
[INFO] Storm Webapp ... SUCCESS [ 
11.411 s]
[INFO] storm-clojure-test . SUCCESS [  
2.185 s]
[INFO] storm-submit-tools . SUCCESS [  
8.076 s]
[INFO] flux ... SUCCESS [  
0.093 s]
[INFO] flux-wrappers .. SUCCESS [  
0.354 s]
[INFO] storm-kafka  SUCCESS [01:48 
min]
[INFO] storm-autocreds  SUCCESS [  
3.273 s]
[INFO] storm-hdfs . SUCCESS [ 
39.636 s]
[INFO] storm-hbase  SUCCESS [  
4.100 s]
[INFO] flux-core .. SUCCESS [  
4.035 s]
[INFO] flux-examples .. SUCCESS [ 
12.051 s]
[INFO] storm-sql-runtime .. SUCCESS [  
2.391 s]
[INFO] storm-sql-core . SUCCESS [01:47 
min]
[INFO] storm-sql-kafka  SUCCESS [  
4.736 s]
[INFO] storm-redis  SUCCESS [  
5.799 s]
[INFO] storm-sql-redis  SUCCESS [  
8.231 s]
[INFO] storm-mongodb .. SUCCESS [  
0.826 s]
[INFO] storm-sql-mongodb .. SUCCESS [  
3.936 s]
[INFO] storm-sql-hdfs . SUCCESS [ 
12.313 s]
[INFO] sql  SUCCESS [  
0.150 s]
[INFO] storm-hive . SUCCESS [ 
18.669 s]
[INFO] storm-jdbc . SUCCESS [  
1.777 s]
[INFO] storm-eventhubs  SUCCESS [  
3.725 s]
[INFO] storm-elasticsearch  SUCCESS [ 
20.401 s]
[INFO] storm-solr . SUCCESS [  
1.515 s]
[INFO] storm-metrics .. SUCCESS [  
0.590 s]
[INFO] storm-cassandra  SUCCESS [01:01 
min]
[INFO] storm-mqtt . SUCCESS [ 
22.396 s]
[INFO] storm-kafka-client . SUCCESS [02:06 
min]
[INFO] storm-opentsdb . SUCCESS [  
0.664 s]
[INFO] storm-kafka-monitor  SUCCESS [  
2.430 s]
[INFO] storm-kinesis .. SUCCESS [  
1.281 s]
[INFO] storm-druid  SUCCESS [  
1.958 s]
[INFO] storm-jms .. SUCCESS [ 
12.099 s]
[INFO] storm-pmml . SUCCESS [  
0.267 s]
[INFO] storm-rocketmq . SUCCESS [ 
11.298 s]
[INFO] blobstore-migrator . SUCCESS [ 
12.028 s]
[INFO] Storm Integration Test . SUCCESS [  
0.696 s]
[INFO] storm-starter .. SUCCESS [ 
19.945 s]
[INFO] storm-loadgen .. SUCCESS [  
2.457 s]
[INFO] storm-mongodb-examples . SUCCESS [  
0.771 s]
[INFO] storm-redis-examples ... SUCCESS [  
1.096 s]
[INFO] storm-opentsdb-examples  SUCCESS [  
1.489 s]
[INFO] storm-solr-examples  

[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2017-12-20 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 Please let me know when you are finished and the PR is ready to 
review again. Thanks a lot for the patience!


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2017-11-22 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
Please fix the checkstyle issue on `storm-client` or "temporary" change max 
violation count so that Travis CI can build the code properly. Please also note 
that max violation count for all of modules should not increase before merging 
this in.


---


[GitHub] storm issue #2433: [STORM-2693] Heartbeats and assignments promotion for sto...

2017-11-22 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/storm/pull/2433
  
@danny0405 
Thanks for updating your great effort!
The change is huge, and unfortunately I can't find time to review this at 
once. I'll try to review partially and multiple times.


---