Re: [VOTE] Apache Helix 0.9.9 Release

2021-11-15 Thread kishore g
+1

On Mon, Nov 15, 2021 at 12:19 PM Junkai Xue  wrote:

> Hi,
>
> This is to call for a vote on releasing the following candidate as
> Apache Helix 0.9.9. This is the 23rd release of Helix as an Apache
> project, as well as the 19th release as a top-level Apache project.
>
> Apache Helix is a generic cluster management framework that makes it
> easy to build partitioned and replicated, fault-tolerant and scalable
> distributed systems.
>
> Release notes:
>- Add an additional ZK serializer configuration to activate ZNRecord
> compression even if the node size is smaller than the write size limit.
>
> Release artifacts:
> https://repository.apache.org/content/repositories/orgapachehelix-1047
> Distribution:
> * binaries: https://dist.apache.org/repos/dist/dev/helix/0.9.9/binaries/
> * sources: https://dist.apache.org/repos/dist/dev/helix/0.9.9/src/
> The 0.9.9 release tag:
> https://git-wip-us.apache.org/repos/asf?p=helix.git;a=tag;h=refs/tags/helix-0.9.9
> KEYS file available here: https://dist.apache.org/repos/dist/dev/helix/KEYS
> Please vote on the release. The vote will be open for at least 72 hours.
>
> [+1] -- "YES, release"
> [0] -- "No opinion"
> [-1] -- "NO, do not release"
>
> Thanks,
> The Apache Helix Team
>


Re: [VOTE] Apache Helix 0.9.8 Release

2020-10-19 Thread kishore g
+1

On Wed, Oct 14, 2020 at 9:32 PM Lei Xia  wrote:

> +1
>
> On Wed, Oct 14, 2020 at 9:03 PM Hunter Lee  wrote:
>
> > +1. Thanks for putting this together!
> >
> > On Wed, Oct 14, 2020 at 6:25 PM Wang Jiajun 
> > wrote:
> >
> > > Hi,
> > >
> > > This is to call for a vote on releasing the following candidate as
> Apache
> > > Helix 0.9.8. This is the 23rd release of Helix as an Apache project, as
> > > well as the 19th release as a top-level Apache project. This release is
> > > supporting the customers who are using the 0.9 series.
> > >
> > > Apache Helix is a generic cluster management framework that makes it
> easy
> > > to build partitioned and replicated, fault-tolerant and scalable
> > > distributed systems.
> > >
> > > Release notes:
> > > https://helix.apache.org/0.9.8-docs/releasenotes/release-0.9.8.html
> > >
> > > Release artifacts:
> > >
> https://repository.apache.org/content/repositories/orgapachehelix-1042/
> > >
> > > Distribution:
> > > * binaries:
> > > https://dist.apache.org/repos/dist/dev/helix/0.9.8/binaries/
> > > * sources:
> > > https://dist.apache.org/repos/dist/dev/helix/0.9.8/src/
> > >
> > > The 0.9.8 release tag:
> > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=helix.git;a=tag;h=refs/tags/helix-0.9.8
> > >
> > > KEYS file available here:
> > > https://dist.apache.org/repos/dist/dev/helix/KEYS
> > >
> > > Please vote on the release. The vote will be open for at least 72
> hours.
> > >
> > > [+1] -- "YES, release"
> > > [0] -- "No opinion"
> > > [-1] -- "NO, do not release"
> > >
> > > Thanks,
> > > The Apache Helix Team
> > >
> >
>


Re: [VOTE] Apache Helix 0.9.7 Release

2020-05-13 Thread kishore g
+1

On Wed, May 13, 2020 at 7:27 PM Olivier Lamy  wrote:

> +1
>
> On Tue, 12 May 2020 at 09:23, Xue Junkai  wrote:
>
> > Hi,
> >
> >
> > This is to call for a vote on releasing the following candidate as Apache
> > Helix 0.9.7. This is the 22nd release of Helix as an Apache project, as
> > well as the 18th release as a top-level Apache project. This release is
> > supporting the customers who are using the 0.9 series.
> >
> >
> > Apache Helix is a generic cluster management framework that makes it easy
> > to build partitioned and replicated, fault-tolerant and scalable
> > distributed systems.
> >
> >
> > Release notes:
> >
> > http://helix.apache.org/0.9.7-docs/releasenotes/release-0.9.7.html
> >
> >
> > Release artifacts:
> >
> > https://repository.apache.org/content/repositories/orgapachehelix-1039
> >
> >
> > Distribution:
> >
> > * binaries:
> >
> > https://dist.apache.org/repos/dist/dev/helix/0.9.7/binaries/
> >
> > * sources:
> >
> > https://dist.apache.org/repos/dist/dev/helix/0.9.7/src/
> >
> >
> > The 0.9.7 release tag:
> >
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=helix.git;a=tag;h=refs/tags/helix-0.9.7
> >
> >
> > KEYS file available here:
> >
> > https://dist.apache.org/repos/dist/dev/helix/KEYS
> >
> >
> > Please vote on the release. The vote will be open for at least 72 hours.
> >
> >
> > [+1] -- "YES, release"
> >
> > [0] -- "No opinion"
> >
> > [-1] -- "NO, do not release"
> >
> >
> > Thanks,
> >
> > The Apache Helix Team
> >
>
>
> --
> Olivier Lamy
> http://twitter.com/olamy | http://linkedin.com/in/olamy
>


Re: [VOTE] Apache Helix 1.0.0 Release

2020-05-06 Thread kishore g
+1

On Wed, May 6, 2020 at 10:01 PM Hao Zhang  wrote:

> +1
>
> Congratulations on the 1.0.0 Release!
>
> —
> Best,
> Harry
>
> On Tue, May 5, 2020 at 22:07 Lei Xia  wrote:
>
> > +1
> >
> > On Tue, May 5, 2020 at 11:50 AM Wang Jiajun 
> > wrote:
> >
> > > +1
> > >
> > > Best Regards,
> > > Jiajun
> > >
> > >
> > > On Mon, May 4, 2020 at 5:42 PM Xue Junkai  wrote:
> > >
> > > > Hi,
> > > >
> > > >
> > > > This is to call for a vote on releasing the following candidate as
> > Apache
> > > > Helix 1.0.0. This is the 21st release of Helix as an Apache project,
> as
> > > > well as the 16th release as a top-level Apache project.
> > > >
> > > >
> > > > Apache Helix is a generic cluster management framework that makes it
> > easy
> > > > to build partitioned and replicated, fault-tolerant and scalable
> > > > distributed systems.
> > > >
> > > >
> > > > Release notes:
> > > >
> > > > http://helix.apache.org/1.0.0-docs/releasenotes/release-1.0.0.html
> > > >
> > > >
> > > > Release artifacts:
> > > >
> > > >
> https://repository.apache.org/content/repositories/orgapachehelix-1037
> > > >
> > > >
> > > > Distribution:
> > > >
> > > > * binaries:
> > > >
> > > > https://dist.apache.org/repos/dist/dev/helix/1.0.0/binaries/
> > > >
> > > > * sources:
> > > >
> > > > https://dist.apache.org/repos/dist/dev/helix/1.0.0/src/
> > > > 
> > > >
> > > >
> > > > The 1.0.0 release tag:
> > > >
> > > >
> > > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=helix.git;a=tag;h=refs/tags/helix-1.0.0
> > > >
> > > >
> > > > KEYS file available here:
> > > >
> > > > https://dist.apache.org/repos/dist/dev/helix/KEYS
> > > >
> > > >
> > > > Please vote on the release. The vote will be open for at least 72
> > hours.
> > > >
> > > >
> > > > [+1] -- "YES, release"
> > > >
> > > > [0] -- "No opinion"
> > > >
> > > > [-1] -- "NO, do not release"
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > The Apache Helix Team
> > > >
> > >
> >
>


Re: [VOTE] Apache Helix 0.9.4 Release

2020-01-24 Thread kishore g
+1

On Thu, Jan 23, 2020 at 8:49 AM Xue Junkai  wrote:

> +1
>
> On Thu, Jan 23, 2020 at 5:46 AM Lei Xia  wrote:
>
> > +1
> >
> >
> > Lei
> >
> > On Wed, Jan 22, 2020 at 6:30 PM Hunter Lee  wrote:
> >
> > > It is up now.
> > >
> > > Hunter
> > >
> > > On Wed, Jan 22, 2020 at 8:48 AM Lei Xia  wrote:
> > >
> > > > Thanks Hunter, the release notes link seems not to work?
> > > >
> > > >
> > > > Lei
> > > >
> > > > On Tue, Jan 21, 2020 at 11:40 PM Hunter Lee 
> > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > This is to call for a vote on releasing the following candidate as
> > > Apache
> > > > > Helix 0.9.4. This is the 19th release of Helix as an Apache
> project,
> > as
> > > > > well as the 15th release as a top-level Apache project.
> > > > >
> > > > > Apache Helix is a generic cluster management framework that makes
> it
> > > easy
> > > > > to build partitioned and replicated, fault-tolerant and scalable
> > > > > distributed systems.
> > > > >
> > > > > Release notes:
> > > > >
> https://helix.apache.org/0.9.4-docs/releasenotes/release-0.9.4.html
> > > > >
> > > > > Release artifacts:
> > > > >
> > >
> https://repository.apache.org/content/repositories/orgapachehelix-1036/
> > > > >
> > > > > Distribution:
> > > > > * binaries:
> > > > > https://dist.apache.org/repos/dist/dev/helix/0.9.4/binaries/
> > > > > * sources:
> > > > > https://dist.apache.org/repos/dist/dev/helix/0.9.4/src/
> > > > >
> > > > > The 0.9.4 release tag:
> > > > >
> > > > >
> > > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=helix.git;a=tag;h=refs/tags/helix-0.9.4
> > > > >
> > > > > KEYS file available here:
> > > > > https://dist.apache.org/repos/dist/dev/helix/KEYS
> > > > >
> > > > > Please vote on the release. The vote will be open for at least 72
> > > hours.
> > > > >
> > > > > [+1] -- "YES, release"
> > > > > [0] -- "No opinion"
> > > > > [-1] -- "NO, do not release"
> > > > >
> > > > > Thanks,
> > > > > The Apache Helix Team
> > > > >
> > > >
> > >
> >
>


Re: [VOTE] Apache Helix 0.9.1 Release

2019-08-19 Thread kishore g
+1

On Wed, Aug 14, 2019 at 3:29 PM Hunter Lee  wrote:

> +1
>
> On Wed, Aug 14, 2019 at 2:07 PM Wang Jiajun 
> wrote:
>
> > Hi,
> >
> > This is to call for a vote on releasing the following candidate as Apache
> > Helix 0.9.1. This is the 18th release of Helix as an Apache project, as
> > well as the 14th release as a top-level Apache project.
> >
> > Apache Helix is a generic cluster management framework that makes it easy
> > to build partitioned and replicated, fault-tolerant and scalable
> > distributed systems.
> >
> > Release notes:
> > https://helix.apache.org/0.9.1-docs/releasenotes/release-0.9.1.html
> >
> > Release artifacts:
> > https://repository.apache.org/content/repositories/orgapachehelix-1032/
> >
> > Distribution:
> > * binaries:
> > https://dist.apache.org/repos/dist/dev/helix/0.9.1/binaries/
> > * sources:
> > https://dist.apache.org/repos/dist/dev/helix/0.9.1/src/
> >
> > The 0.9.1 release tag:
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=helix.git;a=tag;h=refs/tags/helix-0.9.1
> >
> > KEYS file available here:
> > https://dist.apache.org/repos/dist/dev/helix/KEYS
> >
> > Please vote on the release. The vote will be open for at least 72 hours.
> >
> > [+1] -- "YES, release"
> > [0] -- "No opinion"
> > [-1] -- "NO, do not release"
> >
> > Thanks,
> > The Apache Helix Team
> >
>


Re: [VOTE] Apache Helix 0.9.0 Release

2019-06-12 Thread kishore g
+1.



On Wed, Jun 12, 2019 at 11:01 AM Lei Xia  wrote:

> +1
>
> On Tue, Jun 11, 2019 at 3:07 PM Hunter Lee  wrote:
>
> > Hi,
> >
> > This is to call for a vote on releasing the following candidate as Apache
> > Helix 0.9.0. This is the 17th release of Helix as an Apache project, as
> > well as the 13th release as a top-level Apache project.
> >
> > Apache Helix is a generic cluster management framework that makes it easy
> > to build partitioned and replicated, fault-tolerant and scalable
> > distributed systems.
> >
> > Release notes:
> > https://helix.apache.org/0.9.0-docs/releasenotes/release-0.9.0.html
> >
> > Release artifacts:
> > https://repository.apache.org/content/repositories/orgapachehelix-1029/
> >
> > Distribution:
> > * binaries:
> > https://dist.apache.org/repos/dist/dev/helix/0.9.0/binaries/
> > * sources:
> > https://dist.apache.org/repos/dist/dev/helix/0.9.0/src/
> >
> > The 0.9.0 release tag:
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=helix.git;a=tag;h=refs/tags/helix-0.9.0
> >
> > KEYS file available here:
> > https://dist.apache.org/repos/dist/dev/helix/KEYS
> >
> > Please vote on the release. The vote will be open for at least 72 hours.
> >
> > [+1] -- "YES, release"
> > [0] -- "No opinion"
> > [-1] -- "NO, do not release"
> >
> > Thanks,
> > The Apache Helix Team
> >
>


Re: Zookeeper connection errors in Helix Controller

2019-05-31 Thread kishore g
Can you grep for the ZooKeeper state in the controller log?

On Fri, May 31, 2019 at 7:52 AM DImuthu Upeksha 
wrote:

> Hi Folks,
>
> I'm getting the following error in the controller log, and it seems like the
> controller is not moving forward after that point
>
> 2019-05-31 10:47:37,084 [main] INFO  o.a.a.h.i.c.HelixController  -
> Starting helix controller
> 2019-05-31 10:47:37,089 [main] INFO  o.a.a.c.u.ApplicationSettings  -
> Settings loaded from
>
> file:/home/airavata/staging-deployment/airavata-helix/apache-airavata-controller-0.18-SNAPSHOT/conf/airavata-server.properties
> 2019-05-31 10:47:37,091 [Thread-0] INFO  o.a.a.h.i.c.HelixController  -
> Connection to helix cluster : AiravataDemoCluster with name :
> helixcontroller2
> 2019-05-31 10:47:37,092 [Thread-0] INFO  o.a.a.h.i.c.HelixController  -
> Zookeeper connection string localhost:2181
> 2019-05-31 10:47:42,907 [GenericHelixController-event_process] ERROR
> o.a.h.c.GenericHelixController  - Exception while executing
> DEFAULTpipeline: org.apache.helix.controller.pipeline.Pipeline@408d6d26for
> cluster .AiravataDemoCluster. Will not continue to next pipeline
> org.apache.helix.api.exceptions.HelixMetaDataAccessException: Failed to get
> full list of /AiravataDemoCluster/CONFIGS/PARTICIPANT
> at
>
> org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:446)
> at
>
> org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValues(ZKHelixDataAccessor.java:406)
> at
>
> org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValuesMap(ZKHelixDataAccessor.java:467)
> at
>
> org.apache.helix.controller.stages.ClusterDataCache.refresh(ClusterDataCache.java:176)
> at
>
> org.apache.helix.controller.stages.ReadClusterDataStage.process(ReadClusterDataStage.java:62)
> at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:63)
> at
>
> org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:432)
> at
>
> org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:928)
> Caused by: org.apache.helix.api.exceptions.HelixMetaDataAccessException:
> Fail to read nodes for
> [/AiravataDemoCluster/CONFIGS/PARTICIPANT/helixparticipant]
> at
>
> org.apache.helix.manager.zk.ZkBaseDataAccessor.get(ZkBaseDataAccessor.java:414)
> at
>
> org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:479)
> at
>
> org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:442)
> ... 7 common frames omitted
>
> In the zookeeper log I can see following warning getting printed
> continuously. What could be the reason for that? I'm using helix 0.8.2 and
> zookeeper 3.4.8
>
> 2019-05-31 10:49:37,621 [myid:] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008] - Closed socket connection for
> client /0:0:0:0:0:0:0:1:59056 which had sessionid 0x16b0e59877f
> 2019-05-31 10:49:37,773 [myid:] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket
> connection
> from /127.0.0.1:57984
> 2019-05-31 10:49:37,774 [myid:] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@893] - Client attempting to renew
> session 0x16b0e59877f at /127.0.0.1:57984
> 2019-05-31 10:49:37,774 [myid:] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:ZooKeeperServer@645] - Established session
> 0x16b0e59877f with negotiated timeout 3 for client /
> 127.0.0.1:57984
> 2019-05-31 10:49:37,790 [myid:] - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid
> 0x16b0e59877f, likely client has closed socket
> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:230)
> at
>
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
> at java.lang.Thread.run(Thread.java:748)
>
> Thanks
> Dimuthu
>


Re: For PMC - enabling GitHub issues and wiki

2019-05-26 Thread kishore g
I am in favor of enabling github issues and wiki

On Sun, May 26, 2019 at 11:22 AM Hunter Lee  wrote:

> Could a member of the PMC update the ticket for GitHub issues and wiki?
> This was discussed informally offline, so please mention that we do not
> have the record of it, but as long as the PMC could verify that we want
> this for Helix, the infra team should be able to go ahead and do it for us.
> https://issues.apache.org/jira/browse/INFRA-18471
>
> Thanks,
> Hunter
>


Re: Scaling participants to improve throughput of task execution

2019-04-04 Thread kishore g
Ideally it should, but that might depend on what happens within each task. Can
you give more information about the setup (how many nodes, tasks, etc.)?

On Thu, Apr 4, 2019 at 2:15 PM DImuthu Upeksha 
wrote:

> Hi Folks,
>
> In the task framework, is it expected that the throughput of executed tasks
> improves significantly if I add a new participant to the cluster? The reason
> for asking is that I'm seeing almost the same throughput with one
> participant and two participants. I'm using Helix 0.8.4 for this setup.
>
> Thanks
> Dimuthu
>


Re: ZkHelixManager disconnection hangs

2019-04-01 Thread kishore g
This is a good catch. @Wang Jiajun  the stack trace
is good enough to fix this, right? We just have to look at all the paths by
which we can get into this method and make sure resetHandlers is thread safe
and validates the state of the zkConnection and handlers.

On Mon, Apr 1, 2019 at 12:41 PM Wang Jiajun  wrote:

> Hi Dimuthu,
>
> Did you stop the controller when the connection is flapping or when it is
> normal?
> Could you please list all the steps that you have done in order?
>
> Best Regards,
> Jiajun
>
>
> On Sat, Mar 30, 2019 at 5:54 AM DImuthu Upeksha <
> dimuthu.upeks...@gmail.com>
> wrote:
>
> > Hi Folks,
> >
> > In helix controller, we have seen below log line and by looking at the
> > code, I understood that it is due to ZkHelixManager failing to connect
> > to zookeeper for 5 times. So I tried to stop the controller and in the
> stop
> > logic, we have a call to ZkHelixManager.disconnect() method and it
> hangs. I
> > got a thread dump and you can see where it is waiting. Can you please
> > advise a better approach to solve this?
> >
> > I noticed that ZkHelixManager disconnects [1] itself when flapping is
> > detected. Is calling disconnect() twice the reason for that?
> >
> > 2019-03-29 15:19:56,832 [
> > ZkClient-EventThread-14-api.staging.scigap.org:2181]
> > ERROR o.a.h.m.zk.ZKHelixManager  - instanceName: helixcontroller is
> > flapping. disconnect it.  maxDisconnectThreshold: 5 disconnects in
> > 30ms.
> >
> > Thread-5 - priority:5 - threadId:0x7f5c740023f0 - nativeId:0x63f1 -
> > nativeId (decimal):25585 - state:BLOCKED
> > stackTrace:
> > java.lang.Thread.State: BLOCKED (on object monitor)
> > at
> >
> >
> org.apache.helix.manager.zk.ZKHelixManager.resetHandlers(ZKHelixManager.java:903)
> > - waiting to lock <0x0006c7e08110> (a
> > org.apache.helix.manager.zk.ZKHelixManager)
> > at
> >
> >
> org.apache.helix.manager.zk.ZKHelixManager.disconnect(ZKHelixManager.java:693)
> > at
> >
> >
> org.apache.airavata.helix.impl.controller.HelixController.disconnect(HelixController.java:103)
> > at
> >
> >
> org.apache.airavata.helix.impl.controller.HelixController$$Lambda$2/846492085.run(Unknown
> > Source)
> > at java.lang.Thread.run(Thread.java:748)
> > Locked ownable synchronizers:
> > - None
> >
> > [1]
> >
> >
> https://github.com/apache/helix/blob/helix-0.8.2/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixManager.java#L991
> > Thanks
> > Dimuthu
> >
>


Re: Proposal: Moving Helix to Java 1.8 and upgrading Maven version

2019-03-25 Thread kishore g
+1. I don't see any issue with upgrading to 1.8.

On Sun, Mar 24, 2019 at 10:56 PM Hunter Lee  wrote:

> I would like to start a discussion on making Java 8 a minimum requirement
> and upgrading the Maven version for Helix's next feature release. I'd like
> to see how people feel about it.
>
> Did some homework on this and dug up a few precedents that are also
> top-level Apache projects dependent on ZooKeeper. The following
> documentation lists many pros of moving to Java 8 as well, many of which I
> will not include in this email for the sake of brevity (see the links
> below).
>
> Open-source community discussions for
>
> Apache Samza: link1
> 
>
> Apache Kafka: link1  link2
> 
>
> I've also had informal chats with PMC members of both Samza and Kafka
> about this specifically for more context, and from what they said, the
> transition has been very smooth.
>
> Here are Helix-specific reasons why I think the move would be beneficial:
>
> - Other Apache open-source platforms built on Helix such as Pinot and
> Gobblin all cite Java 8 as the minimum requirement. Building Helix in Java
> 8 will help contributors of Helix respond to feature/debugging requests in
> a more timely manner (without having to jump back and forth between Java 7
> and 8).
>
> - The recent change in Maven (Central
> Repository). Long story short, Helix build using JDK 7 on Maven 3.0.5+ will
> fail. Using JDK 8 solves this problem.
>
> The cost of moving to Java 8 is relatively low. Java 7 code is forward
> compatible with Java 8. However, there may be some backporting work needed
> due to the way Java 8 changed the ConcurrentHashMap implementation.
>
> As for Maven, Helix's requirement currently is 3.0.4 which is a version
> just below the required version other dependent Apache projects use (say,
> Pinot). Again,
> to save the contributors the trouble of having to navigate between Maven
> versions, I am also suggesting that we update this requirement.
>
>
> -Hunter
>
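[Editor's note] For the Maven and Java 8 requirements discussed above, a typical pom.xml stanza would look roughly like this. This is a sketch with illustrative version bounds, not values taken from this thread:

  <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
  </properties>

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-enforcer-plugin</artifactId>
    <executions>
      <execution>
        <id>enforce-versions</id>
        <goals><goal>enforce</goal></goals>
        <configuration>
          <rules>
            <requireMavenVersion>
              <version>[3.5.0,)</version>
            </requireMavenVersion>
            <requireJavaVersion>
              <version>[1.8,)</version>
            </requireJavaVersion>
          </rules>
        </configuration>
      </execution>
    </executions>
  </plugin>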


Re: [VOTE] Apache Helix 0.8.4 Release

2019-03-06 Thread kishore g
+1

Looks good to me

On Thu, Feb 28, 2019 at 2:11 PM Hunter Lee  wrote:

> +1
>
> On Thu, Feb 28, 2019 at 1:49 PM Wang Jiajun 
> wrote:
>
> > +1
> >
> > Best Regards,
> > Jiajun
> >
> >
> > On Wed, Feb 27, 2019 at 3:26 PM Lei Xia  wrote:
> >
> > > +1
> > >
> > > On Wed, Feb 27, 2019 at 2:07 PM Xue Junkai  wrote:
> > >
> > > > Hi,
> > > >
> > > >
> > > > This is to call for a vote on releasing the following candidate as
> > Apache
> > > > Helix 0.8.4. This is the 16th release of Helix as an Apache project,
> as
> > > > well as the 12th release as a top-level Apache project.
> > > >
> > > >
> > > > Apache Helix is a generic cluster management framework that makes it
> > easy
> > > > to build partitioned and replicated, fault-tolerant and scalable
> > > > distributed systems.
> > > >
> > > >
> > > > Release notes:
> > > >
> > > > https://helix.apache.org/0.8.4-docs/releasenotes/release-0.8.4.html
> > > >
> > > >
> > > > Release artifacts:
> > > >
> > > >
> https://repository.apache.org/content/repositories/orgapachehelix-1026
> > > >
> > > >
> > > > Distribution:
> > > >
> > > > * binaries:
> > > >
> > > > https://dist.apache.org/repos/dist/dev/helix/0.8.4/binaries/
> > > >
> > > > * sources:
> > > >
> > > > https://dist.apache.org/repos/dist/dev/helix/0.8.4/src/
> > > >
> > > >
> > > > The 0.8.4 release tag:
> > > >
> > > >
> > > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=helix.git;a=tag;h=refs/tags/helix-0.8.4
> > > >
> > > >
> > > > KEYS file available here:
> > > >
> > > > https://dist.apache.org/repos/dist/dev/helix/KEYS
> > > >
> > > >
> > > > Please vote on the release. The vote will be open for at least 72
> > hours.
> > > >
> > > >
> > > > [+1] -- "YES, release"
> > > >
> > > > [0] -- "No opinion"
> > > >
> > > > [-1] -- "NO, do not release"
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > The Apache Helix Team
> > > >
> > >
> >
>


Re: [NOTICE] Mandatory migration of git repos to gitbox.apache.org - one week left!

2019-02-04 Thread kishore g
+1

On Mon, Feb 4, 2019 at 2:43 PM Hao Zhang  wrote:

> +1
> On Mon, Feb 4, 2019 at 14:33 Xue Junkai  wrote:
>
> > +1
> >
> > On Mon, Feb 4, 2019 at 2:20 PM Lei Xia  wrote:
> >
> > > +1
> > >
> > > On Wed, Jan 30, 2019 at 11:28 AM Wang Jiajun 
> > > wrote:
> > >
> > > > +1
> > > >
> > > > Best Regards,
> > > > Jiajun
> > > >
> > > >
> > > > On Wed, Jan 30, 2019 at 12:34 AM Xue Junkai  wrote:
> > > >
> > > > > Thanks Tommaso! Shall we have a separate vote and link it?
> Or
> > > this
> > > > > email thread is fair enough?
> > > > >
> > > > > Best,
> > > > >
> > > > > Junkai
> > > > >
> > > > > On Wed, Jan 30, 2019 at 12:30 AM Tommaso Teofili <
> > > > > tommaso.teof...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I think we should cast a vote, shouldn't we?
> > > > > >
> > > > > > Regards,
> > > > > > Tommaso
> > > > > >
> > > > > > Il giorno mer 30 gen 2019 alle ore 09:27 Xue Junkai <
> > j...@apache.org
> > > >
> > > > > > ha scritto:
> > > > > > >
> > > > > > > Thanks for the reminder. We agree to move to gitbox.
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Junkai
> > > > > > >
> > > > > > > On Wed, Jan 30, 2019 at 12:10 AM Apache Infrastructure Team <
> > > > > > infrastruct...@apache.org> wrote:
> > > > > > >>
> > > > > > >> Hello again, helix folks.
> > > > > > >> This is a reminder that you have *one week left* before the
> > > > mandatory
> > > > > > >> mass-migration from git-wip-us to gitbox.
> > > > > > >>
> > > > > > >> As stated earlier in 2018, and reiterated a few times, all git
> > > > > > >> repositories must be migrated from the git-wip-us.apache.org
> > URL
> > > to
> > > > > > >> gitbox.apache.org, as the old service is being
> decommissioned.
> > > Your
> > > > > > >> project is receiving this email because you still have
> > > repositories
> > > > on
> > > > > > >> git-wip-us that need to be migrated.
> > > > > > >>
> > > > > > >> The following repositories on git-wip-us belong to your
> project:
> > > > > > >>  - helix.git
> > > > > > >>
> > > > > > >>
> > > > > > >> We are now entering the remaining one week of the mandated
> > > > > > >> (coordinated) move stage of the roadmap, and you are asked to
> > > please
> > > > > > >> coordinate migration with the Apache Infrastructure Team
> before
> > > > > February
> > > > > > >> 7th. All repositories not migrated on February 7th will be
> mass
> > > > > migrated
> > > > > > >> without warning, and we'd appreciate it if we could work
> > together
> > > to
> > > > > > >> avoid a big mess that day :-).
> > > > > > >>
> > > > > > >> As stated earlier, moving to gitbox means you will get full
> > write
> > > > > access
> > > > > > >> on GitHub as well, and be able to close/merge pull requests
> and
> > > much
> > > > > > >> more. The move is mandatory for all Apache projects using git.
> > > > > > >>
> > > > > > >> To have your repositories moved, please follow these steps:
> > > > > > >>
> > > > > > >> - Ensure consensus on the move (a link to a lists.apache.org
> > > thread
> > > > > > will
> > > > > > >>   suffice for us as evidence).
> > > > > > >> - Create a JIRA ticket at
> > > > https://issues.apache.org/jira/browse/INFRA
> > > > > > >>
> > > > > > >> Your migration should only take a few minutes. If you wish to
> > > > migrate
> > > > > > >> at a specific time of day or date, please do let us know in
> the
> > > > > ticket,
> > > > > > >> otherwise we will migrate at the earliest convenient time.
> > > > > > >>
> > > > > > >> There will be redirects in place from git-wip to gitbox, so
> > > requests
> > > > > > >> using the old remote origins should still work (however we
> > > encourage
> > > > > > >> people to update their remotes once migration has completed).
> > > > > > >>
> > > > > > >> As always, we appreciate your understanding and patience as we
> > > move
> > > > > > >> things around and work to provide better services and features
> > for
> > > > > > >> the Apache Family.
> > > > > > >>
> > > > > > >> Should you wish to contact us with feedback or questions,
> please
> > > do
> > > > so
> > > > > > >> at: us...@infra.apache.org.
> > > > > > >>
> > > > > > >>
> > > > > > >> With regards,
> > > > > > >> Apache Infrastructure
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > > --
> > > Lei Xia
> > >
> >
> >
> > --
> > Junkai Xue
> >
>


Re: [VOTE] Apache Helix 0.8.3 Release

2019-02-04 Thread kishore g
+1

On Mon, Feb 4, 2019 at 8:07 PM Lei Xia  wrote:

> +1
>
> On Mon, Feb 4, 2019 at 6:56 PM Wang Jiajun  wrote:
>
> > +1
> >
> > On Feb 4, 2019 2:43 PM, "Hao Zhang"  wrote:
> >
> > > +1
> > > On Mon, Feb 4, 2019 at 14:23 Xue Junkai  wrote:
> > >
> > > > Hi,
> > > >
> > > >
> > > > This is to call for a vote on releasing the following candidate as
> > Apache
> > > > Helix 0.8.3. This is the 15th release of Helix as an Apache project,
> as
> > > > well as the 11th release as a top-level Apache project.
> > > >
> > > >
> > > > Apache Helix is a generic cluster management framework that makes it
> > easy
> > > > to build partitioned and replicated, fault-tolerant and scalable
> > > > distributed systems.
> > > >
> > > >
> > > > Release notes:
> > > >
> > > > https://helix.apache.org/0.8.3-docs/releasenotes/release-0.8.3.html
> > > >
> > > >
> > > > Release artifacts:
> > > >
> > > >
> https://repository.apache.org/content/repositories/orgapachehelix-1022
> > > >
> > > >
> > > > Distribution:
> > > >
> > > > * binaries:
> > > >
> > > > https://dist.apache.org/repos/dist/dev/helix/0.8.3/binaries/
> > > >
> > > > * sources:
> > > >
> > > > https://dist.apache.org/repos/dist/dev/helix/0.8.3/src/
> > > >
> > > >
> > > > The 0.8.3 release tag:
> > > >
> > > >
> > > > https://git-wip-us.apache.org/repos/asf?p=helix.git;a=tag;h=refs/tags/helix-0.8.3
> > > >
> > > >
> > > > KEYS file available here:
> > > >
> > > > https://dist.apache.org/repos/dist/dev/helix/KEYS
> > > >
> > > >
> > > > Please vote on the release. The vote will be open for at least 72
> > hours.
> > > >
> > > >
> > > > [+1] -- "YES, release"
> > > >
> > > > [0] -- "No opinion"
> > > >
> > > > [-1] -- "NO, do not release"
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > The Apache Helix Team
> > > >
> > >
> >
>
>
> --
> Lei Xia
>


Re: Regarding Helix releasing 0.8.3

2019-01-11 Thread kishore g
Thanks a lot. Will look into it

On Fri, Jan 11, 2019 at 6:18 PM Wang Jiajun  wrote:

> Hi Kishore,
>
> I have sent a pull request to fix the first 2 issues.
> https://github.com/apache/helix/pull/297
> As for the 3rd one, it requires a much larger scope of change. And
> actually, it does not break any logic now after we fixed the ephemeral node
> owner validation logic. We think it can be scheduled for a future release.
>
> Best Regards,
> Jiajun
>
>
> On Mon, Jan 7, 2019 at 3:57 PM Wang Jiajun  wrote:
>
>> Resending. Reply to all.
>>
>> We can probably fix the first 2 issues within 2 weeks, considering the
>> additional test and validation required.
>> For issue 1, we can make the original reset into 2 methods. For new
>> session handling, we should not interrupt. For client closing, we shall
>> interrupt thread and shut down.
>> For issue 2, we additionally need to try/catch the ZooKeeper NPE.
>>
>> Issue 3 will take more time since we need to change both ZkClient and
>> event handler. There may be some interfaces that need to be updated. Moreover,
>> it changes the current ZkClient behavior. So we'd better run it in the test
>> environment for a longer time.
>>
>> With the ephemeral node's owner fixed, the 3rd issue does not impact
>> correctness. So maybe we can plan for fixing the first 2 issues first? And
>> then plan for the 3rd issue in the next release? If that's the case, we
>> shall have a release candidate after 2 weeks.
>>
>> Best Regards,
>> Jiajun
>>
>>
>> On Mon, Jan 7, 2019 at 3:14 PM kishore g  wrote:
>>
>>> I think the pending issues are the ones that are affecting us. What does
>>> it take to fix those issues?
>>>
>>> On Mon, Jan 7, 2019 at 2:54 PM Wang Jiajun 
>>> wrote:
>>>
>>>> Hi Kishore,
>>>>
>>>> Hope you are doing well.
>>>> Since last time we met to discuss potential ZkClient improvements in
>>>> Helix, we have completed the fix of one issue. However, the resolving of
>>>> the whole list will take more time, given Pinot is still waiting for the
>>>> new release, I'd like to hear your opinion that whether we shall release
>>>> 0.8.3 based on the current situation.
>>>>
>>>> Fixed issues:
>>>>
>>>>1. For an Ephemeral node, the source of truth should be the owner
>>>>session Id instead of the node content.
>>>>This fixes the leader election issue we found in Pinot cluster.
>>>>
>>>> Pending issues:
>>>>
>>>>1. ZkClient should not interrupt the callback handling during
>>>>session reestablishment or other reset logic. Interrupt for shutdown 
>>>> should
>>>>only happen when things are closed. For fixing this problem, we need to
>>>>think about how to handle thread leaking.
>>>>2. ZkConnection.getZookeeper() == null can potentially cause
>>>>retryUntilConnect to terminate earlier than expected. We should keep
>>>>waiting on this error.
>>>>3. The ZkClient event should keep a session Id. The event processor
>>>>can discard expired event.
>>>>
>>>> Best Regards,
>>>> Jiajun
>>>>
>>>


Re: Regarding Helix releasing 0.8.3

2019-01-07 Thread kishore g
I think the pending issues are the ones that are affecting us. What does it
take to fix those issues?

On Mon, Jan 7, 2019 at 2:54 PM Wang Jiajun  wrote:

> Hi Kishore,
>
> Hope you are doing well.
> Since last time we met to discuss potential ZkClient improvements in
> Helix, we have completed the fix of one issue. However, the resolving of
> the whole list will take more time, given Pinot is still waiting for the
> new release, I'd like to hear your opinion that whether we shall release
> 0.8.3 based on the current situation.
>
> Fixed issues:
>
>1. For an Ephemeral node, the source of truth should be the owner
>session Id instead of the node content.
>This fixes the leader election issue we found in Pinot cluster.
>
> Pending issues:
>
>1. ZkClient should not interrupt the callback handling during session
>reestablishment or other reset logic. Interrupt for shutdown should only
>happen when things are closed. For fixing this problem, we need to think
>about how to handle thread leaking.
>2. ZkConnection.getZookeeper() == null can potentially cause
>retryUntilConnect to terminate earlier than expected. We should keep waiting
>on this error.
>3. The ZkClient event should keep a session Id. The event processor
>can discard expired event.
>
> Best Regards,
> Jiajun
>


Re: [GitHub] helix pull request #266: Propose design for aggregated cluster view service

2018-10-23 Thread kishore g
What you are proposing is no different from an Observer (without
consistency guarantees). Initially, this might look simple, but once we
start handling all the edge cases, it will start looking more like an Observer.

“Observers forward these requests to the Leader like Followers do, but they
then simply wait to hear the result of the vote” The documentation is
referring to the writes sent to observers. The use case we are trying to
address involves only reading. Observers do not forward the read requests
to leaders.
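[Editor's note] For context on the operational side of this debate, enabling a ZooKeeper observer is a small zoo.cfg change. A sketch following standard ZooKeeper configuration, with illustrative host names:

  # On the observer node only:
  peerType=observer

  # In the server list on every node; the :observer suffix marks the observer:
  server.1=zk1.example.com:2888:3888
  server.2=zk2.example.com:2888:3888
  server.3=zk3.example.com:2888:3888
  server.4=zk-observer.example.com:2888:3888:observer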



On Tue, Oct 23, 2018 at 5:16 PM Hao Zhang  wrote:

> I agree that it is undoubtedly true that using native ZooKeeper observer
> has big advantage such that it also provides clients with ordered events,
> but our use cases (i.e. administration, federation, or Ambry’s data
> replication and serving requests from remote) are just not latency
> sensitive and therefore having strict event sequence enforcement is likely
> to be an overkill.
>
> In addition, according to zookeeper’s official documentation, “Observers
> forward these requests to the Leader like Followers do, but they then
> simply wait to hear the result of the vote”. Since observers are just
> proxy-ing requests, it cannot actually resolve our real pain point - the
> massive cross data center traffic generated when all storage nodes and
> routers in Ambry cluster need to know information from all data centers,
> which makes it worthwhile to build a customized view aggregation service to
> “cache” information locally.
>
>
> —
> Best,
> Harry
> On Tue, Oct 23, 2018 at 16:24 kishore g  wrote:
>
> > It's better to use observers since the replication is timeline consistent,
> > i.e., changes are seen in the same order as they happened on the
> originating
> > cluster. Achieving correctness is easier with the observer model. I agree
> that
> > we might have to replicate changes we don't care about, but changes to ZK are
> > multiple orders of magnitude smaller than replicating a database.
> >
> > You can still have the aggregation logic as part of the client library.
> >
> >
> > On Tue, Oct 23, 2018 at 2:02 PM zhan849  wrote:
> >
> > > Github user zhan849 commented on a diff in the pull request:
> > >
> > > https://github.com/apache/helix/pull/266#discussion_r227562948
> > >
> > > --- Diff: designs/aggregated-cluster-view/design.md ---
> > > @@ -0,0 +1,353 @@
> > > +Aggregated Cluster View Design
> > > +==
> > > +
> > > +## Introduction
> > > +Currently Helix organizes information by cluster - clusters are
> > > autonomous entities that holds resource / node information.
> > > +In real practice, a helix client might need to access aggregated
> > > information of helix clusters from different data center regions for
> > > management or coordination purposes.
> > > +This design proposes a service in Helix ecosystem for clients to
> > > retrieve cross-datacenter information in a more efficient way.
> > > +
> > > +
> > > +## Problem Statement
> > > +We identified a couple of use cases for accessing cross datacenter
> > > information. [Ambry](https://github.com/linkedin/ambry) is one of
> them.
> > > +Here is a simplified example: some service has Helix cluster
> > > "MyDBCluster" in 3 data centers respectively, and each cluster has a
> > > resource named "MyDB".
> > > +To federate this "MyDBCluster", current usage is to have each
> > > federation client (usually a Helix spectator) connect to metadata
> store
> > > endpoints in all fabrics to retrieve information and aggregate them
> > locally.
> > > +Such usage has the following drawbacks:
> > > +
> > > +* As there are a lot of clients in each DC that need cross-dc
> > > information, there is a lot of expensive cross-dc traffic
> > > +* Every client needs to know information about metadata stores in
> > all
> > > fabrics which
> > > +  * Increases operational cost when these information changes
> > > +  * Increases security concern by allowing cross data center
> traffic
> > > +
> > > +To solve the problem, we have the following requirements:
> > > +* Clients should still be able to GET/WATCH aggregated information
> > > from 1 or more metadata stores (likely but not necessarily from
> different
> > > data centers)
> > > +* Cross DC traffic should be minimized
> > > +* Reduce amount of information abou

Re: Cloning a Helix Workflow

2018-09-12 Thread kishore g
Lei, do you know if there is a way to restart the workflow?

On Wed, Sep 12, 2018 at 10:07 AM DImuthu Upeksha 
wrote:

> Any update on this ?
>
> On Wed, Apr 4, 2018 at 9:10 AM DImuthu Upeksha  >
> wrote:
>
> > Hi Folks,
> >
> > I'm running 50 -100 Helix Task Workflows at a time and due to some
> > unexpected issues, some workflows go into the failed state. Is there a
> way
> > I can retry those workflows from the beginning or clone new workflows
> from
> > them and run as fresh workflows?
> >
> > Thanks
> > Dimuthu
> >
>


Re: [VOTE] Apache Helix 0.8.1 Release

2018-04-26 Thread kishore g
+1

On Thu, Apr 26, 2018 at 11:57 AM, Vivo Xu  wrote:

> +1
> On Thu, Apr 26, 2018 at 11:34 AM Lei Xia  wrote:
>
> > +1
> >
> >
> >
> > Lei Xia
> >
> > 
> > From: Eric Kim 
> > Sent: Thursday, April 26, 2018 10:47:58 AM
> > To: dev@helix.apache.org
> > Subject: [VOTE] Apache Helix 0.8.1 Release
> >
> > Hi,
> >
> >
> >
> > This is to call for a vote on releasing the following candidate as Apache
> > Helix 0.8.1. This is the thirteenth release of Helix as an Apache
> project,
> > as well as the ninth release as a top-level Apache project.
> >
> >
> >
> > Apache Helix is a generic cluster management framework that makes it easy
> > to build partitioned and replicated, fault-tolerant and scalable
> > distributed systems.
> >
> >
> >
> > Release notes:
> >
> > https://helix.apache.org/0.8.1-docs/releasenotes/release-0.8.1.html
> >
> >
> >
> > Release artifacts:
> >
> > https://repository.apache.org/content/repositories/orgapachehelix-1016
> >
> >
> >
> > Distribution:
> >
> > * binaries:
> >
> > https://dist.apache.org/repos/dist/dev/helix/0.8.1/binaries/
> >
> > * sources:
> >
> > https://dist.apache.org/repos/dist/dev/helix/0.8.1/src/
> >
> >
> >
> > The 0.8.1 release tag:
> >
> >
> > https://git-wip-us.apache.org/repos/asf?p=helix.git;a=tag;h=refs/tags/helix-0.8.1
> >
> >
> >
> > KEYS file available here:
> >
> > https://dist.apache.org/repos/dist/dev/helix/KEYS
> >
> >
> >
> > Please vote on the release. The vote will be open for at least 72 hours.
> >
> >
> >
> > [+1] -- "YES, release"
> >
> > [0] -- "No opinion"
> >
> > [-1] -- "NO, do not release"
> >
> >
> >
> > Thanks,
> >
> > The Apache Helix Team
> >
> >
>


Re: IRC link is broken

2018-04-03 Thread kishore g
Hi Rakesh,

We have stopped using IRC and will remove that link. You can send your
questions to dev@helix.apache.org.

thanks,
Kishore G

On Tue, Apr 3, 2018 at 8:49 AM, Rakesh Kumar <rakeshcu...@gmail.com> wrote:

> Hi,
>
> This is my first email to this group so pardon my ignorance.
>
> I am trying IRC link (https://helix.apache.org/IRC.html) but it seems the
> information given on this page is not correct. I tried to add
> chat.freenode.net server on my IRC client but it refused the connection. I
> tried to ping this server but couldn't get any ping response.
> Can anyone provide me correct IRC information so I can connect my IRC
> client?
> Let me know if I am missing something.
>
> Thank you,
> Rakesh Kumar
>


Re: [VOTE] Apache Helix 0.8.0 Release

2018-01-29 Thread kishore g
+1.

On Mon, Jan 29, 2018 at 6:13 PM, Bo Liu  wrote:

> +1 can't wait to try V2 restful API and UI.
>
> On Mon, Jan 29, 2018 at 5:53 PM, Lei Xia  wrote:
>
> > +1
> >
> > On Mon, Jan 29, 2018 at 5:13 PM, Xue Junkai  wrote:
> >
> > > Hi,
> > >
> > > This is to call for a vote on releasing the following candidate as
> Apache
> > > Helix 0.8.0. This is the 13th release of Helix as an Apache project, as
> > > well as the 9th release as a top-level Apache project.
> > >
> > > Apache Helix is a generic cluster management framework that makes it
> easy
> > > to build partitioned and replicated, fault-tolerant and scalable
> > > distributed systems.
> > >
> > > Release notes:
> > > http://helix.apache.org/0.8.0-docs/releasenotes/release-0.8.0.html
> > >
> > > Release artifacts:
> > > https://repository.apache.org/content/repositories/orgapachehelix-1013
> > >
> > > Distribution:
> > > * binaries:
> > > https://dist.apache.org/repos/dist/dev/helix/0.8.0/binaries/
> > > * sources:
> > > https://dist.apache.org/repos/dist/dev/helix/0.8.0/src/
> > >
> > > The 0.8.0 release tag:
> > > https://git-wip-us.apache.org/repos/asf?p=helix.git;a=tag;h=refs/tags/helix-0.8.0
> > >
> > > KEYS file available here:
> > > https://dist.apache.org/repos/dist/dev/helix/KEYS
> > >
> > > Please vote on the release. The vote will be open for at least 72
> hours.
> > >
> > > [+1] -- "YES, release"
> > > [0] -- "No opinion"
> > > [-1] -- "NO, do not release"
> > >
> > > Thanks,
> > > The Apache Helix Team
> > >
> >
>
>
>
> --
> Best regards,
> Bo
>


Re: Helix-UI status

2018-01-24 Thread kishore g
I think the google source is just a mirror, you can ignore that.

This was something done by Greg Brandt. We did not have the bandwidth to
bring it to master as we did not deploy it in production. Feel free to take
this as the base and make it work in production and contribute it back to
Helix.

Also, take a look at helix-front. This is the UI currently used at LinkedIn
and is tested in production. It might be a good idea to bring in the
helix-ui code into helix-front.

On Wed, Jan 24, 2018 at 12:07 PM, Yulun Li  wrote:

> We are building tools for our Helix application and we are considering
> helix-ui. However, we found a few links, such as
>
> https://apache.googlesource.com/helix/+/master/helix-ui/
> https://github.com/apache/helix/tree/helix-0.7.x/helix-ui
>
> A few questions:
>
>1. Can we just ignore the repo on Google Source?
>2. Are there risks of using Helix-UI with Helix 0.6.8 deployment?
>3. What's the current status and feature roadmap for Helix UI?
>
> Thanks.
>


Re: Using job specific Participants in Helix

2017-11-16 Thread kishore g
Hi Dlmuthu,

You can achieve this by using the tag feature in Helix. For example, if
Participants P1 and P2 should only handle DataCollecting tasks, you need to do
the following (see the sketch below this list):

   - Tag Participants P1 and P2 as "DataCollecting"
   - When you create a job resource that represents the DataCollecting task,
   you also have to tag the resource as DataCollecting.

Helix will then assign all tasks in this job to nodes that are tagged as
DataCollecting.
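[Editor's note] A minimal sketch of those two steps, assuming the standard HelixAdmin and IdealState APIs; the cluster, instance, and resource names are illustrative:

  import org.apache.helix.HelixAdmin;
  import org.apache.helix.manager.zk.ZKHelixAdmin;
  import org.apache.helix.model.IdealState;

  public class TagSetup {
    public static void main(String[] args) {
      // Illustrative ZK address and cluster name
      HelixAdmin admin = new ZKHelixAdmin("localhost:2181");

      // Step 1: tag the participants that should handle DataCollecting tasks
      admin.addInstanceTag("MyCluster", "P1", "DataCollecting");
      admin.addInstanceTag("MyCluster", "P2", "DataCollecting");

      // Step 2: tag the job resource so its partitions only go to tagged instances
      IdealState idealState = admin.getResourceIdealState("MyCluster", "DataCollectingJob");
      idealState.setInstanceGroupTag("DataCollecting");
      admin.setResourceIdealState("MyCluster", "DataCollectingJob", idealState);
    }
  }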

Note this is available only in FULL_AUTO rebalance mode. I am not sure if
the rebalancer in task framework supports this feature.

Lei/Junkai, do you know if this is supported in Task Framework?

thanks


On Thu, Nov 16, 2017 at 8:31 PM, Ajinkya Dhamnaskar 
wrote:

> Dlmuthu,
>
> This explains your need. Could you please point me to the code? I was
> wondering, where and how are we registering callbacks for the participants?
>
> On Thu, Nov 16, 2017 at 7:16 PM, DImuthu Upeksha <
> dimuthu.upeks...@gmail.com
> > wrote:
>
> > Adding Helix Dev
> >
> > Hi Ajinkya
> >
> > Thank you for the explanation.
> >
> > Let me explain my requirement. [1] is the correct task registry
> > configuration in a working participant code. All the transition call
> backs
> > are registered in this code. The problem here is, we have to bundle
> > binaries of all the tasks into a single Participant. If we want to change
> > one task, we need to rebuild the participant with other tasks as well.
> What
> > I thought is, why can't we build Participants that only do specific set
> of
> > tasks. For example, in the cluster there are 3 participants, Participant
> 1
> > [2], Participant 2 [3] and Participant 3 [4]. If we want to change the
> > DataCollecting task, we only need to rebuild Participant 2. I got the above
> > error once I ran 3 participants in the above configuration. I can understand
> > this is due to the missing transition callback of CommandTask in
> > Participant 1, as I have purposely commented it out. What I need to know
> is
> > whether this configuration is allowed in the Helix architecture or do we have
> to
> > implement Participants that contain all the task implementations as in
> [1].
> > Workflow configuration for both scenarios is here [5].
> >
> > [1] https://gist.github.com/DImuthuUpe/41c0db579e7d86d101d112f07ed6ea00
> > [2] https://gist.github.com/DImuthuUpe/ec72df1ec3207ce2dce88ff7f1756da4
> > [3] https://gist.github.com/DImuthuUpe/4d193a3dff3008315efa2e31f6209cac
> > [4] https://gist.github.com/DImuthuUpe/872c432935b8d33944dd571b3ac4207b
> > [5] https://gist.github.com/DImuthuUpe/f61851b68b685b8d6744689dc130babd
> >
> > Thanks
> > Dimuthu
> >
> > On Fri, Nov 17, 2017 at 3:21 AM, Ajinkya Dhamnaskar <
> adham...@umail.iu.edu
> > > wrote:
> >
> >> Hey Dlmuthu,
> >>
> >> Not an expert in Helix, but from the exceptions it seems the system is
> >> entering a state not expected by reflection. I feel
> >> https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/messaging/handling/HelixStateTransitionHandler.java#L295
> >> is triggering this exception.
> >> As mentioned in the later part of the stack trace and from the Helix Apache
> >> Docs ("Helix
> >> is built on the following assumption: if your distributed resource is
> >> modeled by a finite state machine, then Helix can tell participants when
> >> they should transition between states. In the Java API, this means
> >> implementing transition callbacks. In the Helix agent API, this means
> >> providing commands that can run for each transition"), did you
> >> implement transition callbacks for these tasks?
> >>
> >> On Thu, Nov 16, 2017 at 10:01 AM, DImuthu Upeksha <
> >> dimuthu.upeks...@gmail.com> wrote:
> >>
> >>> Hi Devs,
> >>>
> >>> I'm working on the technology evaluation to re architecture Apache
> >>> Airavata task execution framework and Helix seems like a good
> candidate for
> >>> that as it has an in built distributed generic workflow execution
> >>> capability. After going through several tutorials, I tried to
> implement a
> >>> simple workflow on Helix to demonstrate following transition
> >>>
> >>> Data Collecting Job -> Command Executing Job -> Data Uploading Job
> >>>
> >>> I managed to implement this using a Participant node that includes all
> >>> the tasks required for above workflow. However my goal is to implement
> >>> specialized Participants for each Job type. For example, Participant 1
> >>> knows only about the tasks to perform Data Collecting Job and
> Participant
> >>> knows only about the task to perform Command Executing Job. When I
> tried to
> >>> implement such Participants, I got following error from Participant 1.
> I
> >>> can share the code samples that I have tried but before that I need to
> know
> >>> whether my approach is compatible with Helix's design? Does Helix
> require
> >>> all the Participants to be homogeneous?
> >>>
> >>> Executing data collecting
> >>> 675892 

Re: External view change notifications to clients

2017-11-11 Thread kishore g
Yes, that’s a good idea: read the external view when the server returns an error.
Add an exponential backoff policy here. Make sure the server returns a proper
error code indicating it’s no longer the master; you don’t want to read the
external view on every error.
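[Editor's note] A rough sketch of that client-side pattern, reusing the getResourceExternalView call Mahesh mentions below; the retry bound, backoff values, and null-handling are illustrative assumptions:

  import org.apache.helix.HelixManager;
  import org.apache.helix.model.ExternalView;

  public class MasterLookup {
    // Re-read the external view with exponential backoff. Call this only when
    // the server returns the "no longer master" error code, not on every error.
    static ExternalView refreshView(HelixManager manager, String cluster, String resource)
        throws InterruptedException {
      long backoffMs = 100; // illustrative initial delay
      for (int attempt = 0; attempt < 5; attempt++) {
        ExternalView view =
            manager.getClusterManagmentTool().getResourceExternalView(cluster, resource);
        if (view != null) {
          return view; // the caller re-resolves the current master from this view
        }
        Thread.sleep(backoffMs);
        backoffMs = Math.min(backoffMs * 2, 5000);
      }
      return null; // give up and surface the error to the caller
    }
  }

(getClusterManagmentTool is the method's actual spelling on HelixManager.)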



On Fri, Nov 10, 2017 at 6:11 PM leela maheswararao <
leela_mahesh...@yahoo.com> wrote:

> Thanks Kishore and Xue for quick reply.
>
> Basically my scenario is like this.
>
> Initially assume partition P1(Master) is on node N1. As part of new node
> addition, P1 (M) moved to N2. It's possible that client C1
> still sees P1(M) on N1 due to late processing of notification and client
> C2 sees P1(M) on N2 due to immediate processing of notification.
>
> Agree that getting consensus is a bit tough. One way our application client
> can solve this: when the server receives a PUT/GET for a partition it doesn't
> own, it returns an error, and the client calls the API below to make an
> explicit ZooKeeper call, get the current state, and invoke the operation on
> the right server.
>
> manager.getClusterManagmentTool().getResourceExternalView(clusterName,
> resourceName);
>
> I'm assuming the above API makes an explicit call to ZK.
>
> do you see any other alternative solution?
>
> Regards,
> Mahesh
> On Friday, November 10, 2017, 9:49:57 PM GMT+5:30, kishore g <
> g.kish...@gmail.com> wrote:
>
>
> Let's break it into two parts. The update to the ExternalView is done by the
> controller. The clients are notified of the change through a ZooKeeper
> callback. It's not guaranteed that all clients will receive the callback at
> the same time.
>
> In general, it's not a good idea to rely on every client seeing the same
> view at the same time (it's impossible to achieve this in a distributed
> system). However, the view is timeline consistent. For example, if the
> controller changes the external view to EV(t1), EV(t2), EV(t3), the clients
> will get notified in the same order. Another thing you should be aware of is
> that if these changes happen in quick succession, it's possible that the
> client only sees EV(t1) and EV(t3).
>
> Can you provide more details on what you are planning to achieve? We can
> suggest the right design.
>
>
> On Fri, Nov 10, 2017 at 4:05 AM, Xue Junkai <junkai@gmail.com> wrote:
>
> > If you attach the external view listener through HelixManager, all the
> > listeners will be notified at the same time.
> >
> > On Fri, Nov 10, 2017 at 4:02 AM, leela maheswararao <
> > leela_mahesh...@yahoo.com.invalid> wrote:
> >
> > > Team, does Helix ensure that all clients see the same external view at the
> same
> > > time? Or should the application handle this?
> > >
> > > Regards, Mahesh
> >
> >
> >
> >
> > --
> > Junkai Xue
> >
>


Re: External view change notifications to clients

2017-11-10 Thread kishore g
Let's break it into two parts. The update to the ExternalView is done by the
controller. The clients are notified of the change through a ZooKeeper
callback. It's not guaranteed that all clients will receive the callback at
the same time.

In general, it's not a good idea to rely on every client seeing the same
view at the same time (it's impossible to achieve this in a distributed
system). However, the view is timeline consistent. For example, if the
controller changes the external view to EV(t1), EV(t2), EV(t3), the clients
will get notified in the same order. Another thing you should be aware of is
that if these changes happen in quick succession, it's possible that the
client only sees EV(t1) and EV(t3).

Can you provide more details on what you are planning to achieve? We can
suggest the right design.
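[Editor's note] For reference, a minimal sketch of attaching such a listener, as Junkai describes below; the listener interface package differs across Helix versions, and the manager setup is elided:

  import java.util.List;
  import org.apache.helix.ExternalViewChangeListener;
  import org.apache.helix.HelixManager;
  import org.apache.helix.NotificationContext;
  import org.apache.helix.model.ExternalView;

  public class ViewWatcher implements ExternalViewChangeListener {
    @Override
    public void onExternalViewChange(List<ExternalView> externalViewList,
                                     NotificationContext changeContext) {
      // Callbacks arrive in timeline order, but intermediate views may be skipped.
      for (ExternalView view : externalViewList) {
        System.out.println("External view changed for: " + view.getResourceName());
      }
    }
  }

  // Registration, given a connected manager:
  // manager.addExternalViewChangeListener(new ViewWatcher());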


On Fri, Nov 10, 2017 at 4:05 AM, Xue Junkai  wrote:

> If you attach the external view listener through HelixManager, all the
> listeners will be notified at the same time.
>
> On Fri, Nov 10, 2017 at 4:02 AM, leela maheswararao <
> leela_mahesh...@yahoo.com.invalid> wrote:
>
> > Team, does Helix ensure that all clients see the same external view at the same
> > time? Or should the application handle this?
> >
> > Regards, Mahesh
>
>
>
>
> --
> Junkai Xue
>


Documentation of setting up controller as service

2017-08-09 Thread kishore g
https://stackoverflow.com/questions/45406255/how-to-setup-apache-helix-controller-in-controller-as-a-service-mode

Does anyone know the steps to set up the controller as a service? Let's
update the documentation and also answer the question on Stack Overflow.
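[Editor's note] Until the documentation is updated, here is a rough sketch of the distributed-controller setup that controller-as-a-service builds on, assuming the HelixControllerMain and ClusterSetup APIs. The ZK address and cluster names are illustrative, and this is a starting point rather than verified step-by-step documentation:

  import org.apache.helix.controller.HelixControllerMain;
  import org.apache.helix.tools.ClusterSetup;

  public class ControllerService {
    public static void main(String[] args) throws Exception {
      String zkAddr = "localhost:2181"; // illustrative
      ClusterSetup setup = new ClusterSetup(zkAddr);

      // A "grand cluster" of controllers manages the regular clusters.
      setup.addCluster("CONTROLLER_CLUSTER", false);
      setup.addCluster("MyCluster", false);
      setup.addClusterToGrandCluster("MyCluster", "CONTROLLER_CLUSTER");

      // Start a controller in DISTRIBUTED mode against the grand cluster; it
      // takes part in leader election and picks up the managed clusters.
      HelixControllerMain.startHelixController(zkAddr, "CONTROLLER_CLUSTER",
          "controller_1", HelixControllerMain.DISTRIBUTED);
    }
  }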


Re: Apache Helix Docker

2017-08-09 Thread kishore g
+ helix dev

I remember someone from Turn did this. It should not be hard to generate a
Docker image for Helix. Note, we can only generate a Docker image for the Helix
Controller. Participants and Spectators generally embed Helix as a library.

If you can provide more info about your deployment strategy, we will guide
you further.

thanks,
Kishore G

On Mon, Aug 7, 2017 at 10:09 AM, Meenashisundaram, Rathinaganesh <
rathinaganesh.meenashisunda...@verizonwireless.com> wrote:

> Greetings,
>
>
>
> I am working with Helix to get deployed in our Verizon production
> environment and wanted to dockerize it.
>
>
>
> However, I did not find any docker image from Helix and was wondering if
> it is not a recommended practice?
>
>
>
> Is there anyone I can talk to about it? Or Can you please point me in the
> right direction.
>
>
>
> Thanks,
>
> Ganesh
>
>
>
>
>
>
>


Re: Latest Helix blog

2017-07-27 Thread kishore g
Nice article. Yes, we should add the publications to the home page. Also,
provide a link to the Powered By Helix page.

On Thu, Jul 27, 2017 at 9:47 AM, Lei Xia  wrote:

> Hi, All
>
>   This is the latest Helix blog we have published,
> https://engineering.linkedin.com/blog/2017/07/powering-
> helix_s-auto-rebalancer-with-topology-aware-partition-p,
> please feel free to share it.
>
>   Maybe we should add a page to our Apache website to include all of the
> publications, talks, and blogs about Helix? If no one objects, I will go
> ahead and add such a page.
>
>
> Thanks
> Lei
>


Re: Branches rearrangement

2017-06-22 Thread kishore g
Thanks, Junkai. Can we first separate the APIs into a separate
helix-api module before we make any release on master? We can also start
with a new major version and begin versioning from 2.0.0.

thanks
Kishore G,




On Thu, Jun 22, 2017 at 4:28 PM, Lei Xia <l...@apache.org> wrote:

> Thanks Junkai for working on this!
>
>
> Lei
>
> On Thu, Jun 22, 2017 at 4:23 PM, Xue Junkai <j...@apache.org> wrote:
>
> > Hi All,
> >
> > Branch switches are done now! Development work will now happen in the
> > master branch, which is forked from helix-0.6.x. Please update your
> forked
> > repo or re-fork it if necessary.
> >
> > Feel free to let us know if you have any questions or concerns!
> >
> > Best,
> >
> > Junkai
> >
> > On Mon, Jun 19, 2017 at 4:33 PM, Xue Junkai <j...@apache.org> wrote:
> >
> > > Hi All,
> > >
> > > Here're some heads ups for branch rearrangements:
> > >
> > >1. There will not be any features checked into helix-0.6.x, except
> > >some hot fixes.
> > >2. Master branch will be moved to helix-0.7.x for 0.7 version bug
> > >fixing.
> > >3. Move helix-0.6.x branch to Master branch and try our best to make
> > >it compatible with helix-0.7.x
> > >
> > > Please let us know if you have any suggestions or questions regarding
> > this!
> > >
> > > Best,
> > >
> > > Junkai
> > >
> >
>


Re: [VOTE] Apache Helix 0.6.8 Release

2017-06-12 Thread kishore g
+1. Thanks for driving this Junkai.

On Mon, Jun 12, 2017 at 7:55 PM, Lei Xia  wrote:

> +1
>
> On Mon, Jun 12, 2017 at 4:44 PM, Xue Junkai  wrote:
>
> > Hi,
> >
> > This is to call for a vote on releasing the following candidate as Apache
> > Helix 0.6.8. This is the eleventh release of Helix as an Apache project,
> as
> > well as the seventh release as a top-level Apache project.
> >
> > Apache Helix is a generic cluster management framework that makes it easy
> > to build partitioned and replicated, fault-tolerant and scalable
> > distributed systems.
> >
> > Release notes:
> > http://helix.apache.org/0.6.8-docs/releasenotes/release-0.6.8.html
> >
> > Release artifacts:
> > https://repository.apache.org/content/repositories/orgapachehelix-1010
> >
> > Distribution:
> > * binaries:
> > https://dist.apache.org/repos/dist/dev/helix/0.6.8/binaries/
> > * sources:
> > https://dist.apache.org/repos/dist/dev/helix/0.6.8/src/
> >
> > The [VERSION] release tag:
> > https://git-wip-us.apache.org/repos/asf?p=helix.git;a=tag;h=refs/tags/helix-0.6.8
> >
> > KEYS file available here:
> > https://dist.apache.org/repos/dist/dev/helix/KEYS
> >
> > Please vote on the release. The vote will be open for at least 72 hours.
> >
> > [+1] -- "YES, release"
> > [0] -- "No opinion"
> > [-1] -- "NO, do not release"
> >
> > Thanks,
> > The Apache Helix Team
> >
>
>
>
> --
> Lei Xia
>


Re: Generate Helix release 0.6.8

2017-05-10 Thread kishore g
Yes. Do you have a PR for that? I can review it.

On Wed, May 10, 2017 at 11:19 AM, Xue Junkai <junkai@gmail.com> wrote:

> Sure! Please let me know if this change works or not. BTW, will the customized
> batch message threadpool be included in this release?
>
> Best,
>
> Junkai
>
> On Tue, May 9, 2017 at 7:28 PM, kishore g <g.kish...@gmail.com> wrote:
>
> > I would like to have that fix included for Pinot. I will test the patch.
> >
> > On Tue, May 9, 2017 at 5:59 PM, Xue Junkai <junkai@gmail.com> wrote:
> >
> > > It does contain the batchMessage thread pool fix. But for race
> condition
> > > fix I withdraw the pull request since I am not quite sure whether the
> fix
> > > works or not. In addition, this release will include the
> > > AutoRebalanceStrategy not assign replicas fix.
> > >
> > >
> > > Best,
> > >
> > > Junkai
> > >
> > > On Tue, May 9, 2017 at 5:49 PM, kishore g <g.kish...@gmail.com> wrote:
> > >
> > > > Does this include the batchMessage thread pool fix and fix to the
> race
> > > > condition
> > > >
> > > > On Tue, May 9, 2017 at 5:08 PM, Xue Junkai <junkai@gmail.com>
> > wrote:
> > > >
> > > > > Hi Helix Devs,
> > > > >
> > > > > I am going to work on releasing Helix 0.6.8 this week. Please let
> me
> > > know
> > > > > if you have any questions, comments and concerns.
> > > > >
> > > > > Best,
> > > > >
> > > > > Junkai Xue
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Junkai Xue
> > >
> >
>
>
>
> --
> Junkai Xue
>


Re: Generate Helix release 0.6.8

2017-05-09 Thread kishore g
Does this include the batchMessage thread pool fix and the fix to the race
condition?

On Tue, May 9, 2017 at 5:08 PM, Xue Junkai  wrote:

> Hi Helix Devs,
>
> I am going to work on releasing Helix 0.6.8 this week. Please let me know
> if you have any questions, comments and concerns.
>
> Best,
>
> Junkai Xue
>


Re: Our github repo is down?

2017-04-28 Thread kishore g
https://git-wip-us.apache.org/repos/asf?p=helix.git;a=summary This is
accessible.

On Fri, Apr 28, 2017 at 10:06 AM, kishore g <g.kish...@gmail.com> wrote:

> I can't access it either.
>
> On Fri, Apr 28, 2017 at 10:00 AM, Lei Xia <l...@apache.org> wrote:
>
>> Seems our github repo is down?  https://github.com/apache/helix, or is it
>> just me who cannot access it?
>>
>>
>>
>> Lei
>>
>
>


Re: Our github repo is down?

2017-04-28 Thread kishore g
I can't access it either.

On Fri, Apr 28, 2017 at 10:00 AM, Lei Xia  wrote:

> Seems our github repo is down?  https://github.com/apache/helix, or is it
> just me who cannot access it?
>
>
>
> Lei
>


Re: many exceptions during recovery from a total shutdown

2017-04-05 Thread kishore g
We will need more logs to debug this; you can grep only for helix/zk related
logs.

This should not happen in general. A couple of scenarios where this can happen:

   - The Participant was already in the middle of a transition when you
   shut it down.
   - The Participant GC'ed as soon as it started and lost its connection to ZK.

Was this a clean shutdown? Did you wait long enough for the liveinstance to
disappear?

What do you have in this line?

at com.hcd.hcdadmin.CustomMessageHandlerFactory$CustomMessageHandler.
handleMessage(CustomMessageHandlerFactory.java:149

This does not appear to be executing a state transition; if so, who is sending
these custom messages to the participant?


On Wed, Apr 5, 2017 at 7:34 PM, Neutron Sharc <neutronsh...@gmail.com>
wrote:

> We are using helix-0.7.1
>
> All these logs are from participants, not from controller.
>
> -Shawn
>
>
> On Wed, Apr 5, 2017 at 7:09 PM, kishore g <g.kish...@gmail.com> wrote:
> > Hi Shawn,
> >
> > Are the logs on the participant or controller? what is the helix version?
> >
> >
> >
> > On Wed, Apr 5, 2017 at 6:36 PM, Neutron Sharc <neutronsh...@gmail.com>
> > wrote:
> >
> >> Hi all,
> >>
> >> We are testing a failure recovery scenario where I have many resources
> >> spanning many participants.  I shut down all participants and helix
> >> admins, wait a while, then add each participant back into the cluster.
> >> (zookeeper is on a separate cluster, not affected by the shutdown.)  During
> >> the recovery, it seems the controller generates too many messages, and
> >> there are so many exceptions.  Below are some examples.
> >>
> >> Are these exceptions expected?   Any comments are highly appreciated.
> >> Thanks.
> >>
> >>
> >> [ERROR 2017-04-05 14:26:17,734
> >> org.apache.helix.manager.zk.ZkBaseDataAccessor:303] Exception while
> >> updating path: /yy_cluster_name/INSTANCES/P60505029461/ERRORS/
> >> 1002d87a25a0589/USER_DEFINE_MSG/15ae0bd8-10
> >> 1f-4af3-acc3-36a486af4f4c
> >> org.I0Itec.zkclient.exception.ZkInterruptedException:
> >> java.lang.InterruptedException
> >>at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.
> >> java:687)
> >>at org.apache.helix.manager.zk.ZkClient.readData(ZkClient.
> java:240)
> >>at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761)
> >>at org.apache.helix.manager.zk.ZkBaseDataAccessor.doUpdate(
> >> ZkBaseDataAccessor.java:273)
> >>at org.apache.helix.manager.zk.ZkBaseDataAccessor.update(
> >> ZkBaseDataAccessor.java:245)
> >>at org.apache.helix.manager.zk.ZKHelixDataAccessor.
> updateProperty(
> >> ZKHelixDataAccessor.java:150)
> >>at org.apache.helix.util.StatusUpdateUtil.publishErrorRecord(
> >> StatusUpdateUtil.java:501)
> >>at org.apache.helix.util.StatusUpdateUtil.
> >> publishStatusUpdateRecord(StatusUpdateUtil.java:435)
> >>at org.apache.helix.util.StatusUpdateUtil.
> >> logMessageStatusUpdateRecord(StatusUpdateUtil.java:334)
> >>at org.apache.helix.util.StatusUpdateUtil.logError(
> >> StatusUpdateUtil.java:342)
> >>at org.apache.helix.messaging.handling.HelixTask.call(
> >> HelixTask.java:163)
> >>at org.apache.helix.messaging.handling.HelixTask.call(
> >> HelixTask.java:42)
> >>at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >>at java.util.concurrent.ThreadPoolExecutor.runWorker(
> >> ThreadPoolExecutor.java:1142)
> >>at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> >> ThreadPoolExecutor.java:617)
> >>at java.lang.Thread.run(Thread.java:745)
> >> Caused by: java.lang.InterruptedException
> >>at java.lang.Object.wait(Native Method)
> >>at java.lang.Object.wait(Object.java:502)
> >>at org.apache.zookeeper.ClientCnxn.submitRequest(
> >> ClientCnxn.java:1344)
> >>at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:925)
> >>at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:956)
> >>at org.I0Itec.zkclient.ZkConnection.readData(
> ZkConnection.java:103)
> >>at org.apache.helix.manager.zk.ZkClient$4.call(ZkClient.java:
> 244)
> >>at org.apache.helix.manager.zk.ZkClient$4.call(ZkClient.java:
> 240)
> >>at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.
> >> java:675)
> >>  

Re: many exceptions during recovery from a total shutdown

2017-04-05 Thread kishore g
Hi Shawn,

Are the logs on the participant or controller? what is the helix version?



On Wed, Apr 5, 2017 at 6:36 PM, Neutron Sharc 
wrote:

> Hi all,
>
> We are testing a failure recovery scenario where I have many resources
> spanning many participants.  I shut down all participants and helix
> admins, wait a while, then add each participant back into the cluster.
> (zookeeper is on a separate cluster, not affected by the shutdown.)  During
> the recovery, it seems the controller generates too many messages, and
> there are so many exceptions.  Below are some examples.
>
> Are these exceptions expected?   Any comments are highly appreciated.
> Thanks.
>
>
> [ERROR 2017-04-05 14:26:17,734
> org.apache.helix.manager.zk.ZkBaseDataAccessor:303] Exception while
> updating path: /yy_cluster_name/INSTANCES/P60505029461/ERRORS/
> 1002d87a25a0589/USER_DEFINE_MSG/15ae0bd8-10
> 1f-4af3-acc3-36a486af4f4c
> org.I0Itec.zkclient.exception.ZkInterruptedException:
> java.lang.InterruptedException
>at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.
> java:687)
>at org.apache.helix.manager.zk.ZkClient.readData(ZkClient.java:240)
>at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761)
>at org.apache.helix.manager.zk.ZkBaseDataAccessor.doUpdate(
> ZkBaseDataAccessor.java:273)
>at org.apache.helix.manager.zk.ZkBaseDataAccessor.update(
> ZkBaseDataAccessor.java:245)
>at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(
> ZKHelixDataAccessor.java:150)
>at org.apache.helix.util.StatusUpdateUtil.publishErrorRecord(
> StatusUpdateUtil.java:501)
>at org.apache.helix.util.StatusUpdateUtil.
> publishStatusUpdateRecord(StatusUpdateUtil.java:435)
>at org.apache.helix.util.StatusUpdateUtil.
> logMessageStatusUpdateRecord(StatusUpdateUtil.java:334)
>at org.apache.helix.util.StatusUpdateUtil.logError(
> StatusUpdateUtil.java:342)
>at org.apache.helix.messaging.handling.HelixTask.call(
> HelixTask.java:163)
>at org.apache.helix.messaging.handling.HelixTask.call(
> HelixTask.java:42)
>at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
>at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
>at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
>at java.lang.Object.wait(Native Method)
>at java.lang.Object.wait(Object.java:502)
>at org.apache.zookeeper.ClientCnxn.submitRequest(
> ClientCnxn.java:1344)
>at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:925)
>at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:956)
>at org.I0Itec.zkclient.ZkConnection.readData(ZkConnection.java:103)
>at org.apache.helix.manager.zk.ZkClient$4.call(ZkClient.java:244)
>at org.apache.helix.manager.zk.ZkClient$4.call(ZkClient.java:240)
>at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.
> java:675)
>... 15 more
>
>
> [ERROR 2017-04-05 14:26:17,676
> org.apache.helix.messaging.handling.HelixTask:162] Exception after
> executing a message, msgId:
> 35e73c64-8fd3-4fb8-b0b8-419eacfa91a0org.I0Itec.zkclient.exception.
> ZkInterrupte
> dException: java.lang.InterruptedException
> org.I0Itec.zkclient.exception.ZkInterruptedException:
> java.lang.InterruptedException
>at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.
> java:687)
>at org.apache.helix.manager.zk.ZkClient.getChildren(ZkClient.
> java:212)
>at org.I0Itec.zkclient.ZkClient.deleteRecursive(ZkClient.java:505)
>at org.apache.helix.manager.zk.ZkBaseDataAccessor.remove(
> ZkBaseDataAccessor.java:537)
>at org.apache.helix.manager.zk.ZKHelixDataAccessor.removeProperty(
> ZKHelixDataAccessor.java:271)
>at org.apache.helix.messaging.handling.HelixTask.
> removeMessageFromZk(HelixTask.java:187)
>at org.apache.helix.messaging.handling.HelixTask.call(
> HelixTask.java:150)
>at org.apache.helix.messaging.handling.HelixTask.call(
> HelixTask.java:42)
>at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
>at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
>at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
>at java.lang.Object.wait(Native Method)
>at java.lang.Object.wait(Object.java:502)
>at org.apache.zookeeper.ClientCnxn.submitRequest(
> ClientCnxn.java:1344)
>at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1247)
>at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1277)
>at org.I0Itec.zkclient.ZkConnection.getChildren(
> ZkConnection.java:99)
>at 

Re: [GitHub] helix issue #81: Creating a separate threadpool to handle batchMessages

2017-04-03 Thread kishore g
+1 on adding an API to enable/disable this at a cluster level.
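
A rough sketch of what a cluster-level toggle could look like through the
generic config API (the key name BATCH_MESSAGE_MODE is hypothetical, pending
the actual API):

    // admin is a HelixAdmin; cluster name and flag value are placeholders
    HelixConfigScope scope =
        new HelixConfigScopeBuilder(ConfigScopeProperty.CLUSTER)
            .forCluster("MyCluster").build();
    admin.setConfig(scope, Collections.singletonMap("BATCH_MESSAGE_MODE", "true"));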

On Mon, Apr 3, 2017 at 12:18 PM, dasahcc  wrote:

> Github user dasahcc commented on the issue:
>
> https://github.com/apache/helix/pull/81
>
> Looks good to me! I will do the following things for corresponding
> change:
> 1. Will add a test for this.
> 2. Will provide an API in HelixManager for enabling batch message and
> support cluster/resource level batch message enabling.
>
>
>


Re: [ANNOUNCE] New committer: Junkai Xue

2017-04-03 Thread kishore g
Yay! Welcome to the club.

On Mon, Apr 3, 2017 at 10:45 AM, Lei Xia  wrote:

> The Project Management Committee (PMC) for Apache Helix has asked Junkai
> Xue
> to become a committer and we are pleased to announce that he has accepted.
>
> Being a committer enables easier contribution to the project since there is
> no need to go via the patch submission process. This should enable better
> productivity.
>
>
> Helix Team
>


Re: Helix with Hazelcast for in-memory data storage

2017-03-24 Thread kishore g
+ Helix-dev

Hi Shalakha,

We found that Hazelcast did not provide all the guarantees provided by
Zookeeper. As much as we would like to have a version of Helix that is
decoupled from Zookeeper, we haven't found an alternative yet.

thanks,
Kishore G

On Fri, Mar 24, 2017 at 2:08 AM, Shalakha Sidmul <shalakha.sid...@1eq.com>
wrote:

> Hello sir,
>
>
>
> We, at eQ-Technologic Pvt. Ltd., are working on a project which involves
> management of resources in a cluster environment.
>
> For example, connections to third party applications are resources for us.
>
> We also need an in-memory data store that is faster than Zookeeper.
>
> Our exploration led us to Hazelcast, which also provides a pretty robust
> task execution framework. (Task execution in a cluster environment is another
> major requirement we have.)
>
> We have tried and succeeded in using Helix for management of connections.
>
>
>
> In the following issues on JIRA, your comments suggest that you will soon
> have Helix backed by Hazelcast/Infinispan and remove the dependency on
> Zookeeper.
>
> 1.   https://issues.apache.org/jira/browse/HELIX-70
>
> 2.   https://issues.apache.org/jira/browse/SLING-2939
>
>
>
> (Both issues are about 4 years old.)
>
>
>
> My question is, has it been implemented yet?
> If not, is it under consideration?
>
> If yes, where can I find it?
>
> Is there a way to de-couple Helix and Zookeeper?
>
>
>
> Regards,
>
> Shalakha Sidmul
>
>
>


Re: dynamic zookeepr quorum

2017-03-18 Thread kishore g
Hi Shawn,

No, we haven't done that. The only way to achieve this as of now is to
restart the helix participant/controller/broker nodes after zookeeper is
reconfigured.

thanks,
Kishore G

On Sat, Mar 18, 2017 at 4:01 PM, Neutron Sharc <neutronsh...@gmail.com>
wrote:

> Hi all,
>
> Recent zookeeper 3.5 allows the zookeeper quorum to grow dynamically.
> Do helix zookeeper clients perform runtime reconfiguration to use the
> new zookeeper servers?
>
>
> -Shawn
>


Re: [ANNOUNCE] Apache Helix 0.6.7 Release

2017-02-01 Thread kishore g
Hi Lei,

I don't see this jar in the maven repo.
https://mvnrepository.com/artifact/org.apache.helix/helix-core

Can you please verify?

On Thu, Jan 26, 2017 at 10:10 PM, Lei Xia  wrote:

> The Apache Helix Team is pleased to announce the 10th release, 0.6.7, of
> the Apache Helix project.
>
> Apache Helix is a generic cluster management framework that makes it easy
> to build partitioned, fault tolerant, and scalable distributed systems.
>
> The full release notes are available here:
> http://helix.apache.org/0.6.7-docs/releasenotes/release-0.6.7.html
>
> You can declare a maven dependency to use it:
>
> <dependency>
>   <groupId>org.apache.helix</groupId>
>   <artifactId>helix-core</artifactId>
>   <version>0.6.7</version>
> </dependency>
>
> Or download the release sources:
> http://helix.apache.org/0.6.7-docs/download.cgi
>
> Additional info
>
> Website: http://helix.apache.org/
> Helix mailing lists: http://helix.apache.org/mail-lists.html
>
> We hope you will enjoy using the latest release of Apache Helix!
>
> Cheers,
> Apache Helix Team
>


Re: Double assignment , when participant is not able to establish connection with zookeeper quorum

2017-01-26 Thread kishore g
Can you please file a ticket and we can probably add this in the next
release.

Minor typo in my email.


   1. There is a pathological case where all zookeeper nodes get
   partitioned/crash/GC. In this case, we will make all participants
   disconnect and assume they don't own the partition. But when zookeepers
   come out of GC, they can continue as if nothing happened, i.e. they do not
   account for the time they were down. I can't think of a good solution for
   this scenario. Moreover, we *cannot* differentiate between a participant
   GC'ing/partitioned v/s ZK ensemble crash/partition/GC. This is typically
   avoided by ensuring ZK servers are deployed on different racks.


On Thu, Jan 26, 2017 at 10:34 AM, Subramanian Raghunathan <
subramanian.raghunat...@integral.com> wrote:

> Totally concur with your thought; a config-based approach would be better. It
> could be tuned based on the acceptable tolerance and consistency.
>
>
>
> Thanks,
>
> Subramanian.
>
>
>
>
>
>
>
>
>
>
> *From:* kishore g [mailto:g.kish...@gmail.com]
> *Sent:* Wednesday, January 25, 2017 7:12 PM
>
> *To:* u...@helix.apache.org
> *Cc:* d...@helix.incubator.apache.org
> *Subject:* Re: Double assignment , when participant is not able to
> establish connection with zookeeper quorum
>
>
>
> Helix can handle this and probably should. Couple of challenges here are
>
>1. How to generalize this across all use cases. This is a
>trade-off between availability and ensuring there is only one leader per
>partition.
>2. There is a pathological case where all zookeeper nodes get
>partitioned/crash/GC. In this case, we will make all participants
>disconnect and assume they don't own the partition. But when zookeepers
>come out of GC, it can continue as if nothing happened i.e it does not
>account for the time when its down. I can't think of a good solution for
>this scenario. Moreover, we can differentiate between a participant
>GC'ing/partitioned v/s ZK ensemble crash/partition/GC. This is typically
>avoided by ensuring ZK servers are deployed on different racks.
>
> Having said that, I think implementing a config based solution is worth
> it.
>
>
>
>
>
>
>
>
>
>
>
> On Wed, Jan 25, 2017 at 4:57 PM, Subramanian Raghunathan <
> subramanian.raghunat...@integral.com> wrote:
>
> Hi Kishore ,
>
>
>
> Thank you for the confirmation; yes, we had solved it along
> similar lines and it did work for us (listening on the disconnect event
> from ZK).
>
>
>
> From the double assignment point of view, is it expected
> behavior from Helix that the users handle this themselves? Are there any
> plans to fix this in a future release?
>
>
>
> Because what I had observed is that when the network is flapping, helix does
> handle it by calling reset() for the partition(s) from disconnect(), so then
> why not in this case?
>
>
>
> void 
> org.apache.helix.manager.zk.ZkHelixConnection.handleStateChanged(KeeperState
> state) throws Exception
>
>
>
> if (isFlapping()) {
>
> LOG.error("helix-connection: " + this + ", sessionId: " +
> _sessionId
>
> + " is flapping. diconnect it. " + " maxDisconnectThreshold: "
>
> + _maxDisconnectThreshold + " disconnects in " +
> _flappingTimeWindowMs + "ms");
>
> disconnect();
>
>   }
>
>
>
>
>
> Thanks & Regards,
>
> Subramanian.
>
>
>
>
>
>
>
>
>
>
> *From:* kishore g [mailto:g.kish...@gmail.com]
>

Re: Generate Helix release 0.6.7

2017-01-13 Thread kishore g
Not really, let's do the release. I just wanted to get the list of
bugs/enhancements.

On Fri, Jan 13, 2017 at 8:18 PM, Lei Xia <l...@linkedin.com.invalid> wrote:

> Around 10 bug fixes.  If you think that is not good enough for a release,
> we can wait for Junkai's delayed job execution to get reviewed, and I also
> have a proposed delayed rebalancer feature to be merged in.
>
>
> Thanks
> Lei
>
> On Fri, Jan 13, 2017 at 6:15 PM, kishore g <g.kish...@gmail.com> wrote:
>
> > What are the changes that will get into this release?
> >
> > On Fri, Jan 13, 2017 at 5:08 PM, Lei Xia <l...@apache.org> wrote:
> >
> > > Hi, Helix Devs
> > >
> > >I am going to work on releasing Helix 0.6.7 this week.  Please let
> me
> > > know if you have any questions, comments and concerns.  Thanks
> > >
> > >
> > > Best
> > > Lei
> > >
> >
>
>
>
> --
>
> *Lei Xia *Senior Software Engineer
> Data Infra/Nuage & Helix
> LinkedIn
>
> l...@linkedin.com
> www.linkedin.com/in/lxia1
>


Re: Delayed Workflow and Job Scheduling Design

2017-01-04 Thread kishore g
Will review it today.

On Tue, Jan 3, 2017 at 12:24 PM, Xue Junkai  wrote:

> Hi All,
>
> Here's the pull request of this design: https://github.com/apache/helix/pull/64
> Could anyone help me review it?
>
> Best,
>
> Junkai
>
> On Thu, Dec 8, 2016 at 6:09 PM, Xue Junkai  wrote:
>
>> Hi All,
>>
>> I have a short design for the Delayed Workflow and Job Scheduling. Since
>> I cannot access the wiki, I attached it to this email. Any feedback and
>> comments are highly appreciated!
>>
>> Best,
>>
>> Junkai
>> Overview
>>
>> Currently, Workflows and Jobs run by Helix require more flexibility.
>> For example, some jobs need to be started a certain amount of time after
>> other jobs have finished. Likewise, a Workflow may need to run at a specific
>> time, once some operations have been done.  To better support Workflow and Job
>> scheduling, Helix should provide a new feature that lets users set the delay
>> time or start time for specific Workflows and Jobs. Workflows and Jobs should
>> have an option that allows users to set the start time of a Workflow or Job,
>> or to set the delay time for a Workflow or Job once it is ready to
>> start. Then Workflows and Jobs can be scheduled at the correct time.
>> Proposed Design
>>
>> The whole design is split into two parts: generic rebalancer
>> scheduling and delay time calculation. Since Job scheduling can be done by
>> rerunning the WorkflowRebalancer, Workflow and Job delay scheduling can rely
>> on the same generic scheduling mechanism. Generic task scheduling takes the
>> responsibility of setting the running time for a specific Workflow object. Then
>> each object has its own start time calculation algorithm.
>>
>> Generic Task Scheduling
>>
>> For generic task scheduling, it is better to have a centralized
>> scheduler, RebalanceScheduler. It provides four public APIs:
>> public class RebalanceScheduler {
>>     public void scheduleRebalance(HelixManager manager, String resource,
>>                                   long startTime);
>>
>>     public long getRebalanceTime(String resource);
>>
>>     public long removeScheduledRebalance(String resource);
>>
>>     public static void invokeRebalance(HelixDataAccessor accessor,
>>                                        String resource);
>> }
>>
>>
>>
>> It offers scheduling a rebalance, getting the scheduled time of a
>> rebalance, and removing a scheduled rebalance. It also has an API that can
>> invoke a rebalance immediately. With this RebalanceScheduler, each resource
>> can be scheduled at a certain start time.
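
As a usage sketch against the API listed above (the resource name and delay
are placeholders, and manager is assumed to be a connected HelixManager):

    RebalanceScheduler scheduler = new RebalanceScheduler();
    // ask for a rebalance of "myResource" 30 seconds from now
    scheduler.scheduleRebalance(manager, "myResource",
        System.currentTimeMillis() + 30 * 1000L);
    // cancel the pending schedule if it is no longer needed
    scheduler.removeScheduledRebalance("myResource");
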
>> Delay Time Calculation
>>
>> Workflows have a property expiryTime, which is the delay time for
>> the Workflow. Users can set it by calling the setExpiry method in
>> WorkflowConfig. For Jobs, two methods will be provided in JobConfig:
>> setExecutionStart and setExecutionDelay. Through these APIs, users can set
>> the delay time and start time for Workflows and Jobs. Internally, Helix will
>> take the delay time or the start time, whichever is later.
>>
>> For the logic of computing Workflows and Jobs, Helix chooses
>> to do real-time computation. Users can set the delay time or start time in
>> JobConfig. When the job is ready to run, Helix will calculate the "start
>> time" for the delay as the current time plus the delay time, and then
>> compare it with the start time if the user set one in JobConfig.
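
Spelled out, the "whichever is later" rule amounts to something like this (the
getters are assumed counterparts of the setters named above):

    // effective schedule = the later of the configured start time
    // and (now + configured delay)
    long scheduledTime = Math.max(
        jobConfig.getExecutionStart(),
        System.currentTimeMillis() + jobConfig.getExecutionDelay());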
>>
>> Impact
>>
>>- From the user's perspective, users have to understand the difference
>>between delay time and start time.
>>- The WorkflowRebalancer will be called multiple times, which might
>>have performance implications.
>>
>>
>
>
> --
> Junkai Xue
>


Re: Merge 0.6.x and 0.7.x to new 0.8.x branch

2016-11-22 Thread kishore g
That makes sense. Let's get the good parts of 0.7 and come up with a good
API module.

On Tue, Nov 22, 2016 at 9:06 AM, Lei Xia <l...@linkedin.com.invalid> wrote:

> Hi, Kishore
>
>    I agree, we cannot guarantee backward compatibility with both 0.6 and
> 0.7.  We will make sure it is backward-compatible with 0.6, since I think most
> of our existing users are using this version; also, we should try to make
> sure the migration from 0.7 to 0.8 is minimized.  Would like to hear if you
> have any suggestions.
>
>
> Thanks
> Lei
>
> On Mon, Nov 21, 2016 at 10:28 PM, kishore g <g.kish...@gmail.com> wrote:
>
> > I like the overall idea. One concern is that it might be hard to maintain
> > backward compatibility with both 0.6 and 0.7.
> >
> > On Mon, Nov 21, 2016 at 10:17 PM, Lei Xia <l...@apache.org> wrote:
> >
> > > Hi, All
> > >
> > >Helix 0.7.x branch has been there for a while; however, given it has
> > > back-incompatible API changes, most of our existing customers are reluctant
> > > to move to 0.7.  This forces us to maintain both branches. In addition,
> > > most of the recent new features and important fixes (task framework
> > > improvements, new auto-rebalancer features) have only been pushed to
> > > 0.6.x, which makes the two branches diverge further apart.  It is
> > > especially hard to keep maintaining both branches now.
> > >
> > >I propose to fork a new branch (helix-0.8.x) from 0.6.x, with a new
> > > helix-api module containing all new API classes introduced in 0.7, while
> > > still keeping all old API classes (maybe marked as deprecated) in
> > > helix-core.  In this way, we could push existing customers to move to the
> > > 0.8.x release without forcing them to rework their code.
> > > Then we only need to maintain a single unified branch, and keep moving
> > > forwards to new API with all new developments happening in this branch.
> > >
> > >I have cloned a 0.8.x-test branch (
> > > https://github.com/apache/helix/tree/helix-0.8.x-test) from 0.6.x, and
> > we
> > > (me and Junkai) are going to cherry-pick changes from 0.7.x and apply
> > them
> > > to this branch and continue testing it until we reach a point that we
> can
> > > confidently release it :).
> > >
> > >Please let me know what you think about it, any suggestions or
> > comments
> > > are appreciated!  Thanks
> > >
> > >
> > > Best
> > > Lei
> > >
> >
>
>
>
> --
>
> *Lei Xia *Senior Software Engineer
> Data Infra/Nuage & Helix
> LinkedIn
>
> l...@linkedin.com
> www.linkedin.com/in/lxia1
>


Re: Merge 0.6.x and 0.7.x to new 0.8.x branch

2016-11-21 Thread kishore g
I like the overall idea. One concern is that it might be hard to maintain
backward compatibility with both 0.6 and 0.7.

On Mon, Nov 21, 2016 at 10:17 PM, Lei Xia  wrote:

> Hi, All
>
>    Helix 0.7.x branch has been there for a while; however, given it has
> back-incompatible API changes, most of our existing customers are reluctant
> to move to 0.7.  This forces us to maintain both branches. In addition,
> most of the recent new features and important fixes (task framework
> improvements, new auto-rebalancer features) have only been pushed to
> 0.6.x, which makes the two branches diverge further apart.  It is especially
> hard to keep maintaining both branches now.
>
>    I propose to fork a new branch (helix-0.8.x) from 0.6.x, with a new
> helix-api module containing all new API classes introduced in 0.7, while
> still keeping all old API classes (maybe marked as deprecated) in
> helix-core.  In this way, we could push existing customers to move to the
> 0.8.x release without forcing them to rework their code.
> Then we only need to maintain a single unified branch, and keep moving
> forwards to new API with all new developments happening in this branch.
>
>I have cloned a 0.8.x-test branch (
> https://github.com/apache/helix/tree/helix-0.8.x-test) from 0.6.x, and we
> (me and Junkai) are going to cherry-pick changes from 0.7.x and apply them
> to this branch and continue testing it until we reach a point that we can
> confidently release it :).
>
>Please let me know what you think about it, any suggestions or comments
> are appreciated!  Thanks
>
>
> Best
> Lei
>


Re: [VOTE] Apache Helix 0.6.6 Release

2016-11-06 Thread kishore g
+1.

On Nov 6, 2016 7:00 PM, "Olivier Lamy"  wrote:

> +1
>
> On 3 November 2016 at 09:19, Lei Xia  wrote:
>
> > Hi,
> >
> > This is to call for a vote on releasing the following candidate as
> > Apache Helix 0.6.6. This is the 9th release of Helix as an Apache
> > project, as well as the 5th release as a top-level Apache project.
> >
> > Apache Helix is a generic cluster management framework that makes it
> > easy to build partitioned and replicated, fault-tolerant and scalable
> > distributed systems.
> >
> > Release notes:
> > http://helix.apache.org/0.6.6-docs/releasenotes/release-0.6.6.html#
> >
> > Release artifacts:
> > https://repository.apache.org/content/repositories/orgapachehelix-1007
> >
> > Distribution:
> > * binaries:https://dist.apache.org/repos/dist/dev/helix/0.6.6/binaries/
> > * sources:https://dist.apache.org/repos/dist/dev/helix/0.6.6/src/
> >
> > The [VERSION] release tag:
> > https://git-wip-us.apache.org/repos/asf?p=helix.git;a=tag;h=refs/tags/helix-0.6.6
> >
> > KEYS file available here: https://dist.apache.org/repos/dist/dev/helix/KEYS
> >
> > Please vote on the release. The vote will be open for at least 72 hours.
> >
> > [+1] -- "YES, release"
> > [0] -- "No opinion"
> > [-1] -- "NO, do not release"
> >
> > Thanks,
> > The Apache Helix Team
> >
>
>
>
> --
> Olivier Lamy
> http://twitter.com/olamy | http://linkedin.com/in/olamy
>


Re: [GitHub] helix issue #58: helix-core: AutoRebalancer should include only numbered sta...

2016-11-03 Thread kishore g
Yes, thanks. Will review #59.
On Nov 3, 2016 9:30 AM, "mkscrg"  wrote:

> Github user mkscrg commented on the issue:
>
> https://github.com/apache/helix/pull/58
>
> @kishoreg see #59 for a version of the sorting strategy
>
>
>


Re: Generate Helix release 0.6.6

2016-10-25 Thread kishore g
Thanks Lei. I will test it tonight.

On Tue, Oct 25, 2016 at 2:07 PM, Lei Xia <xiax...@gmail.com> wrote:

> Finally, the release 0.6.6 is staged:
> https://dist.apache.org/repos/dist/dev/helix/helix-0.6.6/.  I will test
> it and then start a vote on it.  Please also give it a test if you have a
> chance.  Thanks
>
>
> Best
> Lei
>
> On Sun, Oct 2, 2016 at 2:53 PM, kishore g <g.kish...@gmail.com> wrote:
>
> > Thanks Lei. Let us know when you have the release. We can test it.
> >
> > On Wed, Sep 28, 2016 at 6:43 PM, Lei Xia <l...@apache.org> wrote:
> >
> > > Hi, Helix Developers
> > >
> > >I am going to work on releasing Helix 0.6.6 this week.  Please let
> me
> > > know if you have any questions, comments and concerns.  Thanks
> > >
> > >
> > >
> > > Best
> > > Lei
> > >
> >
>
>
>
> --
> Lei Xia
>


Re: [jira] [Updated] (HELIX-632) Upgrade zkclient to the latest version (0.9.0)

2016-07-18 Thread kishore g
No, it should be OK to upgrade. Please make sure the tests pass, and try it out
in your deployment.
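
If it helps, pinning the newer zkclient in a downstream pom would look
something like this (assuming the com.101tec coordinates the library is
published under):

    <dependency>
      <groupId>com.101tec</groupId>
      <artifactId>zkclient</artifactId>
      <version>0.9</version>
    </dependency>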
On Jul 18, 2016 11:48 PM, "Vinayak Borkar (JIRA)"  wrote:

>
>  [
> https://issues.apache.org/jira/browse/HELIX-632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Vinayak Borkar updated HELIX-632:
> -
> Description: The zkclient library seems to have moved from 0.1 (what
> is being currently used in helix) to 0.9 and has received a bunch of code
> fixes. Do you guys see any issues with upgrading to the latest version?
> (was: The zkclient library seems to have moved from 0.1 to 0.9 and has
> received a bunch of code fixes. Do you guys see any issues with upgrading
> to the latest version?)
>
> > Upgrade zkclient to the latest version (0.9.0)
> > --
> >
> > Key: HELIX-632
> > URL: https://issues.apache.org/jira/browse/HELIX-632
> > Project: Apache Helix
> >  Issue Type: Improvement
> >Reporter: Vinayak Borkar
> >Assignee: Vinayak Borkar
> >
> > The zkclient library seems to have moved from 0.1 (what is being
> currently used in helix) to 0.9 and has received a bunch of code fixes. Do
> you guys see any issues with upgrading to the latest version?
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>


Re: reuse replicas after a failed participant reconnects

2016-06-07 Thread kishore g
Helix does that automatically in SEMI AUTO mode. If you are implementing
your own rebalancer, take a look at the SemiAutoRebalancer code in Helix.
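
For reference, a minimal sketch of pinning a preference list in SEMI_AUTO mode
(cluster, resource, partition, and instance names are placeholders):

    IdealState is = admin.getResourceIdealState("MyCluster", "myDB");
    is.setRebalanceMode(IdealState.RebalanceMode.SEMI_AUTO);
    // the first live instance in the list is preferred for the top state
    is.setPreferenceList("myDB_0", Arrays.asList("node-1", "node-2", "node-3"));
    admin.setResourceIdealState("MyCluster", "myDB", is);

Because the list is fixed, once a failed node rejoins, Helix brings its
replicas back from OFFLINE on that node automatically.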



On Mon, Jun 6, 2016 at 10:00 PM, Neutron sharc 
wrote:

> Hi Lei,
>
> At the time a participant dies,  it owns some master/slave partitions.
> How can those partitions be converted to "OFFLINE",  given that the
> participant already died and no one is running the state machine for
> those partitions?  Thanks.
>
> -Neutron
>
> On Mon, Jun 6, 2016 at 11:04 AM, Lei Xia 
> wrote:
> > Hi, Neutron
> >
> >   Could you be more specific on your question?  In semi-auto mode, you
> (the
> > client) will specify a preference list for each partition (or Helix can
> > generate the list for you too by calling admin.rebalance()).  The list is
> > fixed, i.e. Helix will not automatically recalculate the list.
> >
> >   Given a preference list for a partitions, for example:
> >
> >   { p0: [node-1, node-2, node-3].  ...}
> >
> >   Helix will try to bring p0 to online state for node 1,2,3.  If node-1
> is
> > disconnected from zookeeper (crashed, for example), the state of p0 on
> > node-1 will be offline. Once node-1 comes back, Helix will bring p0 on
> > node-1 back from offline to online.
> >
> >   Not sure if this answers your question.
> >
> >
> > Thanks
> > Lei
> >
> >
> > On Fri, Jun 3, 2016 at 2:55 PM, Neutron sharc 
> > wrote:
> >
> >> Hi the team,
> >>
> >> semi-auto mode supports a feature that, after a failed participant
> >> comes back online, its owned replicas will be reused again (transit
> >> from offline to slave etc).  How can Helix recognize the replicas that
> >> are owned by a participant after it reconnects after a failure?We
> >> are trying to build such a feature in a user-defined rebalancer.   You
> >> input is highly appreciated.
> >>
> >>
> >> -neutron
> >>
> >
> >
> >
> > --
> >
> > *Lei Xia *Senior Software Engineer
> > Distributed Data Systems/Nuage & Helix
> > LinkedIn
> >
> > l...@linkedin.com
> > www.linkedin.com/in/lxia1
>


Re: error when reading znode: invalid stream header: 7B0A2020

2016-06-06 Thread kishore g
zkClient.setZkSerializer(new ZNRecordSerializer()), something like
that.
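
A fuller sketch of the read path (assuming the 0.7.x classes; ZNRecords are
stored as JSON, which is why the default SerializableSerializer fails with
"invalid stream header"):

    ZkClient zkClient = new ZkClient("localhost:2181");  // ZK address is a placeholder
    // read znode contents as ZNRecord JSON instead of Java serialization
    zkClient.setZkSerializer(new ZNRecordSerializer());
    ZkBaseDataAccessor<ZNRecord> accessor = new ZkBaseDataAccessor<ZNRecord>(zkClient);
    ZNRecord record = accessor.get(path, null, AccessOption.PERSISTENT);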

On Mon, Jun 6, 2016 at 6:00 PM, Neutron sharc 
wrote:

> Hi the team,
>
> I want to read this znode to get partitions assigned to a dead participant:
> "/INSTANCES//CURRENTSTATES/ id>/"
>
> I use this code snippet to read:
>
> accessor = new ZkBaseDataAccessor<ZNRecord>(zkClient);
> String path = x;
> ZNRecord record = accessor.get(path, null, AccessOption.PERSISTENT);
>
> Immediately at get() I got the following exception about invalid
> stream header.  What's the "right" way to access that znode?  Thanks!
>
> [ERROR 2016-06-06 16:43:22,214
> com.hcd.hcdadmin.InstancePropertyAccessor:115] failed to read znode
> /shawn1/INSTANCES/node5_pp1/CURRENTSTATES/1029c4a143b/Pool0
> org.I0Itec.zkclient.exception.ZkMarshallingError:
> java.io.StreamCorruptedException: invalid stream header: 7B0A2020
> at
> org.I0Itec.zkclient.serialize.SerializableSerializer.deserialize(SerializableSerializer.java:37)
> at
> org.apache.helix.manager.zk.BasicZkSerializer.deserialize(BasicZkSerializer.java:41)
> at org.apache.helix.manager.zk.ZkClient.deserialize(ZkClient.java:231)
> at org.apache.helix.manager.zk.ZkClient.readData(ZkClient.java:247)
> at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761)
> at
> org.apache.helix.manager.zk.ZkBaseDataAccessor.get(ZkBaseDataAccessor.java:322)
> at
> com.hcd.hcdadmin.InstancePropertyAccessor.getReplicas(InstancePropertyAccessor.java:103)
> at
> com.hcd.hcdadmin.InstancePropertyAccessor.main(InstancePropertyAccessor.java:139)
> Caused by: java.io.StreamCorruptedException: invalid stream header:
> 7B0A2020
> at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:808)
> at java.io.ObjectInputStream.(ObjectInputStream.java:301)
> at
> org.I0Itec.zkclient.serialize.SerializableSerializer.deserialize(SerializableSerializer.java:31)
> ... 7 more
>


Re: [jira] [Created] (HELIX-630) Unable to start simple Participant. Threading issues?

2016-05-27 Thread kishore g
Can you provide the participant code snippet? My guess is you are
registering the statemodelfactory after connecting to the cluster. Here is
the code from the quickstart. Can you confirm you have the right order?

// create the state model factory and register it with the engine BEFORE
// connecting, so transition messages can be handled as soon as the
// participant joins the cluster
MasterSlaveStateModelFactory stateModelFactory =
    new MasterSlaveStateModelFactory(instanceName);
StateMachineEngine stateMach = manager.getStateMachineEngine();
stateMach.registerStateModelFactory(STATE_MODEL_NAME, stateModelFactory);
manager.connect();
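
(If the factory is registered only after connect(), the participant can
receive its first transition messages while no factory is in place, which
would match the intermittent startup failures described below.)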


On Fri, May 27, 2016 at 12:59 AM, stuart meikle (JIRA) 
wrote:

> stuart meikle created HELIX-630:
> ---
>
>  Summary: Unable to start simple Participant. Threading issues?
>  Key: HELIX-630
>  URL: https://issues.apache.org/jira/browse/HELIX-630
>  Project: Apache Helix
>   Issue Type: Bug
>   Components: helix-core
> Affects Versions: 0.7.1, 0.6.5
>  Environment: Windows 7 64 bit. Running from Intellij.
> Reporter: stuart meikle
>
>
> I have a controller app and a very simple participant app, both derived from
> the Quickstart.java. I start the controller app in intellij and then up to
> 3 participant apps. Sometimes the participant apps fail to start. I'll
> attach the logs below. Error appears to occur in manager.connect, and
> appears intermittently. I noticed you had an earlier bug back in 2014 with
> similar symptoms.
>
> --
>
> D:\dev\bin\sun\jdk\1.8.0_25-64bit\bin\java -Didea.launcher.port=7578
> -Didea.launcher.bin.path=D:\dev\bin\IntelliJIDEA14.1.5\bin
> -Dfile.encoding=windows-1252 -classpath
> 

Re: calling ZKHelixLock from state machine transition

2016-05-25 Thread kishore g
This feature is available on 0.7.x only.

On Wed, May 25, 2016 at 9:20 PM, Neutron sharc <neutronsh...@gmail.com>
wrote:

> Thanks Kishore.  I tested this patch on tag 0.7.1 because we are using
> 0.7.1.  The code is same in master branch though.
>
> Where do you want this patch to land?
>
> On Wed, May 25, 2016 at 8:35 PM, kishore g <g.kish...@gmail.com> wrote:
> > sorry I missed this. It looked good to me. Will merge it.
> >
> > On Wed, May 25, 2016 at 7:41 PM, Neutron sharc <neutronsh...@gmail.com>
> > wrote:
> >
> >> Hi Kishore, Kanak,   any updates?
> >>
> >> On Thu, May 19, 2016 at 4:13 PM, kishore g <g.kish...@gmail.com> wrote:
> >> > Thanks Shawn. Will review it tonight. Kanak, It will be great if you
> can
> >> > take a look at it as well.
> >> >
> >> > On Thu, May 19, 2016 at 3:45 PM, Neutron sharc <
> neutronsh...@gmail.com>
> >> > wrote:
> >> >
> >> >> Hi Helix team,
> >> >>
> >> >> I uploaded a PR to fix this bug:
> >> https://github.com/apache/helix/pull/44
> >> >>
> >> >> Thanks.
> >> >>
> >> >> On Wed, May 18, 2016 at 11:01 PM, Neutron sharc <
> neutronsh...@gmail.com
> >> >
> >> >> wrote:
> >> >> > Hi Kanak,
> >> >> >
> >> >> > The same problem with zk helix lock re-appears.  I found some clues
> >> >> > about the potential bug.  This potential bug causes all threads
> >> >> > competing for a same zk helix lock to be blocked.
> >> >> >
> >> >> > In my test there are two java threads blocked when trying to grab
> zk
> >> >> > lock (thread 15 and thread 19)
> >> >> >
> >> >> > Here are related logs before the threads are blocked (inlined with
> my
> >> >> comments)
> >> >> >
> >> >> > [INFO  2016-05-18 22:19:54,057 com.hcd.hcdadmin.M1Rebalancer:70]
> >> >> > rebalancer thread 15 before zklock
> >> >> >   =>  T15 enters
> >> >> >
> >> >> > [DEBUG 2016-05-18 22:19:54,069
> org.apache.helix.lock.zk.WriteLock:193]
> >> >> > Created id:
> >> /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-78
> >> >> >=>  T15  creates znode,  T15 is the smallest so it owns lock
> >> >> >
> >> >> > [INFO  2016-05-18 22:19:54,071 com.hcd.hcdadmin.M1Rebalancer:70]
> >> >> > rebalancer thread 19 before zklock
> >> >> >=> T19 enters
> >> >> >
> >> >> > [INFO  2016-05-18 22:19:54,071 com.hcd.hcdadmin.M1Rebalancer:72]
> >> >> > rebalancer thread 15 start computing for controller host1_admin
> >> >> >=> T15 performs its work
> >> >> >
> >> >> > [DEBUG 2016-05-18 22:19:54,080
> org.apache.helix.lock.zk.WriteLock:193]
> >> >> > Created id:
> >> /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911662-79
> >> >> > =>  T19 creates its znode
> >> >> >
> >> >> > [DEBUG 2016-05-18 22:19:54,081
> org.apache.helix.lock.zk.WriteLock:233]
> >> >> > watching less than me node:
> >> >> > /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-78
> >> >> > =>  T19 found its predecessor to wait for, which is T15
> >> >> >
> >> >> > [WARN  2016-05-18 22:19:54,084
> org.apache.helix.lock.zk.WriteLock:239]
> >> >> > Could not find the stats for less than me:
> >> >> > /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-78
> >> >> > =>  T19 calls zookeeper.exist() to register a watcher on T15,
> but
> >> >> > T15 has called unlock() to delete the znode at the same moment.  So
> >> >> > T19 continues to check while(id==null) loop.  Because T19 id is not
> >> >> > null now,  T19's LockZooKeeperOperation.execute() returns false.
> T19
> >> >> > will block at wait(), hoping somebody else will notify it.  But
> since
> >> >> > T19 is currently the smallest so nobody else can grab the lock and
> >> >> > wait up T19;  T19 blocks, and every subsequent caller also blocks.
> >> >> >
> >> >> > The code that leads to the problem is here:
> >> >> &g

Re: calling ZKHelixLock from state machine transition

2016-05-25 Thread kishore g
Sorry I missed this. It looked good to me. Will merge it.

On Wed, May 25, 2016 at 7:41 PM, Neutron sharc <neutronsh...@gmail.com>
wrote:

> Hi Kishore, Kanak,   any updates?
>
> On Thu, May 19, 2016 at 4:13 PM, kishore g <g.kish...@gmail.com> wrote:
> > Thanks Shawn. Will review it tonight. Kanak, It will be great if you can
> > take a look at it as well.
> >
> > On Thu, May 19, 2016 at 3:45 PM, Neutron sharc <neutronsh...@gmail.com>
> > wrote:
> >
> >> Hi Helix team,
> >>
> >> I uploaded a PR to fix this bug:
> https://github.com/apache/helix/pull/44
> >>
> >> Thanks.
> >>
> >> On Wed, May 18, 2016 at 11:01 PM, Neutron sharc <neutronsh...@gmail.com
> >
> >> wrote:
> >> > Hi Kanak,
> >> >
> >> > The same problem with zk helix lock re-appears.  I found some clues
> >> > about the potential bug.  This potential bug causes all threads
> >> > competing for a same zk helix lock to be blocked.
> >> >
> >> > In my test there are two java threads blocked when trying to grab zk
> >> > lock (thread 15 and thread 19)
> >> >
> >> > Here are related logs before the threads are blocked (inlined with my
> >> comments)
> >> >
> >> > [INFO  2016-05-18 22:19:54,057 com.hcd.hcdadmin.M1Rebalancer:70]
> >> > rebalancer thread 15 before zklock
> >> >   =>  T15 enters
> >> >
> >> > [DEBUG 2016-05-18 22:19:54,069 org.apache.helix.lock.zk.WriteLock:193]
> >> > Created id:
> /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-78
> >> >=>  T15  creates znode,  T15 is the smallest so it owns lock
> >> >
> >> > [INFO  2016-05-18 22:19:54,071 com.hcd.hcdadmin.M1Rebalancer:70]
> >> > rebalancer thread 19 before zklock
> >> >=> T19 enters
> >> >
> >> > [INFO  2016-05-18 22:19:54,071 com.hcd.hcdadmin.M1Rebalancer:72]
> >> > rebalancer thread 15 start computing for controller host1_admin
> >> >=> T15 performs its work
> >> >
> >> > [DEBUG 2016-05-18 22:19:54,080 org.apache.helix.lock.zk.WriteLock:193]
> >> > Created id:
> /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911662-79
> >> > =>  T19 creates its znode
> >> >
> >> > [DEBUG 2016-05-18 22:19:54,081 org.apache.helix.lock.zk.WriteLock:233]
> >> > watching less than me node:
> >> > /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-78
> >> > =>  T19 found its predecessor to wait for, which is T15
> >> >
> >> > [WARN  2016-05-18 22:19:54,084 org.apache.helix.lock.zk.WriteLock:239]
> >> > Could not find the stats for less than me:
> >> > /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-78
> >> > =>  T19 calls zookeeper.exist() to register a watcher on T15, but
> >> > T15 has called unlock() to delete the znode at the same moment.  So
> >> > T19 continues to check while(id==null) loop.  Because T19 id is not
> >> > null now,  T19's LockZooKeeperOperation.execute() returns false. T19
> >> > will block at wait(), hoping somebody else will notify it.  But since
> >> > T19 is currently the smallest so nobody else can grab the lock and
> >> > wait up T19;  T19 blocks, and every subsequent caller also blocks.
> >> >
> >> > The code that leads to the problem is here:
> >> >
> >>
> https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/WriteLock.java#L238
> >> >
> >> > One possible fix is to just set id to null at line 240 and let while()
> >> > loop to retry.
> >> >
> >>
> https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/WriteLock.java#L240
> >> >
> >> >
> >> > [INFO  2016-05-18 22:19:54,092 com.hcd.hcdadmin.M1Rebalancer:125]
> >> > rebalancer thread 15 released zklock and returned
> >> >   =>  T15 has deleted znode a short while ago and returns from this
> >> method
> >> >
> >> >
> >> > [INFO  2016-05-18 22:19:54,179 com.hcd.hcdadmin.M1Rebalancer:70]
> >> > rebalancer thread 15 before zklock
> >> >   =>  T15 calls this method again,
> >> >
> >> > [DEBUG 2016-05-18 22:19:54,191 org.apache.helix.lock.zk.WriteLock:193]
> >> > Created id:
> /shawn1/LOCKS/RES

Re: calling ZKHelixLock from state machine transition

2016-05-19 Thread kishore g
Thanks Shawn. Will review it tonight. Kanak, it will be great if you can
take a look at it as well.

On Thu, May 19, 2016 at 3:45 PM, Neutron sharc 
wrote:

> Hi Helix team,
>
> I uploaded a PR to fix this bug:   https://github.com/apache/helix/pull/44
>
> Thanks.
>
> On Wed, May 18, 2016 at 11:01 PM, Neutron sharc 
> wrote:
> > Hi Kanak,
> >
> > The same problem with zk helix lock re-appears.  I found some clues
> > about the potential bug.  This potential bug causes all threads
> > competing for a same zk helix lock to be blocked.
> >
> > In my test there are two java threads blocked when trying to grab zk
> > lock (thread 15 and thread 19)
> >
> > Here are related logs before the threads are blocked (inlined with my
> comments)
> >
> > [INFO  2016-05-18 22:19:54,057 com.hcd.hcdadmin.M1Rebalancer:70]
> > rebalancer thread 15 before zklock
> >   =>  T15 enters
> >
> > [DEBUG 2016-05-18 22:19:54,069 org.apache.helix.lock.zk.WriteLock:193]
> > Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-78
> >=>  T15  creates znode,  T15 is the smallest so it owns lock
> >
> > [INFO  2016-05-18 22:19:54,071 com.hcd.hcdadmin.M1Rebalancer:70]
> > rebalancer thread 19 before zklock
> >=> T19 enters
> >
> > [INFO  2016-05-18 22:19:54,071 com.hcd.hcdadmin.M1Rebalancer:72]
> > rebalancer thread 15 start computing for controller host1_admin
> >=> T15 performs its work
> >
> > [DEBUG 2016-05-18 22:19:54,080 org.apache.helix.lock.zk.WriteLock:193]
> > Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911662-79
> > =>  T19 creates its znode
> >
> > [DEBUG 2016-05-18 22:19:54,081 org.apache.helix.lock.zk.WriteLock:233]
> > watching less than me node:
> > /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-78
> > =>  T19 found its predecessor to wait for, which is T15
> >
> > [WARN  2016-05-18 22:19:54,084 org.apache.helix.lock.zk.WriteLock:239]
> > Could not find the stats for less than me:
> > /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-78
> > =>  T19 calls zookeeper.exist() to register a watcher on T15, but
> > T15 has called unlock() to delete the znode at the same moment.  So
> > T19 continues to check while(id==null) loop.  Because T19 id is not
> > null now,  T19's LockZooKeeperOperation.execute() returns false. T19
> > will block at wait(), hoping somebody else will notify it.  But since
> > T19 is currently the smallest so nobody else can grab the lock and
> > wait up T19;  T19 blocks, and every subsequent caller also blocks.
> >
> > The code that leads to the problem is here:
> >
> https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/WriteLock.java#L238
> >
> > One possible fix is to just set id to null at line 240 and let while()
> > loop to retry.
> >
> https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/WriteLock.java#L240
> >
> >
> > [INFO  2016-05-18 22:19:54,092 com.hcd.hcdadmin.M1Rebalancer:125]
> > rebalancer thread 15 released zklock and returned
> >   =>  T15 has deleted znode a short while ago and returns from this
> method
> >
> >
> > [INFO  2016-05-18 22:19:54,179 com.hcd.hcdadmin.M1Rebalancer:70]
> > rebalancer thread 15 before zklock
> >   =>  T15 calls this method again,
> >
> > [DEBUG 2016-05-18 22:19:54,191 org.apache.helix.lock.zk.WriteLock:193]
> > Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911663-80
> >   => T15 creates znode
> >
> > [DEBUG 2016-05-18 22:19:54,193 org.apache.helix.lock.zk.WriteLock:233]
> > watching less than me node:
> > /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911662-79
> >
> >   => T15 found T19 to be smallest so it waits for T19.  Nobody will
> > wake up T19,  so T15 is also blocked.
> >
> >
> >
> >
> > Any comments appreciated. Thanks.
> >
> >
> > -Neutronsharc
> >
> >
> >
> > On Sat, May 14, 2016 at 5:20 PM, Neutron sharc 
> wrote:
> >> We increased the max connections allowed per client at zk server side.
> >> The problem is gone now.
> >>
> >> On Tue, May 10, 2016 at 2:50 PM, Neutron sharc 
> wrote:
> >>> Hi Kanak,  thanks for reply.
> >>>
> >>> The problem is gone if we set a constraint of 1 on "STATE_TRANSITION"
> >>> for the resource.  If we allow multiple state transitions to be
> >>> executed in the resource,  then this zklock problem occurs.
> >>>
> >>> btw,  we run multiple participants in a same jvm in our test.  In
> >>> other words, there are multiple java threads in a same jvm competing
> >>> for zklock.
> >>>
> >>> We haven't profiled the ZKHelixLock._listener.lockAcquired() since we
> >>> bypassed this problem using constraint.  Will revisit it later.
> >>>
> >>>
> >>>
> >>>
> >>> On Mon, May 9, 2016 at 8:28 PM, Kanak Biscuitwala 
> wrote:
>  Hi,
> 
>  ZkHelixLock is a thin wrapper around the ZooKeeper WriteLock recipe
> (which was last changed over 5 

Re: want to get dead replicas in USER_DEFINED rebalance callback

2016-05-02 Thread kishore g
Can you paste the initial IS that you compute? I am mainly interested in
the simple fields.

On Mon, May 2, 2016 at 12:38 PM, Neutron sharc <neutronsh...@gmail.com>
wrote:

> Thanks Kishore for your reply!
>
> What I see is that a resource's ideal state becomes empty after the
> external view converges with the assignment.
>
> When I create a resource I compute an initial IS, and attach a
> USER_DEFINED rebalancer.  After Helix stabilizes a resource,  its
> "IDEALSTATES" mapFields is wiped off in zookeeper.  When the next
> round of rebalancing starts,  computeResourceMapping() will always get
> an empty idealState.
>
>
>
>
> On Fri, Apr 29, 2016 at 1:16 PM, kishore g <g.kish...@gmail.com> wrote:
> > Hi,
> >
> > Current state will not show dead replicas. You need to use previous
> > idealstate to derive that info. The logic will be something like this
> >
> > computeResource(...) {
> >   List<String> instances = previousIdealState.getInstancesForPartition(P0);
> >   foreach (instance in instances) {
> >     if (!liveInstances.contains(instance)) {
> >       // NEED TO ASSIGN ANOTHER LIVE INSTANCE FOR THIS PARTITION
> >     }
> >   }
> > }
> >
> > This allows your logic to be idempotent and not depend on incremental
> > changes.
> >
> > thanks,
> > Kishore G
> >
> > On Thu, Apr 28, 2016 at 4:27 PM, Neutron sharc <neutronsh...@gmail.com>
> > wrote:
> >
> >> Hi team,
> >>
> >> in USER_DEFINED rebalance mode, the callback computeResourceMapping()
> >> accepts a “currentState”.  Does this variable include replicas on a
> >> dead participant ?
> >>
> >> For example, my resource has a partition P1 master replica on
> >> participant node1, a slave replica on participant node2.  When node1
> >> dies,  in callback computeResourceMapping() I retrieve P1’s replicas:
> >>
> >> Map<ParticipantId, State> replicas =
> >> currentState.getCurrentStateMap(resourceId, partitionId);
> >>
> >>
> >> Here the “replicas” includes only node2,  there is no entry for node1.
> >>
> >> However, I want to know all replicas including dead ones, so that I
> >> can know that a master replica is gone and I should failover to an
> >> existing slave, instead of starting a new master.
> >>
> >>
> >> Appreciate any comments!
> >>
> >>
> >> -Neutron
> >>
>


Re: want to get dead replicas in USER_DEFINED rebalance callback

2016-04-29 Thread kishore g
Hi,

Current state will not show dead replicas. You need to use previous
idealstate to derive that info. The logic will be something like this

computeResource(...) {
  List<String> instances = previousIdealState.getInstancesForPartition(P0);
  foreach (instance in instances) {
    if (!liveInstances.contains(instance)) {
      // NEED TO ASSIGN ANOTHER LIVE INSTANCE FOR THIS PARTITION
    }
  }
}

This allows your logic to be idempotent and not depend on incremental
changes.
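
A Java rendering of the pseudocode above (getInstancesForPartition is
pseudocode; the closest real IdealState accessor is getPreferenceList):

    for (String partition : previousIdealState.getPartitionSet()) {
      List<String> instances = previousIdealState.getPreferenceList(partition);
      for (String instance : instances) {
        if (!liveInstances.contains(instance)) {
          // this replica was on a dead node: assign another live instance here
        }
      }
    }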

thanks,
Kishore G

On Thu, Apr 28, 2016 at 4:27 PM, Neutron sharc <neutronsh...@gmail.com>
wrote:

> Hi team,
>
> in USER_DEFINED rebalance mode, the callback computeResourceMapping()
> accepts a “currentState”.  Does this variable include replicas on a
> dead participant ?
>
> For example, my resource has a partition P1 master replica on
> participant node1, a slave replica on participant node2.  When node1
> dies,  in callback computeResourceMapping() I retrieve P1’s replicas:
>
> Map<ParticipantId, State> replicas =
> currentState.getCurrentStateMap(resourceId, partitionId);
>
>
> Here the “replicas” includes only node2,  there is no entry for node1.
>
> However, I want to know all replicas including dead ones, so that I
> can know that a master replica is gone and I should failover to an
> existing slave, instead of starting a new master.
>
>
> Appreciate any comments!
>
>
> -Neutron
>


Re: error from HelixStateTransitionHandler

2016-04-25 Thread kishore g
afd816, errorMsg:
>
> org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException:
> Current state of stateModel does not match the fromState in Message,
> Current State:MASTER, message expected:SLAVE, partition: Pool0_3,
> from: host1_admin, to: host1_disk3
>
> [ERROR 2016-04-25 20:13:48,560
> org.apache.helix.messaging.handling.HelixTask:143] Message execution
> failed. msgId: 35f504ce-b332-4fd4-a893-1a400047621e, errorMsg:
>
> org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException:
> Current state of stateModel does not match the fromState in Message,
> Current State:MASTER, message expected:SLAVE, partition: Pool0_0,
> from: host1_admin, to: host1_disk1
>
> [ERROR 2016-04-25 20:13:48,573
> org.apache.helix.messaging.handling.HelixStateTransitionHandler:385]
> Skip internal error. errCode: ERROR, errMsg: Current state of
> stateModel does not match the fromState in Message, Current
> State:MASTER, message expected:SLAVE, partition: Pool0_2, from:
> host1_admin, to: host1_disk2
>
> [ERROR 2016-04-25 20:13:48,576
> org.apache.helix.messaging.handling.HelixStateTransitionHandler:385]
> Skip internal error. errCode: ERROR, errMsg: Current state of
> stateModel does not match the fromState in Message, Current
> State:MASTER, message expected:SLAVE, partition: Pool0_0, from:
> host1_admin, to: host1_disk1
>
> [ERROR 2016-04-25 20:13:48,577
> org.apache.helix.messaging.handling.HelixStateTransitionHandler:385]
> Skip internal error. errCode: ERROR, errMsg: Current state of
> stateModel does not match the fromState in Message, Current
> State:MASTER, message expected:SLAVE, partition: Pool0_3, from:
> host1_admin, to: host1_disk3
>
> [ERROR 2016-04-25 20:13:48,580
> org.apache.helix.messaging.handling.HelixTask:143] Message execution
> failed. msgId: 7573c64d-e055-4eb1-b845-e22c82542437, errorMsg:
>
> org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException:
> Current state of stateModel does not match the fromState in Message,
> Current State:MASTER, message expected:SLAVE, partition: Pool0_1,
> from: host1_admin, to: host1_disk1
>
> [ERROR 2016-04-25 20:13:48,594
> org.apache.helix.messaging.handling.HelixStateTransitionHandler:385]
> Skip internal error. errCode: ERROR, errMsg: Current state of
> stateModel does not match the fromState in Message, Current
> State:MASTER, message expected:SLAVE, partition: Pool0_1, from:
> host1_admin, to: host1_disk1
>
>
> [INFO  2016-04-25 20:13:51,517 com.hcd.hcdadmin.HcdExternalView:68]
>
> Cluster TryHelixCluster1, show external view of LUNs
>
>
> host1_disk1 host1_disk2 host1_disk3
>
> Pool0_0 M S S
>
> Pool0_1 M S S
>
> Pool0_2 S M S
>
> Pool0_3 S S M
>
> On Sat, Apr 23, 2016 at 11:35 PM, kishore g <g.kish...@gmail.com> wrote:
> > How many resources do you have. Partition names must be unique across the
> > entire cluster. Can you also paste the idealstate for the resources
> >
> > On Sat, Apr 23, 2016 at 10:39 PM, Neutron sharc <neutronsh...@gmail.com>
> > wrote:
> >
> >> Hi Helix team,
> >>
> >> I keep seeing this error from HelixStateTransitionHandler when the
> >> state machine is running.  It seems a partition's actual state doesn't
> >> match with the state marked in controller message.  What are the usual
> >> causes?  I'm using helix
> >> 0.7.1.  Here is my maven pom.xml:
> >>
> >> <dependency>
> >>   <groupId>org.apache.helix</groupId>
> >>   <artifactId>helix-core</artifactId>
> >>   <version>0.7.1</version>
> >> </dependency>
> >>
> >>
> >>
> >> [ERROR 2016-04-21 19:51:09,943
> >> org.apache.helix.messaging.handling.HelixStateTransitionHandler:118]
> >> Current state of stateModel does not match the fromState in Message,
> >> Current State:MASTER, message expected:SLAVE, partition:
> >> host1_Pool0_0, from: host1_admin, to: host1_disk1
> >>
> >> [ERROR 2016-04-21 19:51:09,959
> >> org.apache.helix.messaging.handling.HelixTask:143] Message execution
> >> failed. msgId: 26c891b8-dd81-4e0c-8b99-6c62b856db5f, errorMsg:
> >>
> >>
> org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException:
> >> Current state of stateModel does not match the fromState in Message,
> >> Current State:MASTER, message expected:SLAVE, partition:
> >> host1_Pool0_0, from: host1_admin, to: host1_disk1
> >>
> >> [ERROR 2016-04-21 19:51:09,975
> >> org.apache.helix.messaging.handling.HelixStateTransitionHandler:385]
> >> Skip internal error. errCode: ERROR, errMsg: Current state of
> >> stateModel does not match the fromState in Message, Current
> >> State:MASTER, message expected:SLAVE, partition: host1_Pool0_0, from:
> >> host1_admin, to: host1_disk1
> >>
> >>
> >> Another problem I see is:  my ideal state defines that a partition has 3
> >> replicas, but the resource's external view sometimes shows that a
> >> partition has 4 replicas.
> >>
> >>
> >> Any hints?  Thanks!
> >>
>


Re: error from HelixStateTransitionHandler

2016-04-24 Thread kishore g
How many resources do you have. Partition names must be unique across the
entire cluster. Can you also paste the idealstate for the resources

On Sat, Apr 23, 2016 at 10:39 PM, Neutron sharc <neutronsh...@gmail.com>
wrote:

> Hi Helix team,
>
> I keep seeing this error from HelixStateTransitionHandler when the
> state machine is running.  It seems a partition's actual state doesn't
> match with the state marked in controller message.  What are the usual
> causes?  I'm using helix
> 0.7.1.  Here is my maven pom.xml:
>
> <dependency>
>   <groupId>org.apache.helix</groupId>
>   <artifactId>helix-core</artifactId>
>   <version>0.7.1</version>
> </dependency>
>
>
>
> [ERROR 2016-04-21 19:51:09,943
> org.apache.helix.messaging.handling.HelixStateTransitionHandler:118]
> Current state of stateModel does not match the fromState in Message,
> Current State:MASTER, message expected:SLAVE, partition:
> host1_Pool0_0, from: host1_admin, to: host1_disk1
>
> [ERROR 2016-04-21 19:51:09,959
> org.apache.helix.messaging.handling.HelixTask:143] Message execution
> failed. msgId: 26c891b8-dd81-4e0c-8b99-6c62b856db5f, errorMsg:
>
> org.apache.helix.messaging.handling.HelixStateTransitionHandler$HelixStateMismatchException:
> Current state of stateModel does not match the fromState in Message,
> Current State:MASTER, message expected:SLAVE, partition:
> host1_Pool0_0, from: host1_admin, to: host1_disk1
>
> [ERROR 2016-04-21 19:51:09,975
> org.apache.helix.messaging.handling.HelixStateTransitionHandler:385]
> Skip internal error. errCode: ERROR, errMsg: Current state of
> stateModel does not match the fromState in Message, Current
> State:MASTER, message expected:SLAVE, partition: host1_Pool0_0, from:
> host1_admin, to: host1_disk1
>
>
> Another problem I see is:  my ideal state defines that a partition has 3
> replicas, but the resource's external view sometimes shows that a
> partition has 4 replicas.
>
>
> Any hints?  Thanks!
>


Re: [jira] [Created] (HELIX-628) ZKHelixAdmin silently fails to fully cleanup the ZK structure

2016-02-17 Thread kishore g
This can happen when all instances (participants/controllers) were not
shut down in the previous run. From your logs, it looks like there was a
CONTROLLER that was still running when the new yarn application was started.



On Tue, Feb 16, 2016 at 11:08 PM, Joel Baranick (JIRA) <j...@apache.org>
wrote:

> Joel Baranick created HELIX-628:
> ---
>
>  Summary: ZKHelixAdmin silently fails to fully cleanup the ZK
> structure
>  Key: HELIX-628
>  URL: https://issues.apache.org/jira/browse/HELIX-628
>  Project: Apache Helix
>   Issue Type: Bug
> Affects Versions: 0.6.x
> Reporter: Joel Baranick
>
>
> For some reason, the ZKHelixAdmin silently fails to fully cleanup the ZK
> structure corresponding to the Helix cluster instance even if it is
> configured to do the cleanup before everything else starts up. This causes
> the Yarn application to fail to start.
>
> {code:title=Shutdown|borderStyle=solid}
> 2016-02-17 06:25:01 UTC INFO  [Thread-4]
> gobblin.yarn.GobblinYarnAppLauncher  301 - Stopping the
> GobblinYarnAppLauncher
> 2016-02-17 06:25:01 UTC INFO  [Thread-4]
> org.apache.helix.messaging.DefaultMessagingService  84 - Send 1 messages
> with criteria instanceName=%resourceName=%partitionName=%partitionState=%
> 2016-02-17 06:25:02 UTC INFO  [LogCopier STOPPING]
> gobblin.util.ExecutorsUtils  125 - Attempting to shutdown ExecutorService:
> java.util.concurrent.ScheduledThreadPoolExecutor@73240b61[Shutting down,
> pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1862]
> 2016-02-17 06:25:02 UTC INFO  [LogCopier STOPPING]
> gobblin.util.ExecutorsUtils  144 - Successfully shutdown ExecutorService:
> java.util.concurrent.ScheduledThreadPoolExecutor@73240b61[Terminated,
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1862]
> 2016-02-17 06:25:02 UTC INFO  [JobExecutionInfoServer STOPPING]
> gobblin.rest.JobExecutionInfoServer  94 - Stopping the job execution
> information server
> Shutting down
> 2016-02-17 06:25:02 UTC INFO  [AdminWebServer STOPPING]
> org.eclipse.jetty.server.AbstractConnector  306 - Stopped
> ServerConnector@35e0c350{HTTP/1.1}{localhost:8280}
> 2016-02-17 06:25:02 UTC INFO  [Thread-4] gobblin.util.ExecutorsUtils  125
> - Attempting to shutdown ExecutorService:
> java.util.concurrent.Executors$DelegatedScheduledExecutorService@185aaf1f
> 2016-02-17 06:25:02 UTC INFO  [Thread-4] gobblin.util.ExecutorsUtils  144
> - Successfully shutdown ExecutorService:
> java.util.concurrent.Executors$DelegatedScheduledExecutorService@185aaf1f
> 2016-02-17 06:25:02 UTC INFO  [Thread-4]
> org.apache.helix.manager.zk.ZKHelixManager  546 - disconnect
> ip-169-0-0-1(SPECTATOR) from GobblinYarn
> 2016-02-17 06:25:02 UTC INFO  [Thread-4]
> org.apache.helix.messaging.handling.HelixTaskExecutor  679 - Shutting down
> HelixTaskExecutor
> 2016-02-17 06:25:02 UTC INFO  [Thread-4]
> org.apache.helix.messaging.handling.HelixTaskExecutor  443 - Reset
> HelixTaskExecutor
> 2016-02-17 06:25:02 UTC INFO  [Thread-4]
> org.apache.helix.messaging.handling.HelixTaskExecutor  453 - Reset exectuor
> for msgType: TASK_REPLY, pool:
> java.util.concurrent.ThreadPoolExecutor@3f197a46[Running, pool size = 0,
> active threads = 0, queued tasks = 0, completed tasks = 0]
> 2016-02-17 06:25:02 UTC INFO  [Thread-4]
> org.apache.helix.messaging.handling.HelixTaskExecutor  397 - Shutting down
> pool: java.util.concurrent.ThreadPoolExecutor@3f197a46[Running, pool size
> = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 2016-02-17 06:25:02 UTC INFO  [Thread-4]
> org.apache.helix.messaging.handling.HelixTaskExecutor  684 - Shutdown
> HelixTaskExecutor finished
> 2016-02-17 06:25:02 UTC INFO  [Thread-4]
> org.apache.helix.manager.zk.ZkClient  130 - Closing zkclient:
> State:CONNECTED Timeout:3 sessionid:0xd452eb397b640065 local:/
> 169.0.0.1:51319 remoteserver:ip-138-0-0-1.ec2.internal/138.0.0.2:2181
> lastZxid:60129782948 xid:17 sent:140 recv:140 queuedpkts:0 pendingresp:0
> queuedevents:0
> 2016-02-17 06:25:02 UTC INFO  [ZkClient-EventThread-17-zk.server:2181]
> org.I0Itec.zkclient.ZkEventThread  82 - Terminate ZkClient event thread.
> 2016-02-17 06:25:02 UTC INFO  [main-EventThread]
> org.apache.zookeeper.ClientCnxn$EventThread  512 - EventThread shut down
> 2016-02-17 06:25:02 UTC INFO  [Thread-4] org.apache.zookeeper.ZooKeeper
> 684 - Session: 0xd452eb397b640065 closed
> 2016-02-17 06:25:02 UTC INFO  [Thread-4]
> org.apache.helix.manager.zk.ZkClient  157 - Closed zkclient
> 2016-02-17 06:25:02 UTC INFO  [Thread-4]
> org.apache.helix.manager.zk.ZKHelixManager  570 - Cluster manager:
> ip-169-0-0-1 disconnected
> 2016-02-17 06:25:02 UTC INFO  [Thread-4]
> gobblin.yarn.GobblinYarnAppLauncher  722 - Deleting application working
> directory hdfs://
> ec2-145-0-0-1.compute-1.amazonaws.com:9000/user/yarn/GobblinYarn/application_1455654714320_0004
> {code}
>
> 

Re: Mapping from a resource partition to a logical data partition

2016-02-03 Thread kishore g
This is needed for another project I am working on. There is no reason for
Helix to depend on this convention. I will fix this.



On Wed, Feb 3, 2016 at 5:51 PM, ShaoFeng Shi <shaofeng...@apache.org> wrote:

> Hello,
>
> I'm trying to use Helix (0.7.1) to manage our resource partitions on a
> cluster. My scenario is, every 5 minutes a partition will be added; What I
> do now is, get the ideal state, +1 for the partition number and then update
> it with ZKHelixAdmin. With a rebalance action, the new partition will be
> created and assigned to an instance. What the instance can get from Helix
> is the resource name and partition id. To map the parition to my logical
> data, I maintained a mapping table in our datastore, which looks like:
>
> {
>   "resource_0": 20160101_ 201601010005,
>   "resource_1": 201601010005_ 201601010010,
>   ...
>   "resource_11": 201601010055_ 201601010100
> }
>
> Here it has 12 partitions; Now I want to discard some old partitions, say
> the first 6 partitions;  It seems in Helix the partitions must start from
> 0, so with an update on the IdealState, set # of partitions to 6, the new
> partitions on the cluster would become to:
>
> resource_0,resource_1, ..., resource_5,
>
> To make sure the partitions wouldn't be wrongly mapped, I need update my
> mapping table before the rebalance. While that may not ensure the atomic
> between the two updates.
>
> So my question is, what's the suggested way to do the resource partition
> mapping? does Helix allow user to specify additional information on a
> partition (if have this, I don't need maintain the mapping outside)? Can we
> have some simple APIs like addPartition(String parititionName),
> dropParitition(String partitionName), just like that for resource? The
> numeric paritition id can be an internal ID and not exposed to user.
>
> I guess many entry users will have such questions. Just raise this for a
> broad discussion; Any comment and suggestion is welcomed. Thanks for your
> time!
>
>
> --
> Best regards,
>
> Shaofeng Shi
>


Re: "listFields" always be empty under FULL_AUTO mode?

2016-02-02 Thread kishore g
Hi ShaoFeng,

The rebalancer ensures even distribution within a resource. Across
resources, it uses hashing based on resource names.
So if you add a few more resources, you might start seeing the distribution
change.

Another option is to add one partition every 5 minutes to the same resource.
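
For illustration, a minimal sketch of that second option with ZKHelixAdmin
(the cluster/resource names, ZK address, and replica count of 3 are
placeholders, not values from this thread):

import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class AddPartitionExample {
  public static void main(String[] args) {
    ZKHelixAdmin admin = new ZKHelixAdmin("localhost:2181");
    // Bump the partition count by one and write the ideal state back.
    IdealState idealState =
        admin.getResourceIdealState("myCluster", "myResource");
    idealState.setNumPartitions(idealState.getNumPartitions() + 1);
    admin.setResourceIdealState("myCluster", "myResource", idealState);
    // Recompute the assignment so the new partition gets placed.
    admin.rebalance("myCluster", "myResource", 3); // 3 = assumed replica count
  }
}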

thanks,
Kishore G

On Tue, Feb 2, 2016 at 6:43 AM, ShaoFeng Shi <shaofeng...@apache.org> wrote:

> Hello Helix developers,
>
> I'm trying to use Helix (0.7.1) to manage some resources; every 5 minutes,
> such a resource will be added to the cluster; Each resource only has 1
> partition, and its state model is "LEADER_STANDBY" (the leader will do
> some exclusive job); the rebalance mode is FULL_AUTO. Now the cluster has
> two instances.  The interesting thing I found is, all the LEADERs are
> assigned to the first node of my cluster, and the other node carries all
> STANDBYs. Although when I disable the first node, all the LEADERs are
> transferred to the second node automatically, I expect there could be some
> load balance among them: half leaders on the first node, the others on the
> second. Did I do something wrong, or this is expected?
>
> Another issue I faced is, on the helix-ui, it can show all resources, but
> when selecting a resource, the "partitions" tab couldn't show which
> instances it is assigned to; the only message is "No partitions of
>   are assigned!"
>
> I checked the UI code, it indicates the reason is the "listFields" is
> empty. That matches with what I see with the helix-admin.sh
> --listResourceInfo:
>
> IdealState for Resource_Stream_145439190_145439220:
> {
>   "id" : "Resource_Stream_145439190_145439220",
>   "mapFields" : {
> "Resource_Stream_145439190_145439220" : {
> }
>   },
>   "listFields" : {
> "Resource_Stream_145439190_145439220" : [ ]
>   },
>   "simpleFields" : {
> "IDEAL_STATE_MODE" : "AUTO_REBALANCE",
> "NUM_PARTITIONS" : "1",
> "REBALANCE_MODE" : "FULL_AUTO",
> "REPLICAS" : "3",
> "STATE_MODEL_DEF_REF" : "LeaderStandby",
> "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
>   }
> }
>
> ExternalView for Resource_Stream_145439190_145439220:
> {
>   "id" : "Resource_Stream_145439190_145439220",
>   "mapFields" : {
> "Resource_Stream_145439190_145439220" : {
>   "kylin-dev3_8080" : "LEADER",
>   "kylin-dev4_8080" : "STANDBY"
> }
>   },
>   "listFields" : {
>   },
>   "simpleFields" : {
> "BUCKET_SIZE" : "0"
>   }
> }
>
> I checked the code; it seems for FULL_AUTO, the values for "listFields" and
> "mapFields" are empty (at least at the beginning). Will they be updated at
> some point in time, or how could I trigger that?
>
> Thanks in advance for your help; I'm new to Helix so may not get some
> concepts correctly. Just correct me if I'm wrong, thanks!
>
> --
> Best regards,
>
> Shaofeng Shi
>


Re: Run helix-ui in Tomcat

2016-01-31 Thread kishore g
Will this work in this scenario?
https://github.com/rvs-fluid-it/wizard-in-a-box

On Sun, Jan 31, 2016 at 8:27 AM, Greg Brandt <brandt.g...@gmail.com> wrote:

> Hey Shaofeng,
>
> You can try using this helper class if you want to deploy the helix-ui in
> a war:
>
>
> https://github.com/apache/helix/blob/master/helix-ui/src/main/java/org/apache/helix/ui/util/DropWizardApplicationRunner.java
>
> Since the helix-ui module is intended to be packaged and run as a shaded
> jar from command line, you might have to be a little careful of dependency
> versions if your tomcat server has a complicated classpath.
>
> Also here is a link to the latest drop wizard configuration docs
> https://dropwizard.github.io/dropwizard/0.7.1/docs/manual/configuration.html
>
> -Greg
>
> > On Jan 31, 2016, at 6:08 AM, ShaoFeng Shi <shaofeng...@apache.org>
> > wrote:
> >
> > Hello,
> >
> > Just want to get some advice on how to run helix-ui in a Tomcat instance;
> > is there any doc or guidance? Our app already runs in a Tomcat, and if
> > helix-ui can be packaged as a standard war, that would be very easy to
> > ship and deploy. I checked this page:
> > https://github.com/apache/helix/tree/master/helix-ui , but the
> > “Dropwizard Configuration Reference” link is reporting a 404 error now.
> >
> > Thanks for any suggestion!
> >
> > --
> > Best regards,
> >
> > Shaofeng Shi
>


Re: State transitions of partitions and Distributed Cluster controllers

2016-01-27 Thread kishore g
Thanks for sending it again.

I looked at the code; even though the retry is handled on the participant,
it looks like we are not setting it for state transition messages. We do
have the ability to set it for custom message types.

The fix is easy; we just need to call message.setRetryCount in this class:

https://github.com/apache/helix/blob/9e51cb7bdf8424df46c6fa353e7c80d984c21193/helix-core/src/main/java/org/apache/helix/controller/stages/MessageGenerationStage.java

We can read the retry count from cluster config.
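
A hedged sketch of that change (the MESSAGE_RETRY_COUNT key and the config
record parameter are assumptions for illustration, not existing Helix
settings):

import org.apache.helix.ZNRecord;
import org.apache.helix.model.Message;

public class RetryCountStamp {
  // Read a retry count from the cluster config record and stamp it on the
  // outgoing state transition message, so the participant retries a failed
  // transition that many times before marking it failed.
  public static void stampRetryCount(ZNRecord clusterConfigRecord,
      Message message) {
    int retryCount = clusterConfigRecord.getIntField("MESSAGE_RETRY_COUNT", 0);
    message.setRetryCount(retryCount);
  }
}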

There was another email I had recently sent with instructions to set up
distributed controller. In short the steps are

helixadmin create-cluster super_cluster
helixadmin addInstance super_cluster  controller1
helixadmin addInstance super_cluster  controller2
helixadmin addInstance super_cluster  controller3

start the three controllers in distributed mode and provide super_cluster as
the cluster name.

Now any time you create a cluster, you can add that cluster as a resource
in the super_cluster. One of the controllers will automatically start
managing the new cluster. For example:
helixadmin create-cluster cluster1
helixadmin addresource super-cluster cluster1 AUTO mode leaderstandbymodel

I don't remember the exact commands off the top of my head, but it should
look something like that.
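
For reference, a minimal sketch of starting one such controller from Java
via HelixControllerMain (the instance name and ZK address are placeholders):

import org.apache.helix.HelixManager;
import org.apache.helix.controller.HelixControllerMain;

public class DistributedControllerExample {
  public static void main(String[] args) {
    // Joins super_cluster in DISTRIBUTED mode; one of the controllers
    // started this way becomes the leader for each managed cluster.
    HelixManager manager = HelixControllerMain.startHelixController(
        "localhost:2181", "super_cluster", "controller1",
        HelixControllerMain.DISTRIBUTED);
    // Keep the process alive; call manager.disconnect() on shutdown.
  }
}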

Yes, reset will be called when you lose the ZK session. It will also be
invoked when a partition goes to the ERROR state and you want to get back to
the OFFLINE state. (I am not 100% sure whether the reset API is invoked or
the ERROR-to-OFFLINE transition is invoked.) Jason might be able to answer
that.

Hope that helps.


On Wed, Jan 27, 2016 at 10:51 AM, Subramanian Raghunathan <
subramanian.raghunat...@integral.com> wrote:

> Hi Helix Team ,
>
>
>
> I am evaluating Helix as a cluster management framework. I
> believe it’s very modular, highly customizable, with a variety of
> out-of-the-box capabilities. Kudos to the team!
>
>
>
> I have the below queries :
>
>
>
> 1)  How to configure the number of retries in state transition
> handlers?
>
> http://markmail.org/message/vgc4nksocolqiqx5
>
> I referred to this particular mail conversation: “you
> can configure the number of retries before a transition is considered as
> failed”
>
>
>
> 2)   Please point me to an example/interfaces of starting a
> distributed cluster controller and how to add the various clusters that the
> controller is supposed to manage.
>
>
>
> 3)  What would be the event life cycle of the reset() method in
> TransitionHandler
>
> a.   Believe this gets called if zookeeper client session is lost or
> there’s an update to the cluster configuration
>
>
>
> Note: I am using the “helix-0.7.1” version.
>
>
>
> Thanks & Regards,
>
> Subramanian Raghunathan
>


Re: What is correct configurations to build Helix master with unit tests.

2016-01-05 Thread kishore g
The ulimit seems to be very small. Can you set it to something like 5 or
unlimited if possible?

thanks,
Kishore G

On Tue, Jan 5, 2016 at 1:21 PM, Shameera Rathnayaka <shame...@apache.org>
wrote:

> Hi Devs,
>
> I am new to Apache Helix and pretty interested in the project. I tried
> building Helix master with "mvn clean install" as mentioned in [1]; what
> I want is to build the project with unit tests. But the build fails with
> the following error, and I suspect that a set of integration tests also
> runs with this command.
>
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:714)
>
> I ran "ulimit -u" and the output is 709. What is the correct value I need
> to set to resolve the above issue? Are there any other configuration
> parameters I need to care about to avoid build issues like this?
>
> As a distributed systems enthusiast I would like to contribute to the
> project. I have a fair background in distributed systems from what we (a
> group of 4 students, including myself) did for our undergraduate project,
> A Distributed System Management Framework with User-Defined Management
> Logic. Here is the paper we published at the ITNG conference (
> http://dl.acm.org/citation.cfm?id=2497038), and I am contributing to
> Apache Airavata, which is also a distributed system framework for Science
> Gateways.
>
> [1] https://issues.apache.org/jira/browse/HELIX-77
>


Re: dropwizard-helix

2015-12-07 Thread kishore g
This is interesting. I will try to read up some more about this framework.
Are there any other frameworks that support similar behavior? Is this
comparable to Spring, Guice, etc.?



On Sat, Dec 5, 2015 at 3:49 PM, Greg Brandt brandt.g...@gmail.com wrote:

> Hey,
>
> If you're not familiar with it, Dropwizard (http://www.dropwizard.io/) is
> a really good Java framework for building web applications. It has
> basically everything you'd need to build a web application - JDBI (or
> Hibernate), views, authentication, etc. and leverages solid existing
> components (Jetty, Jersey, Jackson, Metrics).
>
> One thing that it is lacking is the ability to easily deploy distributed
> applications, so I created a module that aims to make it very simple to use
> Helix to accomplish that: https://github.com/brandtg/dropwizard-helix
>
> Currently it only does really simple service discovery, but I think there
> are a lot of complementary features between the two frameworks that we
> could explore. For example, Dropwizard's task interface (
> http://www.dropwizard.io/0.9.1/docs/manual/core.html#tasks) could
> leverage Helix's task framework to make running distributed tasks easy.
>
> My idea is to basically offer the Helix recipes as Dropwizard bundles,
> requiring no code from the user other than adding the bundle / config, then
> have an advanced mode where one can basically just provide state transition
> handler / factory class for things like master / slave. But the bundle
> would take care of all the details like computing instance name,
> connecting, etc.
>
> You can see it in use here:
> https://github.com/brandtg/dropwizard-helix/blob/master/src/test/java/com/github/brandtg/discovery/TestHelixServiceDiscoveryBundle.java
>
> -Greg
>


Re: A possible leak of ZkClient object in helix?

2015-10-21 Thread kishore g
Hi Subbu,

Can you show the jmap output that showed that the number of ZkClients
increased?

Jason, simulating this is straight forward.

- start ZK
- set up a dummy cluster
- start a dummy server/broker and controller
- create a resource
- stop the ZK server and restart it after setting jute.maxBuffer to a very
low number like 10.

This will make the client go into a connecting/disconnecting loop.  All the
while we can run

jmap -histo:live and see that the ZkClient object count keeps increasing.

I looked at the code and can't find any place where we would be creating new
ZkClients. I don't think the ZkPropertyStore is needed to reproduce this
issue.
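
For context, the property-store-plus-listener setup described in the report
looks roughly like this (the ZK address and path are placeholders):

import org.apache.helix.ZNRecord;
import org.apache.helix.manager.zk.ZNRecordSerializer;
import org.apache.helix.store.HelixPropertyListener;
import org.apache.helix.store.zk.ZkHelixPropertyStore;

public class PropertyStoreWatch {
  public static void main(String[] args) {
    ZkHelixPropertyStore<ZNRecord> store = new ZkHelixPropertyStore<ZNRecord>(
        "localhost:2181", new ZNRecordSerializer(), "/MYCLUSTER/PROPERTYSTORE");
    // Subscribing at the root sets watches on every node under the path,
    // which is what can make the reconnect payload exceed jute.maxbuffer.
    store.subscribe("/", new HelixPropertyListener() {
      public void onDataChange(String path) { }
      public void onDataCreate(String path) { }
      public void onDataDelete(String path) { }
    });
  }
}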






On Wed, Oct 21, 2015 at 9:25 PM, Zhen Zhang  wrote:

> Hi Subbu,
>
> I don't think it's ZkClient leak. Each ZkClient has a ZkEventThread inside:
>
> https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkClient.java#L70
>
> The ZkEventThread has a unique id:
>
> https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkEventThread.java#L59
>
> From the log it seems the ZkEventThread remains the same "
> ZkClient-EventThread-34", so there is only one zk client.
>
> One reason I can think of is zk watch leak. This jira describes the issue:
> https://issues.apache.org/jira/browse/HELIX-527
>
> Does restarting the controller resolve the issue?
>
> Thanks,
> Jason
>
> On Wed, Oct 21, 2015 at 7:09 PM, Subbu Subramaniam <
> ssubraman...@linkedin.com.invalid> wrote:
>
> > Hi,
> >
> > We are using helix-0.6.5.22 version of helix for pinot development (
> > https://github.com/linkedin/pinot)
> >
> > The pinot controller  constructs a ZkHelixPropertyStore() and adds a
> > listener to listen to changes on all nodes under the path specified.
> >
> > If the zookeeper server is restarted, then the controller tries to
> > re-connect and send the entire watch list to the server. Since there is a
> > 1M limit on the size of the message that the server can receive, the
> server
> > drops the connection.
> >
> > At this point, we see that the controller goes into a spin, creating a
> new
> > ZkClient object as the connection moves between SyncConnected and
> > Disconnected states.
> >
> > I have not tried to reproduce this problem independent of
> pinot-controller,
> > but I suspect the following steps could do it.
> >
> >
> >- Start zk server
> >- Start a program that calls new ZkHelixPropertyStore() that ends up
> >setting watches on a bunch of nodes that may exceed 10k bytes (total
> of
> > all
> >path name lengths).
> >- Stop the zk server
> >- Restart the zk server setting jute.maxbuffer to 10k.
> >- The client will probably go into a spin leaking ZkClient objects.
> >
> >
> > The objects seem to be in a LinkedBlockingQueue, since jvisualvm also
> shows
> > LinkedBlockingQueue$Node increasing proportionately by the same number.
> >
> > Here is a sequence of error messages that may help:
> >
> > 2015/10/21 17:25:18.780 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Unable to read additional data from server
> sessionid
> > 0x1508cecdce10001, likely server has closed socket, closing socket
> > connection and attempting reconnect
> > 2015/10/21 17:25:18.880 INFO [ZkClient] [main-EventThread]
> > [pinot-controller] [] zookeeper state changed (Disconnected)
> > 2015/10/21 17:25:18.880 INFO [ZKHelixManager]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001,
> instance:
> > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
> > 2015/10/21 17:25:19.741 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Opening socket connection to server localhost/
> > 127.0.0.1:2181
> > 2015/10/21 17:25:19.741 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Socket connection established to localhost/
> > 127.0.0.1:2181, initiating session
> > 2015/10/21 17:25:19.743 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Session establishment complete on server localhost/
> > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout =
> 3
> > 2015/10/21 17:25:19.743 INFO [ZkClient] [main-EventThread]
> > [pinot-controller] [] zookeeper state changed (SyncConnected)
> > 2015/10/21 17:25:19.743 INFO [ZKHelixManager]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> KeeperState:
> > SyncConnected, zookeeper:State:CONNECTED Timeout:3
> > sessionid:0x1508cecdce10001 local:/127.0.0.1:50492
> remoteserver:localhost/
> > 127.0.0.1:2181 lastZxid:730 xid:465 sent:476 recv:476 queuedpkts:0
> > pendingresp:1 queuedevents:0
> > 2015/10/21 17:25:19.744 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Unable to read additional data from server
> sessionid
> > 0x1508cecdce10001, likely server has closed socket, closing socket
> > connection and attempting 

Re: [jira] [Commented] (HELIX-609) Message sending/broadcasting to participants do not work for the Task Framework

2015-10-03 Thread kishore g
Hi Yinan,

The PR is not getting applied cleanly. Can you please rebase from master? I
will apply the patch after that.

thanks,
Kishore G

On Sat, Oct 3, 2015 at 5:41 PM, kishore g <g.kish...@gmail.com> wrote:

> Will merge it tonight
>
> On Fri, Oct 2, 2015 at 3:41 PM, ASF GitHub Bot (JIRA) <j...@apache.org>
> wrote:
>
>>
>> [
>> https://issues.apache.org/jira/browse/HELIX-609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941906#comment-14941906
>> ]
>>
>> ASF GitHub Bot commented on HELIX-609:
>> --
>>
>> Github user liyinan926 commented on the pull request:
>>
>> https://github.com/apache/helix/pull/34#issuecomment-145173417
>>
>> @kishoreg just wondering when can you merge #34 and #35? Are they
>> going into 0.7.2?
>>
>>
>> > Message sending/broadcasting to participants do not work for the Task
>> Framework
>> >
>> ---
>> >
>> > Key: HELIX-609
>> > URL: https://issues.apache.org/jira/browse/HELIX-609
>> > Project: Apache Helix
>> >  Issue Type: Bug
>> >Reporter: Yinan Li
>> >
>>
>>
>>
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v6.3.4#6332)
>>
>
>


Helix report

2015-09-10 Thread kishore g
Can anyone help me with the quarterly Helix report? It was due yesterday.

thanks,
Kishore G


Re: Helix maven archetype

2015-08-22 Thread kishore g
Wow, this is awesome. Mind adding a wiki and a pointer to quickstart.md?
On Aug 22, 2015 12:40 PM, Vinoth Chandar vin...@uber.com wrote:

 +1 pretty nice

 Thanks,
 Vinoth




 On Sat, Aug 22, 2015 at 11:58 AM -0700, Greg Brandt 
 brandt.g...@gmail.com wrote:

 Hey,

 I put together a Helix archetype that new users can use to get started
 (or old users who don't want to remember everything when starting a new
 application...)

 It's available here: https://github.com/brandtg/helix-archetype

 You can generate like this:

 mvn archetype:generate \
   -DarchetypeGroupId=org.apache.helix \
   -DarchetypeArtifactId=helix-archetype \
   -DarchetypeVersion=1.0-SNAPSHOT \
   -DgroupId=com.example \
   -DartifactId=my-app \
   -Dname=MyApp \
   -DinteractiveMode=false

 Then you'll get a project structure like this:

  tree
 .
 ├── pom.xml
 └── src
 └── main
 ├── java
 │   ├── MyAppMain.java
 │   ├── participant
 │   │   ├── MyAppParticipant.java
 │   │   ├── MyAppStateTransitionHandler.java
 │   │   └── MyAppStateTransitionHandlerFactory.java
 │   └── spectator
 │   └── MyAppSpectator.java
 └── resources
 └── log4j.xml

 And an executable that contains all your cluster roles, and wraps
 ClusterSetup:

  java -jar target/my-app-1.0-SNAPSHOT.jar
 usage: mode args...

 Modes are: participant, controller, zookeeper, setup.

 Let me know what you all think. If this is good, we could put it in trunk.

 Thanks,
 -Greg




Re: [jira] [Commented] (HELIX-597) HelixAdmin should have a method to drop a state model def.

2015-05-20 Thread kishore g
Can you upload the diff to reviews.apache.org? It helps us to capture the
review comments without having to pull the code.

On Wed, May 20, 2015 at 8:54 PM, Vinayak Borkar (JIRA) j...@apache.org
wrote:


 [
 https://issues.apache.org/jira/browse/HELIX-597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553579#comment-14553579
 ]

 Vinayak Borkar commented on HELIX-597:
 --

 Added fix to branch vinayakb/597_dropStateModelDef. Please review and let
 me know. If it looks good, I will merge it into master.

  HelixAdmin should have a method to drop a state model def.
  --
 
  Key: HELIX-597
  URL: https://issues.apache.org/jira/browse/HELIX-597
  Project: Apache Helix
   Issue Type: Improvement
   Components: helix-core
 Reporter: Vinayak Borkar
 
  Currently HelixAdmin has calls for adding and getting state model defs,
 but there is no call to remove or drop them.



 --
 This message was sent by Atlassian JIRA
 (v6.3.4#6332)



helix-ui dependency

2015-04-12 Thread kishore g
The helix-ui module has the following dependencies (close to 100).

Are the licenses compatible? Also, do we need all these jars? Can we
exclude the ones we don't need?

validation-api-1.1.0.Final.jar
javax.servlet-api-3.1.0.jar
javax.ws.rs-api-2.0.1.jar
javax.annotation-api-1.2.jar
dropwizard-core-0.8.0.jar
dropwizard-util-0.8.0.jar
jackson-annotations-2.5.0.jar
guava-18.0.jar
jsr305-3.0.0.jar
joda-time-2.7.jar
dropwizard-jackson-0.8.0.jar
jackson-core-2.5.1.jar
jackson-databind-2.5.1.jar
jackson-datatype-jdk7-2.5.1.jar
jackson-datatype-guava-2.5.1.jar
jackson-module-afterburner-2.5.1.jar
jackson-datatype-joda-2.5.1.jar
slf4j-api-1.7.10.jar
logback-classic-1.1.2.jar
logback-core-1.1.2.jar
dropwizard-validation-0.8.0.jar
hibernate-validator-5.1.3.Final.jar
jboss-logging-3.1.3.GA.jar
classmate-1.0.0.jar
javax.el-3.0.0.jar
dropwizard-configuration-0.8.0.jar
jackson-dataformat-yaml-2.5.1.jar
snakeyaml-1.12.jar
commons-lang3-3.3.2.jar
dropwizard-logging-0.8.0.jar
metrics-logback-3.1.0.jar
metrics-core-3.1.0.jar
jul-to-slf4j-1.7.10.jar
log4j-over-slf4j-1.7.10.jar
jcl-over-slf4j-1.7.10.jar
jetty-util-9.2.9.v20150224.jar
dropwizard-metrics-0.8.0.jar
dropwizard-lifecycle-0.8.0.jar
jetty-server-9.2.9.v20150224.jar
jetty-http-9.2.9.v20150224.jar
jetty-io-9.2.9.v20150224.jar
dropwizard-jersey-0.8.0.jar
jersey-server-2.16.jar
jersey-common-2.16.jar
jersey-guava-2.16.jar
hk2-api-2.4.0-b09.jar
hk2-utils-2.4.0-b09.jar
aopalliance-repackaged-2.4.0-b09.jar
javax.inject-2.4.0-b09.jar
hk2-locator-2.4.0-b09.jar
javassist-3.18.1-GA.jar
osgi-resource-locator-1.0.1.jar
jersey-client-2.16.jar
jersey-media-jaxb-2.16.jar
jersey-metainf-services-2.16.jar
metrics-jersey2-3.1.0.jar
metrics-annotation-3.1.0.jar
jackson-jaxrs-json-provider-2.5.1.jar
jackson-jaxrs-base-2.5.1.jar
jackson-module-jaxb-annotations-2.5.1.jar
jersey-container-servlet-2.16.jar
jersey-container-servlet-core-2.16.jar
jetty-webapp-9.2.9.v20150224.jar
jetty-xml-9.2.9.v20150224.jar
jetty-servlet-9.2.9.v20150224.jar
jetty-security-9.2.9.v20150224.jar
jetty-continuation-9.2.9.v20150224.jar
dropwizard-servlets-0.8.0.jar
dropwizard-jetty-0.8.0.jar
metrics-jetty9-3.1.0.jar
jetty-servlets-9.2.9.v20150224.jar
metrics-jvm-3.1.0.jar
metrics-servlets-3.1.0.jar
metrics-healthchecks-3.1.0.jar
metrics-json-3.1.0.jar
argparse4j-0.4.4.jar
jetty-setuid-java-1.0.2.jar
dropwizard-assets-0.8.0.jar
dropwizard-views-freemarker-0.8.0.jar
dropwizard-views-0.8.0.jar
freemarker-2.3.21.jar
zookeeper-3.3.4.jar
jline-0.9.94.jar
jackson-core-asl-1.8.5.jar
jackson-mapper-asl-1.8.5.jar
commons-io-2.2.jar
commons-cli-1.2.jar
zkclient-0.1.jar
commons-math-2.1.jar
commons-codec-1.6.jar
testng-6.0.1.jar
junit-4.11.jar
hamcrest-core-1.3.jar
bsh-2.0b4.jar
jcommander-1.12.jar


Re: [VOTE] Apache Helix 0.6.5 Release [RC2]

2015-03-23 Thread kishore g
Trying to make the release. How do I get the release notes from JIRA?

thanks,
Kishore G

On Thu, Mar 19, 2015 at 10:20 AM, kishore g g.kish...@gmail.com wrote:

 Here is my +1 as well.

 On Thu, Mar 19, 2015 at 10:15 AM, Kanak Biscuitwala kana...@hotmail.com
 wrote:

 +1


 
  Date: Thu, 19 Mar 2015 08:39:39 -0700
  Subject: Re: [VOTE] Apache Helix 0.6.5 Release [RC2]
  From: nehzgn...@gmail.com
  To: dev@helix.apache.org
 
  +1
  On Mar 19, 2015 12:49 AM, kishore g g.kish...@gmail.com wrote:
 
  Hi,
 
  This is to call for a vote on releasing the following candidate as
 Apache
  Helix 0.6.5. This is the 8th release of Helix as an Apache project, as
 well
  as the 4th release as a top-level Apache project.
 
  Apache Helix is a generic cluster management framework that makes it
 easy
  to build partitioned and replicated, fault-tolerant and scalable
  distributed systems.
 
  Release notes:
  http://helix.apache.org/releasenotes/release-0.6.5.html
 
  Release artifacts:
  https://repository.apache.org/content/repositories/orgapachehelix-1005
 
  Distribution:
  * binaries:
  https://dist.apache.org/repos/dist/dev/helix/0.6.5/binaries/
  * sources:
  https://dist.apache.org/repos/dist/dev/helix/0.6.5/src/
 
  The 0.6.5 release tag:
 
 
 https://git-wip-us.apache.org/repos/asf?p=helix.git;a=tag;h=refs/tags/helix-0.6.5
 
  KEYS file available here:
  https://dist.apache.org/repos/dist/dev/helix/KEYS
 
  Please vote on the release. The vote will be open for at least 72
 hours.
 
  [+1] -- YES, release
  [0] -- No opinion
  [-1] -- NO, do not release
 
  Thanks,
  The Apache Helix Team
 






[VOTE] Apache Helix 0.6.5 Release [RC2]

2015-03-19 Thread kishore g
Hi,

This is to call for a vote on releasing the following candidate as Apache
Helix 0.6.5. This is the 8th release of Helix as an Apache project, as well
as the 4th release as a top-level Apache project.

Apache Helix is a generic cluster management framework that makes it easy
to build partitioned and replicated, fault-tolerant and scalable
distributed systems.

Release notes:
http://helix.apache.org/releasenotes/release-0.6.5.html

Release artifacts:
https://repository.apache.org/content/repositories/orgapachehelix-1005

Distribution:
* binaries:
https://dist.apache.org/repos/dist/dev/helix/0.6.5/binaries/
* sources:
https://dist.apache.org/repos/dist/dev/helix/0.6.5/src/

The 0.6.5 release tag:
https://git-wip-us.apache.org/repos/asf?p=helix.git;a=tag;h=refs/tags/helix-0.6.5

KEYS file available here:
https://dist.apache.org/repos/dist/dev/helix/KEYS

Please vote on the release. The vote will be open for at least 72 hours.

[+1] -- YES, release
[0] -- No opinion
[-1] -- NO, do not release

Thanks,
The Apache Helix Team


Re: [VOTE] Apache Helix 0.6.5 Release

2015-03-17 Thread kishore g
I think we will have to redo the release. I won't be able to do this today.
Will have to start tomorrow evening.

Varun, this means the release won't be available until Friday :(

thanks,
Kishore G

On Tue, Mar 17, 2015 at 9:58 PM, Kanak Biscuitwala kana...@hotmail.com
wrote:

 I see that the ivy has been updated in the code, but the release has not
 been updated. I'm not sure if that's allowed.

 
  Date: Tue, 17 Mar 2015 15:00:13 -0700
  Subject: Re: [VOTE] Apache Helix 0.6.5 Release
  From: nehzgn...@gmail.com
  To: dev@helix.apache.org
 
  I've fixed the ivy issue as well as updated the release notes:
 
 https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12314020&version=12328652
 
  Release Notes - Apache Helix - Version 0.6.5
 
  ** Bug
  * [HELIX-512] - add back HelixManager#getHealthReportCollector()
  interface to 0.6.x
  * [HELIX-514] - ZkBaseDataAccessor#set() should throw
  BadVersionException instead of return false in case of version mismatch
  * [HELIX-518] - Add integration tests to ensure helix tasks work as
  expected during master failover
  * [HELIX-519] - Add integration tests to ensure that kill-switch for
  Helix tasks work as expected
  * [HELIX-521] - Should not start
  GenericHelixController#ClusterEventProcessor in types other than
 CONTROLLER
  and CONTROLLER_PARTICIPANT
  * [HELIX-537] - org.apache.helix.task.TaskStateModel should have a
  shutdown method.
  * [HELIX-541] - Possible livelock in Helix controller
  * [HELIX-547] - AutoRebalancer may not converge in some rare situation
  * [HELIX-549] - Discarding Throwable exceptions makes threads
  unkillable.
  * [HELIX-550] - ZKHelixManager does not shutdown GenericHelixController
  threads.
  * [HELIX-552] - StateModelFactory#_stateModelMap should use both
  resourceName and partitionKey to map a state model
  * [HELIX-555] - ClusterStateVerifier leaks ZkClients.
  * [HELIX-559] - Helix web admin performance issues
  * [HELIX-562] - TaskRebalancer doesn't honor MaxAttemptsPerTask when
  FailureThreshold is larger than 0
  * [HELIX-563] - Throw more meaningful exceptions when
  AutoRebalanceStrategy#computePartitionAssignment inputs are invalid
  * [HELIX-572] - External view is recreated every time for bucketized
  resource
  * [HELIX-574] - fix bucketize resource bug in current state carryover
  * [HELIX-575] - Should not send FINALIZED callback when a bucketized
  resource is removed
  * [HELIX-579] - fix ivy files issue
 
  ** Improvement
  * [HELIX-524] - add getProgress() to Task interface
  * [HELIX-573] - Add support to compress/uncompress data on ZK
  * [HELIX-576] - Make StateModelFactory change backward compatible
 
  ** New Feature
  * [HELIX-546] - REST Admin APIs needed for helix job queue management
  * [HELIX-581] - Support deleting job from a job queue
 
  ** Task
  * [HELIX-539] - Add ivy file for helix-agent
 
  ** Test
  * [HELIX-580] - Fix test: TestBatchMessage#testSubMsgExecutionFail
 
  Thanks,
  Jason
 
 
  On Mon, Mar 16, 2015 at 11:22 PM, kishore g g.kish...@gmail.com wrote:
 
  Jason is working on the release notes. Does anyone know if we have to
  re-cut the release because of the ivy file?
 
  thanks,
  Kishore G
 
  On Mon, Mar 16, 2015 at 10:30 PM, Kanak Biscuitwala 
 kana...@hotmail.com
  wrote:
 
  A couple things:
 
 
  1. Release notes are missing. You can just link to this page if you
 want
  to update the site later, but it's currently not up-to-date:
 
 
 https://issues.apache.org/jira/browse/HELIX/fixforversion/12328652/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
 
 
  2. Rat is failing because of this file:
  helix-agent/helix-agent-0.6.5-SNAPSHOT.ivy
 
 
  
  Date: Mon, 16 Mar 2015 10:19:27 -0700
  Subject: Re: [VOTE] Apache Helix 0.6.5 Release
  From: nehzgn...@gmail.com
  To: dev@helix.apache.org
 
  +1
 
  On Mon, Mar 16, 2015 at 12:15 AM, kishore g g.kish...@gmail.com
  wrote:
 
  Hi,
 
  This is to call for a vote on releasing the following candidate as
  Apache
  Helix 0.6.5. This is the 8th release of Helix as an Apache project,
 as
  well
  as the 4th release as a top-level Apache project.
 
  Apache Helix is a generic cluster management framework that makes it
  easy
  to build partitioned and replicated, fault-tolerant and scalable
  distributed systems.
 
  Release notes:
  http://helix.apache.org/releasenotes/release-0.6.5.html
 
  Release artifacts:
 
  https://repository.apache.org/content/repositories/orgapachehelix-1004
 
  Distribution:
  * binaries:
  https://dist.apache.org/repos/dist/dev/helix/0.6.5/binaries/
  * sources:
  https://dist.apache.org/repos/dist/dev/helix/0.6.5/src/
 
  The 0.6.5 release tag:
 
 
 
 
 https://git-wip-us.apache.org/repos/asf?p=helix.git;a=tag;h=refs/tags/helix-0.6.5
 
  KEYS file available here:
  https://dist.apache.org/repos/dist/dev/helix/KEYS
 
  Please vote on the release. The vote will be open for at least 72

Re: Cutting a release 0.6.5 tonight

2015-03-11 Thread kishore g
Will write a test case for migration and document the steps
On Mar 11, 2015 11:37 AM, Varun Sharma va...@pinterest.com wrote:

 What would be the migration path from non-compressed bucketed resources to
 compressed non-bucketed resources? It seems even the CURRENTSTATES are being
 bucketed in this case; I thought that was not expected with bucketing. Does
 the controller read these current states appropriately? To migrate, it seems
 that we would need to also rewrite the CURRENTSTATES?

 On Wed, Mar 11, 2015 at 10:02 AM, kishore g g.kish...@gmail.com wrote:

 Hi,

 I will work with Jason to cut a 0.6.5 release tonight.

 The new thing I added is the ability to enable compression while storing
 data in ZooKeeper; this allows us to go up to 100k partitions per resource
 without having to use the bucketing feature. We also fixed a few bugs with
 bucketized resources just in case someone needs them.

 The property store api needs some changes, I plan to get it in today.

 Let me know if you need any other changes to be included. Are there any
 changes that went into the 0.7.x branch that we need to merge back into
 0.6.x?

 thanks,
 Kishore G









Re: Cutting a release 0.6.5 tonight

2015-03-11 Thread kishore g
Thanks Lei, I fixed the first two.

Here is what caused the failure. I added the code to copy all simple fields
from the IS to the EV when we update the ExternalView in the controller. This
is to enable compression in the ExternalView if it's set in the IS. I could
have copied only the enableCompression variable, but I thought it's good to
have the partition number/replica count etc. in the ExternalView as well. Let
me know if you foresee any problem with this. The only thing I can think of
is when the IdealState is deleted, in which case I copy the simpleFields from
the existing ExternalView.
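
A minimal sketch of that copy (variable names are illustrative; both models
expose their underlying ZNRecord):

import org.apache.helix.model.ExternalView;
import org.apache.helix.model.IdealState;

public class SimpleFieldCopy {
  // Copy all simple fields from the IdealState record onto the ExternalView
  // record, so flags like enableCompression carry over.
  public static void copySimpleFields(IdealState idealState,
      ExternalView externalView) {
    externalView.getRecord().getSimpleFields()
        .putAll(idealState.getRecord().getSimpleFields());
  }
}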

I don't understand why the TestSchedulerMessage test cases are failing.
Uncommenting my code does not help either.


On Wed, Mar 11, 2015 at 4:06 PM, Kanak Biscuitwala kana...@hotmail.com
wrote:

 The first two are concerning. I don't think the scheduler message test
 failures should block releases.

 
  From: l...@linkedin.com.INVALID
  To: dev@helix.apache.org
  CC: u...@helix.apache.org
  Subject: RE: Cutting a release 0.6.5 tonight
  Date: Wed, 11 Mar 2015 21:25:15 +
 
  The test failed are:
 
 
 org.apache.helix.integration.TestExternalViewUpdates.testExternalViewUpdates
 
 org.apache.helix.integration.TestEnableCompression.testEnableCompressionResource
  org.apache.helix.integration.TestSchedulerMessage.testSchedulerMsg3
  org.apache.helix.integration.TestSchedulerMessage.testSchedulerMsg4
 
 org.apache.helix.integration.TestSchedulerMessage.testSchedulerMsgContraints
 
 org.apache.helix.integration.TestSchedulerMessage.testSchedulerMsgUsingQueue
 
  I got these from my local build too. (mvn clean install package on
 helix-0.6.x)
 
 
 
  Thanks
  Lei
 
  --
 
  Lei Xia
  Software Engineer
  Data Infrastructure/Distributed Data Systems/Nuage
  LinkedIn
 
  l...@linkedin.com
  www.linkedin.com/in/lxia1
 
  
  From: kishore g [g.kish...@gmail.com]
  Sent: Wednesday, March 11, 2015 2:03 PM
  To: dev@helix.apache.org
  Cc: u...@helix.apache.org
  Subject: Re: Cutting a release 0.6.5 tonight
 
  Hi Lei,
 
  Can you point to the failures?
 
  thanks,
  Kishore G
 
  On Wed, Mar 11, 2015 at 1:19 PM, Lei Xia l...@linkedin.com.invalid
 wrote:
 
  Hi, Kishore
 
  I saw there are regression test failures from last two recent commits
  on 0.6.x branch, running from both local box and Linkedin's hudson jobs.
  Are we going to fix them before the release?
 
 
  Thanks
  Lei
 
  --
 
  Lei Xia
  Software Engineer
  Data Infrastructure/Distributed Data Systems/Nuage
  LinkedIn
 
  l...@linkedin.com
  www.linkedin.com/in/lxia1
 
  
  From: kishore g [g.kish...@gmail.com]
  Sent: Wednesday, March 11, 2015 12:04 PM
  To: u...@helix.apache.org
  Cc: dev@helix.apache.org
  Subject: Re: Cutting a release 0.6.5 tonight
 
  Will write a test case for migration and document the steps
  On Mar 11, 2015 11:37 AM, Varun Sharma va...@pinterest.com wrote:
 
  What would be the migration path from non-compressed buckets to
  compressed
  non bucket resources ? It seems even the CURRENTSTATES are being
 bucketed
  in this case, I thought that was not expected with bucketing. Does the
  controller read these current states appropriately ? To migrate, it
 seems
  that we would need to also rewrite the CURRENT STATES ?
 
  On Wed, Mar 11, 2015 at 10:02 AM, kishore g g.kish...@gmail.com
 wrote:
 
  Hi,
 
  I will work with Jason to cut a 0.6.5 release tonight.
 
  The new thing I added is to enableCompression while storing data in
  Zookeeper, this allows us to go up to 100k partitions per resource
  without
  having to use bucketing feature. We also fixed few bugs with bucketed
  resource just in case some one needs it.
 
  The property store api needs some changes, I plan to get it in today.
 
  Let me know if you need any other changes to be included. Are there
 any
  changes that went into 0.7.x branch that we need to merge it back in
 to
  0.6.x ?
 
  thanks,
  Kishore G
 
 
 
 
 
 
 
 




Re: Use compression to store data in ZK

2015-03-08 Thread kishore g
Yeah, we still need to support it, but we can go a long way without
bucketing if we compress. We know we can support 1k partitions with raw
JSON and no bucketing. By adding compression, we can probably go up to 10k
partitions per resource (need to validate this) without bucketing.

I plan to use GZIP to compress/uncompress. Let me know if there is
something better.

This is what I am planning to do. We have a common ZNRecordSerializer to
serialize/deserialize the data. We can simply check for an
enableCompression flag in the simpleFields, and if it's true, we apply
compression. On deserialization we can check for the GZIP magic header
and, if it matches, automatically decompress the data.

The advantage of this is that we don't have to change the API of
ZNRecordSerializer or how it is set in various places. When a resource is
created, if compression is turned on, we set enableCompression=true in the
IdealState. This takes care of compressing the IdealState. We then have to
carry this flag into the CurrentState and ExternalView when they are
created. Carrying it over to the ExternalView is easy since the controller
creates it. For the CurrentState it's not straightforward, since it is
created by the participants and they don't read the IdealState. We can punt
on the CurrentState, hoping that its size is inversely proportional to the
number of nodes in the system; if there is a large number of partitions,
the number of nodes might also be large (this is not necessarily true). The
other option is to set enableCompression=true the first time the
CurrentState ZNode is created by the participant.

Let me know what you think.
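
For illustration, a minimal sketch of that serializer behavior (the method
names are mine, not the actual ZNRecordSerializer API; the GZIP magic header
is 0x1f 0x8b):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSketch {
  // Compress the serialized bytes before writing them to ZK.
  public static byte[] compress(byte[] raw) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    GZIPOutputStream gzip = new GZIPOutputStream(bos);
    gzip.write(raw);
    gzip.close();
    return bos.toByteArray();
  }

  // On read, look for the GZIP magic header; decompress only if present,
  // otherwise treat the bytes as plain (uncompressed) JSON.
  public static byte[] maybeDecompress(byte[] bytes) throws IOException {
    boolean gzipped = bytes.length >= 2
        && (bytes[0] & 0xff) == 0x1f && (bytes[1] & 0xff) == 0x8b;
    if (!gzipped) {
      return bytes;
    }
    GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(bytes));
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) > 0) {
      out.write(buf, 0, n);
    }
    in.close();
    return out.toByteArray();
  }
}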






On Sun, Mar 8, 2015 at 11:09 AM, Kanak Biscuitwala kana...@hotmail.com
wrote:

 I like this idea, but we would still need to support bucketizing either
 way because we cannot guarantee that the compressed version will be compact
 enough for every use case.

 What types of compression schemes are you planning to support?

 
  Date: Sat, 7 Mar 2015 22:30:15 -0800
  Subject: Use compression to store data in ZK
  From: g.kish...@gmail.com
  To: dev@helix.apache.org
 
  Hi,
 
  Currently we have bucketing as one of the options when the number of
  partitions is large. We have a couple of bugs with the handling of
  bucketized resources (one of them is fatal).
 
  One of the reasons to split the znode is that we use JSON to store the
  data in the ZNode. While JSON is good for debugging, it's space-inefficient.
 
  A better option before going to bucketing is to support compression of
  Ideal state, current state and External View. This also gives good
  performance.
 
  I plan to add this support and make it configurable. Feedback/suggestions
 
  thanks,
  Kishore G




Re: [jira] [Updated] (HELIX-571) Rack-aware rebalancer

2015-03-05 Thread kishore g
Is the idea to use the rack Id info in our rebalancers?

On Wed, Mar 4, 2015 at 1:46 PM, Zhen Zhang (JIRA) j...@apache.org wrote:


  [
 https://issues.apache.org/jira/browse/HELIX-571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

 Zhen Zhang updated HELIX-571:
 -
 Description: If rack information is available, we sometimes need to
 put replica of the same partition on different racks to avoid them sharing
 the same failure domain. This is a serious concern especially when rack has
 a single switch.  (was: If rack information is available, we sometimes need
 to put replica of the same partition different racks to avoid them sharing
 the same failure domain. This is a serious concern especially when rack has
 a single switch.)

  Rack-aware rebalancer
  -
 
  Key: HELIX-571
  URL: https://issues.apache.org/jira/browse/HELIX-571
  Project: Apache Helix
   Issue Type: Bug
 Reporter: Zhen Zhang
 
  If rack information is available, we sometimes need to put replica of
 the same partition on different racks to avoid them sharing the same
 failure domain. This is a serious concern especially when rack has a single
 switch.



 --
 This message was sent by Atlassian JIRA
 (v6.3.4#6332)



Re: question about state transition

2015-03-03 Thread kishore g
Hi Gavin,

I am not sure if I got your question. What ideal state mode are you using
(AUTO, SEMI AUTO, CUSTOM)?

Can you describe the scenario again, maybe using node ids N1, N2, N3?

thanks,
Kishore G

On Tue, Mar 3, 2015 at 12:40 AM, Gavin Li lyo.ga...@gmail.com wrote:

 +dev and commits

 -- Forwarded message --
 From: Gavin Li lyo.ga...@gmail.com
 Date: Tue, Mar 3, 2015 at 12:24 AM
 Subject: question about state transition
 To: u...@helix.apache.org, developm...@helix.apache.org


 Hi,

 We have each server handle some partitions, and we use Master Slave model.

 We need to do some work when transitioning from offline to slave, and
 sometimes it takes a long time. So while the server ranked higher in the
 ideal state is up and doing the work during its offline-to-slave
 transition, the other server is changed from master to slave.

 This causes a period of time in which there's no master at all, which is
 problematic. Is it possible to bring the other server down when the
 higher-ranked server is transitioning from slave to master, instead of
 when it transitions from offline to slave?

 Thanks,
 Gavin Li



Re: 0.7.2 release

2015-02-24 Thread kishore g
How do we know which things we have pushed to 0.6.x but not to 0.7?



On Tue, Feb 24, 2015 at 4:48 PM, Zhen Zhang zzh...@linkedin.com.invalid
wrote:

 +1

 As far as I can think of, here are a few things we need to complete before
 releasing 0.7.2
 - merge fixes that have been done on 0.6.x but haven't been ported to
 master yet
 - fix 1 test failure
 - fix any other critical bugs we have in 0.7

 - Jason

 
 From: Greg Brandt [brandt.g...@gmail.com]
 Sent: Tuesday, February 24, 2015 4:05 PM
 To: u...@helix.apache.org
 Cc: dev@helix.apache.org
 Subject: Re: 0.7.2 release


 Bump.

 What needs to be done for this?

 -Greg

 On Feb 18, 2015 9:02 PM, kishore g g.kish...@gmail.com wrote:

 +1
 The recent bug we found is critical and would love to see that in the
 release.

 Thanks
 Kishore G

 On Feb 18, 2015 1:17 PM, Greg Brandt brandt.g...@gmail.com wrote:
 Hey guys,

 Would it be possible to get an 0.7.2 release?

 I have a use case that uses helix-ipc, for which there were a few critical
 patches after 0.7.1 release.

 Thanks,
 -Greg



Re: 0.7.2 release

2015-02-18 Thread kishore g
+1
The recent bug we found is critical and would love to see that in the
release.

Thanks
Kishore G
On Feb 18, 2015 1:17 PM, Greg Brandt brandt.g...@gmail.com wrote:

 Hey guys,

 Would it be possible to get an 0.7.2 release?

 I have a use case that uses helix-ipc, for which there were a few critical
 patches after 0.7.1 release.

 Thanks,
 -Greg



Re: NPE during start up

2015-02-16 Thread kishore g
Is there any workaround for this, and is it fatal as Vlad mentioned?

On Mon, Feb 16, 2015 at 10:28 AM, Zhen Zhang nehzgn...@gmail.com wrote:

 There is a timing issue in ZkHelixParticipant#setupMsgHandler(). We should
 hook up ZK callback (line 347 in
 https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixParticipant.java)
 after all message handler registrations are done (line 354 in
 https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/manager/zk/ZkHelixParticipant.java).
 The fix is to move adding the ZK callback to the end. Will add a test case
 that can reliably reproduce this issue.
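
A sketch of the corrected ordering (illustrative only, not the actual
ZkHelixParticipant code):

import org.apache.helix.messaging.DefaultMessagingService;
import org.apache.helix.model.Message.MessageType;
import org.apache.helix.participant.StateMachineEngine;

public class HandlerRegistrationOrder {
  // Register every message handler factory first, and only then attach
  // the ZK message callback, so an incoming message can never hit an
  // unregistered handler type and trigger the NPE.
  public static void setup(DefaultMessagingService messagingService,
      StateMachineEngine engine) {
    messagingService.registerMessageHandlerFactory(
        MessageType.STATE_TRANSITION.toString(), engine);
    // ... register any other factories here ...
    // Only now hook up the ZK message listener (addMessageListener).
  }
}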

 Thanks,
 Zhen


 On Sun, Feb 15, 2015 at 11:45 PM, Zhen Zhang nehzgn...@gmail.com wrote:

 might be some race conditions. need to double check this.
 On Feb 15, 2015 11:38 PM, Steph Meslin-Weber st...@tangency.co.uk
 wrote:

 Hi Kishore,

 That's right, the node doesn't process any state transitions. They
 should have been logged in the first set of logs had they occurred.

 Thanks,
 Steph
 On 16 Feb 2015 07:28, kishore g g.kish...@gmail.com wrote:

 Hi Steph,

 When the NPE occurs, do you get the state transition callbacks?

 thanks,
 Kishore G



 On Sun, Feb 15, 2015 at 11:23 PM, Steph Meslin-Weber 
 st...@tangency.co.uk wrote:

 Unfortunately it appears that when the NPE occurs, dropping the
 participant no longer cleans up the related INSTANCE node. Perhaps some
 state is lost?

 Thanks,
 Steph
 On 16 Feb 2015 06:52, Zhen Zhang nehzgn...@gmail.com wrote:

 I think the NPE is not fatal. It happens when no message handler
 factory is registered for this message type. The message will not be
 removed and will remain in the UNREAD state. Later, when the message handler
 factory is registered via
 DefaultMessagingService#registerMessageHandlerFactory, we will send a
 NOP message, which will in turn trigger HelixTaskExecutor to process all
 UNREAD messages. We should definitely fix this by logging a warning message
 instead of throwing an NPE.

 Thanks,
 Jason
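
A toy model of the replay mechanism described above: a message with no
registered handler factory stays UNREAD, and registering the factory later
sends a NOP that makes the executor re-scan UNREAD messages. All names here
are illustrative, not the HelixTaskExecutor internals:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UnreadMessageReplay {
  interface Handler {
    void handle(String payload);
  }

  private final Map<String, Handler> factories = new HashMap<>();
  private final List<String[]> unread = new ArrayList<>(); // {type, payload}

  void onMessage(String type, String payload) {
    Handler h = factories.get(type);
    if (h == null) {
      // Log a warning and keep the message UNREAD instead of throwing an NPE.
      System.err.println("No handler factory for message type " + type);
      unread.add(new String[] {type, payload});
      return;
    }
    h.handle(payload);
  }

  void registerHandler(String type, Handler h) {
    factories.put(type, h);
    sendNopMessage(); // DefaultMessagingService does this via a NOP message
  }

  private void sendNopMessage() {
    List<String[]> pending = new ArrayList<>(unread);
    unread.clear();
    for (String[] m : pending) {
      onMessage(m[0], m[1]); // UNREAD messages get a second chance
    }
  }
}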


 On Sun, Feb 15, 2015 at 7:30 PM, kishore g g.kish...@gmail.com
 wrote:

 Controller assuming the state transition occurred is even more
 dangerous.





 On Sun, Feb 15, 2015 at 7:18 PM, vlad...@gmail.com 
 vlad...@gmail.com wrote:

 In my experience it was fatal. The callback would not be called, but the
 controller would somehow assume the state transition occurred.
 On Feb 15, 2015 7:13 PM, kishore g g.kish...@gmail.com wrote:
 On Feb 15, 2015 7:13 PM, kishore g g.kish...@gmail.com wrote:

  Thanks Vlad. That explains the problem. That also explains why adding a
  sleep of 3 seconds works.
 
  Jason, is this exception fatal? Will the message be processed again after
  the handler is added?
 
  thanks,
  Kishore G
 
  On Sun, Feb 15, 2015 at 6:41 PM, vlad...@gmail.com 
 vlad...@gmail.com
  wrote:
 
  https://issues.apache.org/jira/browse/HELIX-548
  On Feb 15, 2015 6:38 PM, kishore g g.kish...@gmail.com
 wrote:
 
   Hi Vlad,
  
   Was there any jira associated with it?
  
   thanks.
   Kishore G
  
   On Sun, Feb 15, 2015 at 4:36 PM, vlad...@gmail.com 
 vlad...@gmail.com
   wrote:
  
   Looks like the same problem we encountered recently.
  
   Regards,
   Vlad

NPE during start up

2015-02-15 Thread kishore g
Steph described this problem on IRC.

He is using 0.7.1. On connecting to cluster he gets this NPE

http://pastebin.com/YE3fwK5i

java.lang.NullPointerException
    at org.apache.helix.messaging.handling.HelixTaskExecutor.createMessageHandler(HelixTaskExecutor.java:661)
    at org.apache.helix.messaging.handling.HelixTaskExecutor.onMessage(HelixTaskExecutor.java:581)
    at org.apache.helix.manager.zk.ZkCallbackHandler.invoke(ZkCallbackHandler.java:202)
    at org.apache.helix.manager.zk.ZkCallbackHandler.init(ZkCallbackHandler.java:336)
    at org.apache.helix.manager.zk.ZkCallbackHandler.init(ZkCallbackHandler.java:130)
    at org.apache.helix.manager.zk.ZkHelixConnection.addListener(ZkHelixConnection.java:533)
    at org.apache.helix.manager.zk.ZkHelixConnection.addMessageListener(ZkHelixConnection.java:267)
    at org.apache.helix.manager.zk.ZkHelixParticipant.setupMsgHandler(ZkHelixParticipant.java:347)
    at org.apache.helix.manager.zk.ZkHelixParticipant.init(ZkHelixParticipant.java:383)
    at org.apache.helix.manager.zk.ZkHelixParticipant.onConnected(ZkHelixParticipant.java:401)
    at org.apache.helix.manager.zk.ZkHelixParticipant.start(ZkHelixParticipant.java:428)
    at com.example.ProtostuffServerNode.spinUpParticipant(ProtostuffServerNode.java:134)


Here is his connection code.

http://pastebin.com/QRfVU1tc

private static HelixParticipant spinUpParticipant(HelixAdmin admin,
    ParticipantId participantId) {
  LOGGER.info("Starting up " + participantId);
  HelixConnection connection = new ZkHelixConnection(ZK_ADDRESS);
  connection.connect();
  HelixParticipant participant =
      connection.createParticipant(CLUSTER_ID, participantId);
  StateMachineEngine stateMach = participant.getStateMachineEngine();

  StateTransitionHandlerFactory<LocalTransitionHandler> transitionHandlerFactory =
      new OnlineOfflineHandlerFactory();
  stateMach.registerStateModelFactory(STATE_MODEL_NAME, transitionHandlerFactory);
  participant.start();

  admin.enableInstance(CLUSTER_NAME, participantId.toString(), true);

  return participant;
}

Adding a 3-second sleep after registerStateModelFactory works. Any idea what is
happening?

thanks,
Kishore G


Re: NPE during start up

2015-02-15 Thread kishore g
Controller assuming the state transition occurred is even more dangerous.





On Sun, Feb 15, 2015 at 7:18 PM, vlad...@gmail.com vlad...@gmail.com
wrote:

 In my experience it was fatal. The callback would not be called, but the
 controller would somehow assume the state transition occurred.
 On Feb 15, 2015 7:13 PM, kishore g g.kish...@gmail.com wrote:

  Thanks Vlad. That explains the problem. That also explains why adding a
  sleep of 3 seconds works.
 
  Jason, is this exception fatal? Will the message be processed again
  after the handler is added?
 
  thanks,
  Kishore G
 
  On Sun, Feb 15, 2015 at 6:41 PM, vlad...@gmail.com vlad...@gmail.com
  wrote:
 
  https://issues.apache.org/jira/browse/HELIX-548
  On Feb 15, 2015 6:38 PM, kishore g g.kish...@gmail.com wrote:
 
   Hi Vlad,
  
   Was there any jira associated with it?
  
   thanks.
   Kishore G
  
   On Sun, Feb 15, 2015 at 4:36 PM, vlad...@gmail.com vlad...@gmail.com
 
   wrote:
  
   Looks like the same problem we encountered recently.
  
   Regards,
   Vlad


Re: Not using ZK as metadata store

2015-02-05 Thread kishore g
I like the idea in general. Let's try to define the scalability problems of
ZK:

   - the amount of data stored is limited (memory)
   - the number of paths where watches can be set is limited (I think it
   should fit within the TCP max size)

We can also define the limits in terms of Helix nodes/resources etc.


On the flip side, using another KV store means we need the KV store to
provide the same properties as ZooKeeper, i.e., data consistency, replication, etc.

thanks,
Kishore G


On Thu, Feb 5, 2015 at 1:41 PM, Zhen Zhang nehzgn...@gmail.com wrote:

 Hi,

 As Helix is going to support bigger clusters with more nodes, more
 resources, and more partitions, using ZK as the metadata store does not seem
 scalable. This is especially the case for ideal states and external views,
 where Helix stores all partition states and mappings. Alternatively, we could
 use, for example, a key-value store for ideal states and external views,
 while leaving minimal information on ZK, say a URL pointing to the metadata
 store. Helix listeners would still get ZK callbacks on all state changes, then
 read the URL and retrieve the actual data from the more scalable metadata
 store. Any thoughts?

 Thanks,
 Jason
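
A minimal sketch of the proposed indirection, assuming a hypothetical
KeyValueStore client and made-up paths: the ZNode holds only a small pointer,
so every update still fires a ZK watch, while the bulky payload lives in the
external store:

import java.util.function.Consumer;

// Hypothetical interfaces standing in for ZK access and the external KV store.
interface ZkPointerStore {
  void write(String znodePath, String pointer); // small payload only
  void subscribe(String znodePath, Consumer<String> onChange);
}

interface KeyValueStore {
  byte[] get(String key);
  void put(String key, byte[] value);
}

class IndirectMetadataAccessor {
  private final ZkPointerStore zk;
  private final KeyValueStore kv;

  IndirectMetadataAccessor(ZkPointerStore zk, KeyValueStore kv) {
    this.zk = zk;
    this.kv = kv;
  }

  // Writer: put the bulky external view in the KV store and leave only a
  // versioned key in ZK, so every update still fires a ZK watch.
  void writeExternalView(String resource, byte[] externalView, long version) {
    String kvKey = "externalView/" + resource + "/" + version;
    kv.put(kvKey, externalView);
    zk.write("/MYCLUSTER/EXTERNALVIEW/" + resource, kvKey);
  }

  // Reader: the ZK callback delivers only the pointer; the data comes from KV.
  void watchExternalView(String resource, Consumer<byte[]> onView) {
    zk.subscribe("/MYCLUSTER/EXTERNALVIEW/" + resource,
        kvKey -> onView.accept(kv.get(kvKey)));
  }
}

One trade-off worth noting: the KV store must be at least as available and
consistent as the ZNode it replaces, which is exactly the concern raised above.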



Re: [jira] [Commented] (HELIX-561) Participant receive same transition twice

2014-12-16 Thread kishore g
Is it possible to completely avoid this scenario in 0.7.1+?

On Tue, Dec 16, 2014 at 10:19 AM, Zhen Zhang (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/HELIX-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248601#comment-14248601
 ]

 Zhen Zhang commented on HELIX-561:
 --

 [~vinayakb] are you using the same partition name for different resources? We
 recently fixed a bug (see https://issues.apache.org/jira/browse/HELIX-552)
 regarding this. Previously a state model was indexed by partition id only,
 and the assumption was that we use a naming convention like
 resourceName_partitionId. If you are using the same partition name for
 different resources, then it is possible that the same state model is
 invoked for different resources.
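
A toy illustration of the fix described above: keying state models by
(resource, partition) instead of partition alone, so a partition named "0" in
one resource can no longer collide with a same-named partition of another.
The Key class here is illustrative, not the actual Helix code:

import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

class StateModelIndex<T> {
  static final class Key {
    final String resource;
    final String partition;

    Key(String resource, String partition) {
      this.resource = resource;
      this.partition = partition;
    }

    @Override public boolean equals(Object o) {
      return o instanceof Key && ((Key) o).resource.equals(resource)
          && ((Key) o).partition.equals(partition);
    }

    @Override public int hashCode() {
      return Objects.hash(resource, partition);
    }
  }

  private final Map<Key, T> models = new ConcurrentHashMap<>();

  // One state model per (resource, partition) pair; a partition-only index
  // would hand back the same model for two different resources.
  T getOrCreate(String resource, String partition,
      java.util.function.Supplier<T> factory) {
    return models.computeIfAbsent(new Key(resource, partition), k -> factory.get());
  }
}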

  Participant receive same transition twice
  -
 
  Key: HELIX-561
  URL: https://issues.apache.org/jira/browse/HELIX-561
  Project: Apache Helix
   Issue Type: Bug
 Reporter: Zhen Zhang
 Assignee: Zhen Zhang
 
  A user reports that when upgrading from 0.6.x to 0.7.1, a participant
 receives the OFFLINE->SLAVE transition twice for a partition.
  Can't reproduce it with the following test case:
  {noformat}
  package org.apache.helix;

  import java.util.Date;
  import org.apache.helix.api.StateTransitionHandlerFactory;
  import org.apache.helix.api.TransitionHandler;
  import org.apache.helix.api.id.PartitionId;
  import org.apache.helix.api.id.ResourceId;
  import org.apache.helix.api.id.StateModelDefId;
  import org.apache.helix.manager.zk.MockController;
  import org.apache.helix.model.Message;
  import org.apache.helix.participant.statemachine.Transition;
  import org.apache.helix.testutil.TestUtil;
  import org.apache.helix.testutil.ZkTestBase;
  import org.testng.annotations.Test;

  public class AppTest extends ZkTestBase {
    @Test
    public void test() throws Exception {
      String clusterName = TestUtil.getTestName();
      int n = 2;
      System.out.println("START " + clusterName + " at " + new Date(System.currentTimeMillis()));
      TestHelper.setupCluster(clusterName, _zkaddr, 12918, // participant port
          "localhost", // participant name prefix
          "TestDB", // resource name prefix
          1, // resources
          2, // partitions per resource
          n, // number of nodes
          2, // replicas
          "MasterSlave", true); // do rebalance
      MockController controller = new MockController(_zkaddr, clusterName, "controller");
      controller.syncStart();
      String id = "localhost_12918";
      StateModelDefId masterSlave = StateModelDefId.from("MasterSlave");
      HelixManager hManager =
          HelixManagerFactory.getZKHelixManager(clusterName, id, InstanceType.PARTICIPANT, _zkaddr);
      hManager.getStateMachineEngine().registerStateModelFactory(masterSlave,
          new MasterSlaveStateModelFactory());
      hManager.connect();
      System.out.println("END " + clusterName + " at " + new Date(System.currentTimeMillis()));
    }

    class MasterSlaveStateModelFactory extends StateTransitionHandlerFactory<MasterSlaveStateModel> {
      public MasterSlaveStateModelFactory() {
      }

      @Override
      public MasterSlaveStateModel createStateTransitionHandler(ResourceId resourceId,
          PartitionId partitionId) {
        return new MasterSlaveStateModel();
      }
    }

    public class MasterSlaveStateModel extends TransitionHandler {
      @Transition(to = "SLAVE", from = "OFFLINE")
      public void onBecomeSlaveFromOffline(Message message, NotificationContext context)
          throws Exception {
        System.out.println(message.getPartitionName() + ": OFFLINE->SLAVE");
      }

      @Transition(to = "MASTER", from = "SLAVE")
      public void onBecomeMasterFromSlave(Message message, NotificationContext context)
          throws Exception {
      }

      @Transition(to = "SLAVE", from = "MASTER")
      public void onBecomeSlaveFromMaster(Message message, NotificationContext context)
          throws Exception {
      }

      @Transition(to = "OFFLINE", from = "SLAVE")
      public void onBecomeOfflineFromSlave(Message message, NotificationContext context)
          throws Exception {
      }

      @Transition(to = "DROPPED", from = "OFFLINE")
      public void onBecomeDroppedFromOffline(Message message, NotificationContext context)
          throws Exception {
      }

      @Transition(to = "OFFLINE", from = "ERROR")
      public void onBecomeOfflineFromError(Message message, NotificationContext context)
          throws Exception {
      }
    }
  }
  {noformat}



 --
 This message was sent by Atlassian JIRA
 (v6.3.4#6332)



Re: [jira] [Commented] (HELIX-561) Participant receive same transition twice

2014-12-16 Thread kishore g
Let's do that.

On Tue, Dec 16, 2014 at 11:04 AM, Zhen Zhang nehzgn...@gmail.com wrote:

 If we are using both resourceName and partitionId for indexing a state
 model, this problem should be fixed.

 On Tue, Dec 16, 2014 at 10:25 AM, kishore g g.kish...@gmail.com wrote:
 
  Is it possible to completely avoid this scenario in 0.7.1+?
 

Re: [GitHub] helix pull request: adding python helix admin library

2014-12-07 Thread kishore g
can we put the docs as README.md in this folder?

On Sun, Dec 7, 2014 at 7:03 PM, kanakb g...@git.apache.org wrote:

 Github user kanakb commented on the pull request:

 https://github.com/apache/helix/pull/14#issuecomment-65970838

 LGTM, merged





Re: [jira] [Created] (HELIX-557) Specifying cluster zookeeper path root

2014-11-25 Thread kishore g
You can provide the root path as part of the ZooKeeper connection string,
e.g. host:2181/my/root/path.

All nodes will be created under /my/root/path.

Thanks
Kishore G
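
For example, a sketch of connecting through a chrooted address; the cluster
and instance names are placeholders:

import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;

public class ChrootedConnection {
  public static void main(String[] args) throws Exception {
    // Everything Helix creates will live under /my/root/path on this ensemble.
    String zkAddr = "host:2181/my/root/path";
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        "MYCLUSTER", "instance_1", InstanceType.ADMINISTRATOR, zkAddr);
    manager.connect();
    try {
      // admin operations here see the chrooted tree transparently
    } finally {
      manager.disconnect();
    }
  }
}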
On Nov 25, 2014 3:48 AM, Robert Cross (JIRA) j...@apache.org wrote:

 Robert Cross created HELIX-557:
 --

  Summary: Specifying cluster zookeeper path root
  Key: HELIX-557
  URL: https://issues.apache.org/jira/browse/HELIX-557
  Project: Apache Helix
   Issue Type: Improvement
   Components: helix-core
 Affects Versions: 0.6.4, 0.7.1
 Reporter: Robert Cross


 Hi,

 I would like to be able to specify a root path for the znode structure
 that helix creates for cluster management. From what I can see, it is
 pretty hard coded to be the core root of the zookeeper tree.

 Is this possible?

 Thanks
 Rob



 --
 This message was sent by Atlassian JIRA
 (v6.3.4#6332)



Re: [jira] [Created] (HELIX-531) Extend controller to support multi-dimensional mapping

2014-10-21 Thread kishore g
Love this. This is exciting!

On Tue, Oct 21, 2014 at 11:14 AM, Zhen Zhang (JIRA) j...@apache.org wrote:

 Zhen Zhang created HELIX-531:
 

  Summary: Extend controller to support multi-dimensional
 mapping
  Key: HELIX-531
  URL: https://issues.apache.org/jira/browse/HELIX-531
  Project: Apache Helix
   Issue Type: Sub-task
 Reporter: Zhen Zhang


 Instead of calculating best-possible mappings on a single dimension,
 controller pipeline should be able to calculate best-possible mappings on
 multiple dimensions given state-machines on each dimension and dependency
 among different dimensions.



 --
 This message was sent by Atlassian JIRA
 (v6.3.4#6332)



Writing custom rebalancer

2014-10-10 Thread kishore g
Hi,

Even though we have a couple of ways of writing a custom rebalancer (one on
the participant side and another on the controller side), I don't think it's
trivial for someone to write one without understanding all the internal
details of Helix.

I am starting this thread to see if others have any thoughts on making it
easier for someone to get started and write their own rebalancer as part of
the quick start.

thanks,
Kishore G


Re: [jira] [Commented] (HELIX-524) add getProgress() to Task interface

2014-10-09 Thread kishore g
Yes, that's the idea. With helix-ipc this will be possible, right? This can be
extended to write rebalancers that talk to nodes to get the high-water mark
to decide the new master. What do you think?

On Thu, Oct 9, 2014 at 3:28 PM, ASF GitHub Bot (JIRA) j...@apache.org
wrote:


 [
 https://issues.apache.org/jira/browse/HELIX-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165886#comment-14165886
 ]

 ASF GitHub Bot commented on HELIX-524:
 --

 Github user zzhang5 commented on the pull request:

 https://github.com/apache/helix/pull/6#issuecomment-58587902

 If we are not using ZK to invoke these methods, are we opening some
 kind of end-point e.g. via Netty or JMX on each participant?




  add getProgress() to Task interface
  ---
 
  Key: HELIX-524
  URL: https://issues.apache.org/jira/browse/HELIX-524
  Project: Apache Helix
   Issue Type: Improvement
   Components: helix-core
 Affects Versions: 0.6.4
 Reporter: Hongbo Zeng
  Fix For: 0.6.5
 
 
  Add a getProgress() to the Task interface. This is very helpful for
 long-running tasks: from it we can know the status of a task and see if it is
 blocked. The return value is a double ranging from 0 to 1.0, where 1.0
 indicates the task is finished.
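
A sketch of what the extended interface might look like; the TaskResult
stand-in and the exact signature are assumptions, not the committed API:

// Hypothetical extension of the task interface described in HELIX-524.
public interface ProgressReportingTask {
  // Existing-style contract: run the work and report a result.
  TaskResult run();

  void cancel();

  // Proposed addition: fraction of work completed, in [0.0, 1.0];
  // 1.0 means the task is finished. Long-running tasks expose this so an
  // operator can tell a slow task from a blocked one.
  double getProgress();

  // Minimal stand-in for the real task result type.
  final class TaskResult {
    public final boolean success;

    public TaskResult(boolean success) {
      this.success = success;
    }
  }
}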



 --
 This message was sent by Atlassian JIRA
 (v6.3.4#6332)



Helix article

2014-10-06 Thread kishore g
FYI

http://thenewstack.io/helix-a-linkedin-framework-for-distributed-systems-development/

Thanks to Kanak, Sandeep, and Kapil for all the help.

cheers,
Kishore G


Re: [jira] [Commented] (HELIX-504) Controller should avoid resetting watches on removed paths

2014-09-23 Thread kishore g
Can we restart the controller as a workaround?
On Sep 23, 2014 10:11 AM, Zhen Zhang (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/HELIX-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145049#comment-14145049
 ]

 Zhen Zhang commented on HELIX-504:
 --

 On investigating zookeeper watch leakage problem, it turns out to be a
 zookeeper issue:
 https://issues.apache.org/jira/browse/ZOOKEEPER-442

 For ZooKeeper before 3.5.0, we can't remove watches that are no longer of
 interest. The only way to remove a watch is to trigger it; that is, if it
 is a DataWatch, we need to trigger a data change on the watched path, or
 if it is a ChildWatch, we need to trigger a child change on the watched
 path. Unfortunately, if we are watching a path that has been deleted,
 unless we re-create the path, there is no way we can remove the watch.

 Here are some of the most common scenarios where we will have dead
 ZooKeeper watches on the server side even though we unregister all
 the listeners on the client side:

 - When we drop a resource group from a cluster, we may have dead watches
 on ideal-state, participant current-state, and external-view
 - When we remove an instance from a cluster, we may have dead watches on
 current-state, participant-config, and participant messages
 - When we use property store with caches enabled by zookeeper watches, we
 may have dead watches on all removed paths

 Will create separate jiras to track the issue.
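
A sketch of the only pre-3.5.0 escape hatch mentioned above: make the watch
fire by touching the path. It uses the I0Itec ZkClient API bundled with Helix;
error handling is omitted, and the approach only works while the path exists:

import org.I0Itec.zkclient.ZkClient;

public class StaleWatchTrigger {
  // Before ZooKeeper 3.5.0 there is no API to remove a watch, so the only way
  // to clear a one-shot data watch is to make it fire: write to the path. If
  // the path was already deleted, the watch can only fire if the path is
  // re-created.
  public static void triggerDataWatch(ZkClient zkClient, String path) {
    if (zkClient.exists(path)) {
      Object data = zkClient.readData(path);
      zkClient.writeData(path, data); // a no-op write still fires the data watch
    }
    // else: dead watch; it stays on the server until the session closes
    // or the path is re-created.
  }
}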

  Controller should avoid resetting watches on removed paths
  --
 
  Key: HELIX-504
  URL: https://issues.apache.org/jira/browse/HELIX-504
  Project: Apache Helix
   Issue Type: Bug
 Reporter: Kanak Biscuitwala
 Assignee: Zhen Zhang
 
  Right now, if a participant has a session change, the old current state
 will be removed, but the controller, despite unregistering the listener,
 still resets the watch on the path-deleted event. We should avoid resetting
 this watch on delete and let the controller explicitly set watches/listeners
 on new sessions.



 --
 This message was sent by Atlassian JIRA
 (v6.3.4#6332)



Re: [jira] [Commented] (HELIX-504) Controller should avoid resetting watches on removed paths

2014-09-23 Thread kishore g
Jason, correct me if I am wrong. We are already releasing the watches; the
problem is with the ZooKeeper client.

After a path is deleted, we cannot remove the watch. This is a ZooKeeper
bug and is fixed in the latest version.

Thanks
Kishore G
On Sep 23, 2014 11:03 AM, Sandeep Nayak osgig...@gmail.com wrote:

 Why not start tracking all the watches created by extending the watch
 with our own implementation which can have shutdown ability? That way
 we can kill all watches when we find them stale. Is the concern that
 there are too many of them?

 On Tue, Sep 23, 2014 at 10:49 AM, Zhen Zhang nehzgn...@gmail.com wrote:
  It will remove the dead watches associated with the controller, but not the
  dead watches associated with the participant and spectator.
 
 
  On Tue, Sep 23, 2014 at 10:39 AM, kishore g g.kish...@gmail.com wrote:
 
  Can we restart the controller as a workaround?
 



meetup

2014-09-22 Thread kishore g
Hi,

Just want to gauge the interest in having a meetup. We have a few new things
we can discuss apart from questions:
-- Task management framework
-- Helix IPC

Please let us know if this will be useful.

thanks
Kishore G


Re: [GitHub] helix pull request: refactored Netty IPC code into separate classe...

2014-09-16 Thread kishore g
Do we have anything documented on how to merge this pull request? It would be
nice to have a script that does some validation.

thanks,
Kishore G

On Tue, Sep 16, 2014 at 9:49 AM, brandtg g...@git.apache.org wrote:

 GitHub user brandtg opened a pull request:

 https://github.com/apache/helix/pull/5

 refactored Netty IPC code into separate classes



 You can merge this pull request into a Git repository by running:

 $ git pull https://github.com/brandtg/helix master

 Alternatively you can review and apply these changes as the patch at:

 https://github.com/apache/helix/pull/5.patch

 To close this pull request, make a commit to your master/trunk branch
 with (at least) the following in the commit message:

 This closes #5

 
 commit 21ecdc7c5ef7f4e2ad7a98239b5d70164ee83505
 Author: Greg Brandt brandt.g...@gmail.com
 Date:   2014-09-16T16:48:14Z

 refactored Netty IPC code into separte classes

 




