I tried to build master (1.3.0-SNAPSHOT) with the ZooKeeper dependency updated to version 3.4.10, but the build fails with a compilation error:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.2:compile (default-compile) on project nifi-framework-core: Compilation failure
[ERROR] /nifi/nifi-nar/bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/state/server/ZooKeeperStateServer.java: [106,25] error: no suitable constructor found for QuorumPeer(no arguments)

On Tue, May 30, 2017 at 11:33 PM, Joe Witt <[email protected]> wrote:
> Just scanning through the items currently on master that would show up
> in the 1.3.0 release we see numerous cluster related bug fixes.
>
> More consistent port alignment across cluster
> https://issues.apache.org/jira/browse/NIFI-3981
>
> Ensure controller service lifecycle handled better with different
> timing/dependencies
> https://issues.apache.org/jira/browse/NIFI-3972
>
> Insufficient heartbeat handling causing improper clustering behavior
> https://issues.apache.org/jira/browse/NIFI-3933
>
> Improve timing of component startup relative to other lifecycle items
> when clustered
> https://issues.apache.org/jira/browse/NIFI-3923
>
> Inconsistent scheduled state in some cluster settings
> https://issues.apache.org/jira/browse/NIFI-3900
>
> Improved fingerprinted/non-fingerprinted settings enforcement and
> handling in clusters
> https://issues.apache.org/jira/browse/NIFI-1963
>
> These are nifi specific cluster behavior things.  For nifi and
> zookeeper interaction specifically most of the focus thus far has been
> about NiFi itself as the above JIRAs show and also of course the cases
> where a given system that is so resource contended will simply not
> have a nice embedded ZK/nifi experience.
>
> MarkB, your testing above suggests you were using a nifi 1.x which
> means a zookeeper 3.4.6 client against a Zookeeper 3.4.10 server
> cluster and behavior was much better.
> Could you possibly run the same
> cluster evaluation against the latest master but with an embedded
> zookeeper 3.4.10 version in nifi (which means both server and client
> are on latest zk 3.4.10 release)?  This would be helpful data.
> Assuming that goes well the only other concern that jumps to mind is
> if us using a zookeeper 3.4.10 client presents problems for us talking
> to older server versions (still 3.4 though so probably ok, i'd hope).
> In general we should be safe thanks to classloader isolation but we've
> seen some pretty magical JVM/system classloader level changes happen
> for Kerberized environments.
>
> Thanks
> Joe
>
>
>
> On Tue, May 30, 2017 at 3:21 PM, Juan Sequeiros <[email protected]>
> wrote:
> > Hello all,
> >
> > I'd like to chime in on this interesting discussion thread.
> >
> > I'd like to add that my system(s) too have seen unstable ZK interaction
> > with both embedded and eventually external ZK (granted, external has
> > been better). We have resolved them with NIFI restarts. And it's to the
> > point that we are hesitant to roll up to NIFI 1.X mainly because of
> > this (we have DEV NIFI 1.X).
> >
> > I also would like to add that we are greatly anticipating ZK release
> > 3.5.X for its TLS implementation, and as such have not voiced our
> > experience with NIFI / ZOOKEEPER, assuming that once ZOOKEEPER 3.5.X is
> > out of ALPHA it would be added in to the NIFI NAR framework fairly fast
> > and fix the oddities.
> >
> > I would say though that we have been hoping for a newer client on the
> > NIFI ZK side since the current one suggests it's based off 3.4.6
> > ZOOKEEPER, which was released in *MAR 2014*.
> >
> > # jar tf nifi-framework-nar-1.1.1.nar | grep zoo
> > META-INF/bundled-dependencies/zookeeper-3.4.6.jar
> >
> > And now I wonder how long it would take for NIFI to release a client
> > based off 3.5.X once it goes official, given the hesitation on forward
> > compatibility.
> >
> >
> > On Tue, May 30, 2017 at 2:52 PM Jeff <[email protected]> wrote:
> >
> >> Joe,
> >>
> >> My own direct and indirect experiences with NiFi 1.x clustering have
> >> been good for both embedded and external zookeeper but we have
> >> certainly seen some emails on mailing-list about it.  Those have been
> >> for high load case where the embedded approach would be susceptible to
> >> timing issues and resolved by using an external system.  Mark Bean's
> >> report is interesting though since it happens under no real load at
> >> all.
> >>
> >> I suspect ZOOKEEPER-2044 will help that though there are several
> >> comments [1] (and others on that JIRA) that describe the issue as
> >> minor/false reporting/cosmetic/an improvement.  Updating to ZooKeeper
> >> 3.4.10 suggests that this rare issue can be resolved in NiFi, but we'll
> >> have to do our due diligence to make sure that no new issues are raised
> >> with the upgrade for NiFi or its ability to interface with external
> >> systems.  We'll have to do testing with other dependencies that use
> >> ZooKeeper 3.4.6 to ensure that forward compatibility.
> >>
> >> [1]
> >> https://issues.apache.org/jira/browse/ZOOKEEPER-2044?focusedCommentId=15024616&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15024616
> >>
> >> Thanks,
> >> Jeff
> >>
> >> On Tue, May 30, 2017 at 1:15 PM Joe Skora <[email protected]> wrote:
> >>
> >> > Jeff,
> >> >
> >> > If I understand the issue correctly, this means NiFi 1.x has always
> >> > been broken for clustering with an embedded ZooKeeper.  That has never
> >> > been communicated until now; we clearly build for and explain how to
> >> > use an embedded ZooKeeper in documentation.
> >> >
> >> > Any external non-NiFi elements that are considered in design and
> >> > dependency decisions need to be clearly understood by the entire
> >> > community.  What things non-NiFi are you thinking of that drive
> >> > ZooKeeper dependencies?
> >> >
> >> > Joe
> >> >
> >> > On Tue, May 30, 2017 at 9:11 AM, Jeff <[email protected]> wrote:
> >> >
> >> > > Mark, we can certainly take smaller steps rather than waiting for
> >> > > 3.5.2/3.6.0 to come out.  I was just bringing that JIRA up as another
> >> > > scenario that entices us to upgrade.
> >> > >
> >> > > Joe, I'm referring to NiFi, the toolkit, and things non-NiFi that
> >> > > provide a ZK server to which NiFi or the ZK Migration Toolkit are
> >> > > clients.  I'm not saying we can't or shouldn't upgrade, but we do need
> >> > > to test to make sure that no issues are introduced by NiFi shipping
> >> > > with ZK 3.4.10.  Being that it's a bugfix version change, it's
> >> > > probably fine.
> >> > >
> >> > > - Jeff
> >> > >
> >> > > On Tue, May 30, 2017 at 10:46 AM Joe Skora <[email protected]> wrote:
> >> > >
> >> > > > Jeff,
> >> > > >
> >> > > > Does that mean NiFi 1.x will be unstable when using embedded
> >> > > > ZooKeeper until the ZK version is upgraded?
> >> > > >
> >> > > > By "components outside of NiFi" do you mean the NiFi toolkit and
> >> > > > other parts of the NiFi release?
> >> > > >
> >> > > > Joe
> >> > > >
> >> > > > On Tue, May 30, 2017 at 5:42 AM, Jeff <[email protected]> wrote:
> >> > > >
> >> > > > > Mark,
> >> > > > >
> >> > > > > I did report a JIRA [1] for upgrading to 3.5.2 or 3.6.0 (just due
> >> > > > > to log4j issues) once it's out and stable.  There are issues with
> >> > > > > the way that ZK refers to log4j classes in the code that cause
> >> > > > > issues for NiFi and our Toolkit.  However there has been some back
> >> > > > > and forth [2] (in 3.4.0, which doesn't fix the issue, but moves
> >> > > > > towards fixing it), [3], and [4] on the changes being implemented
> >> > > > > in versions 3.5.2 and 3.6.0.  Also, it looks like ZK 3.6.0 is
> >> > > > > headed toward using log4j 2 [5].
> >> > > > >
> >> > > > > There are many components outside of NiFi that are still using ZK
> >> > > > > 3.4.6, so it may be a while before we can move to 3.4.10.  I don't
> >> > > > > currently know anything about the forward compatibility of 3.4.6.
> >> > > > > Are there improvements/fixes in 3.4.10 which you need?
> >> > > > >
> >> > > > > [1] https://issues.apache.org/jira/browse/NIFI-3067
> >> > > > > [2] https://issues.apache.org/jira/browse/ZOOKEEPER-850
> >> > > > > [3] https://issues.apache.org/jira/browse/ZOOKEEPER-1371
> >> > > > > [4] https://issues.apache.org/jira/browse/ZOOKEEPER-2393
> >> > > > > [5] https://issues.apache.org/jira/browse/ZOOKEEPER-2342
> >> > > > >
> >> > > > > - Jeff
> >> > > > >
> >> > > > > On Tue, May 30, 2017 at 8:15 AM Mark Bean <[email protected]> wrote:
> >> > > > >
> >> > > > > > Updated to external ZooKeeper last Friday. Over the weekend,
> >> > > > > > there are no reports of SUSPENDED or RECONNECTED.
> >> > > > > >
> >> > > > > > Are there plans to upgrade the embedded ZooKeeper to the latest
> >> > > > > > version, 3.4.10?
> >> > > > > >
> >> > > > > > Thanks,
> >> > > > > > Mark
> >> > > > > >
> >> > > > > > On Thu, May 25, 2017 at 11:56 AM, Joe Witt <[email protected]> wrote:
> >> > > > > >
> >> > > > > > > looked at a secured cluster and the send times are routinely at
> >> > > > > > > 100ms similar to yours.  I think what i was flagging as
> >> > > > > > > potentially interesting is not interesting at all.
> >> > > > > > >
> >> > > > > > > On Thu, May 25, 2017 at 11:34 AM, Joe Witt <[email protected]> wrote:
> >> > > > > > > > Ok.  Well as a point of comparison i'm looking at heartbeat
> >> > > > > > > > logs from another cluster and the times are consistently 1-3
> >> > > > > > > > millis for the send.  Yours above show 100+ms typical with one
> >> > > > > > > > north of 900ms.
> >> > > > > > > > Not sure how relevant that is but something i noticed.
> >> > > > > > > >
> >> > > > > > > > On Thu, May 25, 2017 at 11:29 AM, Mark Bean <[email protected]> wrote:
> >> > > > > > > >> ping shows acceptably fast response time between servers,
> >> > > > > > > >> approximately 0.100-0.150 ms
> >> > > > > > > >>
> >> > > > > > > >>
> >> > > > > > > >> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <[email protected]> wrote:
> >> > > > > > > >>
> >> > > > > > > >>> have you evaluated latency across the machines in your
> >> > > > > > > >>> cluster?  I ask because 122ms is pretty long and 917ms is
> >> > > > > > > >>> very long.  Are these nodes across a WAN link?
> >> > > > > > > >>>
> >> > > > > > > >>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <[email protected]> wrote:
> >> > > > > > > >>> > Update: now all 5 nodes, regardless of ZK server, are
> >> > > > > > > >>> > indicating SUSPENDED -> RECONNECTED.
> >> > > > > > > >>> >
> >> > > > > > > >>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <[email protected]> wrote:
> >> > > > > > > >>> >
> >> > > > > > > >>> >> I reduced the number of embedded ZooKeeper servers on the
> >> > > > > > > >>> >> 5-Node NiFi Cluster from 5 to 3. This has improved the
> >> > > > > > > >>> >> situation. I do not see any of the three Nodes which are
> >> > > > > > > >>> >> also ZK servers disconnecting/reconnecting to the cluster
> >> > > > > > > >>> >> as before. However, the two Nodes which are not running ZK
> >> > > > > > > >>> >> continue to disconnect and reconnect.
> >> > > > > > > >>> >> The following is taken from one of the non-ZK Nodes. It's
> >> > > > > > > >>> >> curious that some messages are issued twice from the same
> >> > > > > > > >>> >> thread, but reference a different object.
> >> > > > > > > >>> >>
> >> > > > > > > >>> >> nifi-app.log
> >> > > > > > > >>> >> 2017-05-25 13:40:01,628 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> >> > > > > > > >>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at 2017-05-25 13:39:45,627; send took 122 millis
> >> > > > > > > >>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at 2017-05-25 13:39:50,862; send took 122 millis
> >> > > > > > > >>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at 2017-05-25 13:39:56,089; send took 129 millis
> >> > > > > > > >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2 Connection State changed to SUSPENDED
> >> > > > > > > >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd Connection State changed to SUSPENDED
> >> > > > > > > >>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> >> > > > > > > >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2 Connection State changed to RECONNECTED
> >> > > > > > > >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd Connection State changed to RECONNECTED
> >> > > > > > > >>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at 2017-05-25 13:40:02,550; send took 917 millis
> >> > > > > > > >>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at 2017-05-25 13:40:07,787; send took 129 millis
> >> > > > > > > >>> >>
> >> > > > > > > >>> >> I will work on setting up an external ZK next, but would
> >> > > > > > > >>> >> still like some insight to what is being observed with the
> >> > > > > > > >>> >> embedded ZK.
> >> > > > > > > >>> >>
> >> > > > > > > >>> >> Thanks,
> >> > > > > > > >>> >> Mark
> >> > > > > > > >>> >>
> >> > > > > > > >>> >>
> >> > > > > > > >>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <[email protected]> wrote:
> >> > > > > > > >>> >>
> >> > > > > > > >>> >>> Yes, we are using the embedded ZK. We will try
> >> > > > > > > >>> >>> instantiating an external ZK and see if that resolves the
> >> > > > > > > >>> >>> problem.
> >> > > > > > > >>> >>>
> >> > > > > > > >>> >>> The load on the system is extremely small. Currently (as
> >> > > > > > > >>> >>> Nodes are disconnecting/reconnecting) all input ports to
> >> > > > > > > >>> >>> the flow are turned off. The only data in the flow is from
> >> > > > > > > >>> >>> a single GenerateFlowFile generating 5B every 30 secs.
> >> > > > > > > >>> >>>
> >> > > > > > > >>> >>> Also, it is a 5-node cluster with embedded ZK on each
> >> > > > > > > >>> >>> node. First, I will try reducing ZK to only 3 nodes. Then,
> >> > > > > > > >>> >>> I will try a 3-node external ZK.
> >> > > > > > > >>> >>>
> >> > > > > > > >>> >>> Thanks,
> >> > > > > > > >>> >>> Mark
> >> > > > > > > >>> >>>
> >> > > > > > > >>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <[email protected]> wrote:
> >> > > > > > > >>> >>>
> >> > > > > > > >>> >>>> Are you using the embedded Zookeeper?  If yes we
> >> > > > > > > >>> >>>> recommend using an external zookeeper.
> >> > > > > > > >>> >>>>
> >> > > > > > > >>> >>>> What type of load are the systems under when this occurs
> >> > > > > > > >>> >>>> (cpu, network, memory, disk io)?  Under high load the
> >> > > > > > > >>> >>>> default timeouts for clustering are too aggressive.  You
> >> > > > > > > >>> >>>> can relax these for higher load clusters and should see
> >> > > > > > > >>> >>>> good behavior.  Even if the system overall is not under
> >> > > > > > > >>> >>>> all that high of load if you're seeing garbage collection
> >> > > > > > > >>> >>>> pauses that are lengthy and/or frequent it can cause the
> >> > > > > > > >>> >>>> same high load effect as far as the JVM is concerned.
> >> > > > > > > >>> >>>>
> >> > > > > > > >>> >>>> Thanks
> >> > > > > > > >>> >>>> Joe
> >> > > > > > > >>> >>>>
> >> > > > > > > >>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <[email protected]> wrote:
> >> > > > > > > >>> >>>> > We have a cluster which is showing signs of
> >> > > > > > > >>> >>>> > instability. The Primary Node and Coordinator are
> >> > > > > > > >>> >>>> > reassigned to different nodes every several minutes. I
> >> > > > > > > >>> >>>> > believe this is due to lack of heartbeat or other
> >> > > > > > > >>> >>>> > coordination.
> >> > > > > > > >>> >>>> > The following error occurs periodically in the
> >> > > > > > > >>> >>>> > nifi-app.log
> >> > > > > > > >>> >>>> >
> >> > > > > > > >>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn Unexpected Exception:
> >> > > > > > > >>> >>>> > java.nio.channels.CancelledKeyException: null
> >> > > > > > > >>> >>>> >     at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
> >> > > > > > > >>> >>>> >     at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
> >> > > > > > > >>> >>>> >     at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
> >> > > > > > > >>> >>>> >     at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
> >> > > > > > > >>> >>>> >     at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
> >> > > > > > > >>> >>>> >     at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
> >> > > > > > > >>> >>>> >
> >> > > > > > > >>> >>>> > Apache NiFi 1.2.0
> >> > > > > > > >>> >>>> >
> >> > > > > > > >>> >>>> > Thoughts?
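
For anyone who wants to scan a nifi-app.log for the slow heartbeat sends discussed in this thread, here is a minimal Python sketch. It assumes only the "sent to ... at ...; send took N millis" wording that appears in the quoted log excerpt. The function name, the 500 ms threshold, and the sample lines are illustrative and are not part of NiFi.

```python
import re

# Matches the tail of a heartbeat line as quoted in this thread.
# Assumption: the log keeps the "sent to <host> at <timestamp>; send took N millis" wording.
SEND_TOOK = re.compile(r"sent to (\S+) at (\S+ \S+); send took (\d+) millis")


def slow_heartbeats(lines, threshold_ms=500):
    """Return (sent_at, millis) for each heartbeat whose send time exceeds threshold_ms."""
    slow = []
    for line in lines:
        match = SEND_TOOK.search(line)
        if match is None:
            continue  # not a heartbeat line, e.g. a Curator state change
        millis = int(match.group(3))
        if millis > threshold_ms:
            slow.append((match.group(2), millis))
    return slow


# Sample lines modeled on the excerpt in this thread (hypothetical, shortened).
sample = [
    "2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1] Heartbeat created at "
    "2017-05-25 13:39:55,966 and sent to FQDN:PORT at 2017-05-25 13:39:56,089; send took 129 millis",
    "2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0] Connection State changed to SUSPENDED",
    "2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] Heartbeat created at "
    "2017-05-25 13:40:01,632 and sent to FQDN:PORT at 2017-05-25 13:40:02,550; send took 917 millis",
]

print(slow_heartbeats(sample))  # [('2017-05-25 13:40:02,550', 917)]
```

Lowering `threshold_ms` to 100 would also report the 129 ms send. Lining the reported timestamps up against the SUSPENDED/RECONNECTED entries may help show whether slow sends coincide with the connection state changes.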
