Hi Till,

I am glad that you're interested. Let me first say that a change has been submitted, has been reviewed by Murtadha, and is now being reviewed by Chris. Not only that, but I first implemented it completely using RMI and then re-implemented it completely using Zookeeper.

Everything you stated is correct. The solution that was implemented only deals with knowing when the cluster is up during the startup process. This seemed urgent to me since I am running into it with almost every change that I try to verify before I push to Gerrit, and others have seen it too. Knowing the state of the cluster (e.g., through the Managix describe command) still relies on checking whether the processes are running (someone correct me if this is wrong).

So what I did is the following: when Managix starts the CC, it simply listens on Zookeeper until the CC reports its state. This is currently only done during the startup process. As Ian has said, he was/is using a polling mechanism to determine if the server is up. I still think what we implemented is a more elegant solution that doesn't involve polling at all.
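To make that a bit more concrete, the hand-off works roughly like the sketch below. This is only an illustration: the class name, the znode path, and the timeout handling are made up here and are not necessarily what the change under review does.

    // Sketch: the CC publishes its state to a znode and Managix blocks on a
    // watch until it appears, instead of polling. Path/class names are illustrative.
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ClusterStartupHandoff {
        private static final String STATE_PATH = "/asterix/myInstance/cc-state"; // illustrative znode

        // CC side: called once startup has completed, so watchers get notified.
        public static void reportActive(ZooKeeper zk) throws Exception {
            byte[] state = "ACTIVE".getBytes("UTF-8");
            if (zk.exists(STATE_PATH, false) == null) {
                zk.create(STATE_PATH, state, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } else {
                zk.setData(STATE_PATH, state, -1);
            }
        }

        // Managix side: block until the CC has reported, without a polling loop.
        public static boolean waitUntilActive(ZooKeeper zk, long timeout, TimeUnit unit) throws Exception {
            final CountDownLatch reported = new CountDownLatch(1);
            Watcher watcher = new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    if (event.getType() == Watcher.Event.EventType.NodeCreated
                            || event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                        reported.countDown();
                    }
                }
            };
            // exists() registers the watch; if the CC already reported, return immediately.
            if (zk.exists(STATE_PATH, watcher) != null) {
                return true;
            }
            return reported.await(timeout, unit);
        }
    }

The point is that the installer blocks on a Zookeeper watch and gets woken up by the CC, instead of repeatedly checking whether the processes look alive.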
Anyone is welcome to look at the change and suggest changes to it before we merge it :-)

~Abdullah.

On Fri, Aug 28, 2015 at 8:58 AM, Till Westmann <[email protected]> wrote:

> I’m not really deep into this topic, but I’d like to understand a little better.
>
> As I understand it, we currently have 2 ways to deploy/manage AsterixDB: a) using Managix and b) using YARN. And Managix uses Zookeeper to manage its information, but YARN doesn’t. Also, neither the Asterix CC nor the NC depends on the existence of Zookeeper.
>
> Is this correct so far?
>
> Now we are trying to find a way to ensure that an AsterixDB client can reliably know if the cluster is up or down.
>
> My first assumption for the properties that the solution to this problem would have is:
> 1) The knowledge of whether the cluster is up or down is available in the CC (as it controls the cluster).
> 2) The mechanism used to expose that information works for both ways to deploy/manage a cluster.
>
> A simple way to do that seems to be to send a request “waitUntilStarted” to the CC that returns to the client once the CC has determined that everything has started. The response to that request would either be “yes” (cluster is up), “no” (an error occurred and it won’t be up without intervention), or “not sure” (timeout - please ask again later). This would imply that the client is polling, but it wouldn’t be very busy if the timeout is reasonable.
>
> Now this doesn’t seem to be where the discussion is going and I’d like to find out where it is going and why.
>
> Could you help me?
>
> Thanks,
> Till
>
> On Aug 25, 2015, at 7:23 AM, Raman Grover <[email protected]> wrote:
>
> > As I mentioned before...
> > "The information for an AsterixDB instance is "lazily" refreshed when a management operation is invoked (using managix set of commands) or an explicit describe command is invoked."
> >
> > Above, the commands are the Managix set of commands (create, start, describe etc.) that trigger a refresh, and so it's "lazy". Currently the CC does not notify Managix. What we are discussing is an elegant way to have the CC relay information to Managix.
> >
> > On Tue, Aug 25, 2015 at 4:10 AM, abdullah alamoudi <[email protected]> wrote:
> >
> >> I don't think that is there yet but the intention is to have it at some point in the future.
> >>
> >> Cheers,
> >> Abdullah.
> >>
> >> On Tue, Aug 25, 2015 at 12:38 PM, Chris Hillery <[email protected]> wrote:
> >>
> >>> Very interesting, thank you. Can you point out a couple places in the code where some of this logic is kept? Specifically where "CC can update this information and notify Managix" sounds interesting...
> >>>
> >>> Ceej
> >>> aka Chris Hillery
> >>>
> >>> On Tue, Aug 25, 2015 at 12:49 AM, Raman Grover <[email protected]> wrote:
> >>>
> >>>>> , and what code is responsible for keeping it up-to-date?
> >>>>>
> >>>> Apparently, no one is :-)
> >>>>
> >>>> The information for an AsterixDB instance is "lazily" refreshed when a management operation is invoked (using managix set of commands) or an explicit describe command is invoked. Between the time t1 (when the state of an AsterixDB instance changes, say due to an NC failure) and t2 (when a management operation is invoked), the information about the AsterixDB instance inside Zookeeper remains stale. The CC can update this information and notify Managix; this way Managix realizes the changed state as soon as it has occurred. This can be particularly useful when showing the up-to-date state of an instance in real time on a management console or having Managix respond to an event.
> >>>>
> >>>> Regards,
> >>>> Raman
> >>>>
> >>>> ---------- Forwarded message ----------
> >>>> From: abdullah alamoudi <[email protected]>
> >>>> Date: Tue, Aug 25, 2015 at 12:27 AM
> >>>> Subject: Re: The solution to the sporadic connection refused exceptions
> >>>> To: [email protected]
> >>>>
> >>>> On Tue, Aug 25, 2015 at 3:40 AM, Chris Hillery <[email protected]> wrote:
> >>>>
> >>>>> Perhaps an aside, but: exactly what is kept in Zookeeper
> >>>>
> >>>> A serialized instance of edu.uci.ics.asterix.event.model.AsterixInstance
> >>>>
> >>>>> , and what code is responsible for keeping it up-to-date?
> >>>>>
> >>>> Apparently, no one is :-)
> >>>>
> >>>>> Ceej
> >>>>>
> >>>>> On Mon, Aug 24, 2015 at 5:28 PM, Raman Grover <[email protected]> wrote:
> >>>>>
> >>>>>> Well, the state of an instance (and metadata including configuration) is kept in a Zookeeper instance that is accessible to Managix and the CC. The CC should be able to set the state of the cluster in Zookeeper under the right znode, which can be viewed by Managix.
> >>>>>>
> >>>>>> There exists a communication channel for the CC and Managix to share information on state etc. I am not sure if we need another channel such as RMI between Managix and the CC.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Raman
> >>>>>>
> >>>>>> On Mon, Aug 24, 2015 at 12:58 PM, abdullah alamoudi <[email protected]> wrote:
> >>>>>>
> >>>>>>> Well, it depends on your definition of the boundaries of Managix. What I did is that I added an RMI object in the InstallerDriver which basically listens for state changes from the cluster controller. This means some additional logic in the CCApplicationEntryPoint where, after the CC is ready, it contacts the InstallerDriver using RMI, and only at that point can the InstallerDriver return to Managix and tell it that the startup is complete.
> >>>>>>>
> >>>>>>> Not sure if this is the right way to do it but it definitely is better than what we currently have.
> >>>>>>> Abdullah.
> >>>>>>>
> >>>>>>> On Mon, Aug 24, 2015 at 10:00 PM, Chris Hillery <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hopefully the solution won't involve additional important logic inside Managix itself?
> >>>>>>>>
> >>>>>>>> Ceej
> >>>>>>>> aka Chris Hillery
> >>>>>>>>
> >>>>>>>> On Mon, Aug 24, 2015 at 7:26 AM, abdullah alamoudi <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> That works but it doesn't feel right doing it this way. I am going to fix this one for good.
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> Abdullah.
> >>>>>>>>>
> >>>>>>>>> On Mon, Aug 24, 2015 at 5:11 PM, Ian Maxon <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> The way I assured liveness for the YARN installer was to try running "for $x in dataset Metadata.Dataset return $x" via the API. I just polled for a reasonable amount of time (though honestly, thinking about it now, the correct parameter to use for the polling interval is the startup wait time in the parameters file :) ). It's not perfect, but it gives fewer false positives than just checking ps for processes that look like CCs/NCs.
> >>>>>>>>>>
> >>>>>>>>>> - Ian.
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Aug 24, 2015 at 5:03 AM, abdullah alamoudi <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Now that I think about it, maybe we should provide multiple ways to do this: a polling mechanism that can be used at an arbitrary time and a pushing mechanism on startup. I am going to start the implementation of this and will probably use RMI for this task both ways (CC to InstallerDriver and InstallerDriver to CC).
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>> Abdullah.
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Aug 24, 2015 at 2:19 PM, abdullah alamoudi <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> So after further investigation, it turned out our startup process just starts the CC and NC processes and then makes sure the processes are running; if the processes were found to be running, it reports the state of the cluster as active and the subsequent test commands can start immediately.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This means that the CC could've started but is not yet ready when we try to process the next command. To address this, we need a better way to tell when the startup procedure has completed. We can do this by pushing (the CC informs the installer driver when the startup is complete) or polling (the installer driver actually queries the CC for the state of the cluster).
> >>>>>>>>>>>>
> >>>>>>>>>>>> I can do it either way so let's vote. My vote goes to the pushing mechanism. Thoughts?
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Aug 24, 2015 at 10:15 AM, abdullah alamoudi <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> This solution turned out to be incorrect. Actually, the test cases never fail when I build after using the join method, but running an actual asterix instance never succeeds, which is quite confusing.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I also think that the startup script has a major bug where it might return before the startup is complete. More on this later......
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Mon, Aug 24, 2015 at 7:48 AM, abdullah alamoudi <[email protected]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> It is highly unlikely that it is related.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>> Abdullah.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Mon, Aug 24, 2015 at 5:45 AM, Chen Li <[email protected]> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> @Abdullah: Is this issue related to https://issues.apache.org/jira/browse/ASTERIXDB-1074? Ian and I plan to look into the details on Monday.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Sun, Aug 23, 2015 at 10:08 AM, abdullah alamoudi <[email protected]> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> About 3-4 days ago, I was working on the addition of the filesystem-based feed adapter and it didn't take any time to complete. However, when I wanted to build and make sure all tests pass, I kept getting ConnectionRefused errors which caused the installer tests to fail every now and then.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I knew the new change had nothing to do with this failure, yet I couldn't direct my attention away from this bug (it just bothered me so much and I knew it needed to be resolved ASAP). After wasting countless hours, I was finally able to figure out what was happening :-)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> In the startup routine, we start three Jetty web servers (the Web interface server, the JSON API server, and the Feed server). Some time ago, we used to end the startup call before making sure the server.isStarted() method returns true on all servers. At that time, I introduced the waitUntilServerStarts method to make sure we don't return before the servers are ready. Turned out, that was an incorrect way to handle this (we can blame stackoverflow for this one!) and it is not enough that server.isStarted() returns true. The correct way to do this is to call the server.join() method after the server.start().
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> See:
> >>>>>>>>>>>>>>>> http://stackoverflow.com/questions/15924874/embedded-jetty-why-to-use-join
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This was as satisfying as it was frustrating, and you are welcome for the future time I saved each of you :)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>> Amoudi, Abdullah.

--
Amoudi, Abdullah.
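P.S. For comparison, the polling fallback that Till and Ian describe above would look roughly like the loop below. The port, the endpoint, and the query string here are assumptions for illustration and are not verified against the current HTTP API.

    // Hypothetical polling fallback: keep issuing a trivial metadata query against the
    // AsterixDB HTTP API until it succeeds or the startup-wait timeout expires.
    // Port 19002, the /query endpoint, and the fixed poll interval are assumptions.
    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class StartupPoller {

        public static boolean waitUntilQueryable(String host, long timeoutMillis) throws Exception {
            String query = URLEncoder.encode("for $x in dataset Metadata.Dataset return $x", "UTF-8");
            URL url = new URL("http://" + host + ":19002/query?query=" + query); // assumed endpoint
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
                try {
                    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                    conn.setConnectTimeout(2000);
                    conn.setReadTimeout(5000);
                    if (conn.getResponseCode() == 200) {
                        return true; // the CC answered a real query, so the cluster is usable
                    }
                } catch (IOException e) {
                    // connection refused etc. -- the cluster is not up yet, keep waiting
                }
                Thread.sleep(1000); // poll interval; Ian suggests tying this to the startup wait time
            }
            return false;
        }
    }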
