While I applaud raising the issue on the mailing list to get more folks to weigh in, I think part of the problem may be the lack of a [sahara] tag on the subject. The thread is still tagged as a Trove-centric conversation. All respondents, please consider adding [sahara] to the subject.
Thanks,
Kevin

________________________________________
From: Amrith Kumar [amr...@tesora.com]
Sent: Thursday, January 07, 2016 1:59 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [trove] Adding support for HBase in Trove

> -----Original Message-----
> From: michael mccune [mailto:m...@redhat.com]
> Sent: Thursday, January 07, 2016 3:12 PM
> To: openstack-dev@lists.openstack.org
> Subject: Re: [openstack-dev] [trove] Adding support for HBase in Trove
>
> On 01/07/2016 11:59 AM, Amrith Kumar wrote:
> > From the things that you and Pete (Peter MacKinnon) are saying, I don't understand why there is an objection to accepting the currently proposed implementation which is clearly for single node deployments? Both Standalone and Pseudo-Distributed are by definition, explicitly, necessarily, absolutely, positively, definitely single node. I can't be more explicit about that. That's all that is being proposed at this time. See more comments below.
>
> i didn't think i explicitly objected to the spec, if it seems that way then i apologize. after reading the spec and the comments, it seemed that there was some question about engagement with the sahara team. i wanted to help bring some light to the issues surrounding deploying hbase and thought it would be good to participate in the discussion.

You are correct Michael. There was a suggestion that we should engage with the Sahara team (in the Trove team meeting yesterday) and that is what prompted this email thread. So I appreciate your participation as one who is a member of the Sahara team.

> > Further, the current proposal also chooses an implementation strategy that makes it much easier to handle fully-distributed in a different way in the future. Consider this, Trove could equally well have dealt with HBase using a single datastore for all operating modes.
> > In the current implementation, one would create a HBase standalone instance using a command that included:
> >
> > --datastore hbase-standalone
> >
> > And a pseudo-distributed instance by including
> >
> > --datastore hbase-pseudo-distributed.
>
> and this delineation sounds reasonable to me
>
> > Trove could equally well function by having a single datastore (hbase) but this would make hbase-fully-distributed harder to do in a different way in the future. I consciously eschewed that path, for this very specific reason; it would limit choice in the future.
>
> agreed
>
> > Now, the implementation behind hbase-fully-distributed could be a custom Trove guest agent that could (if we decided to go that route) interact with Sahara. However, an alternative implementation of hbase-fully-distributed could orchestrate everything natively in Trove. There is much flexibility in the current proposal, and I submit to you that this is being lost in your reading of the specification and the current implementation as proposed.
>
> i don't think your characterization of my reading comprehension is fair. as i stated earlier, i wanted to participate in the discussion surrounding deploying a technology that sahara currently deploys. fwiw, i agree with what you are saying here, but i also think it is axiomatic, the trove team can choose whichever path it would like for implementation.
>
> >> i think this sounds reasonable, as long as we are limiting it to standalone mode. if the deployments start to take on a larger scope i agree it would be useful to leverage sahara for provisioning and scaling.
> >
> > Why only standalone? The current proposal explicitly covers only standalone and pseudo-distributed which are both valid strictly (add other adjectives here to taste) single node topologies and the currently submitted specification specifically carves out fully-distributed operation as requiring further thought and contemplation.
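The design point in this part of the exchange (one datastore name per operating mode, so that fully-distributed can later get its own implementation) can be pictured with a small Python sketch. Everything in it is hypothetical: the class names, the registry, and the `register` and `manager_for` helpers are illustrations only and are not Trove's actual code or API.

```python
class StandaloneHBase:
    """Hypothetical handler for the single-node standalone mode."""

    def describe(self):
        return "single node, standalone"


class PseudoDistributedHBase:
    """Hypothetical handler for the single-node pseudo-distributed mode."""

    def describe(self):
        return "single node, pseudo-distributed"


# Hypothetical registry keyed by the --datastore values named in the thread.
DATASTORES = {
    "hbase-standalone": StandaloneHBase,
    "hbase-pseudo-distributed": PseudoDistributedHBase,
}


def register(name, handler_cls):
    """Add a handler for a new datastore name without touching existing ones."""
    DATASTORES[name] = handler_cls


def manager_for(datastore):
    """Return a handler instance for a datastore name, or raise if unknown."""
    try:
        return DATASTORES[datastore]()
    except KeyError:
        raise ValueError("unsupported datastore: %s" % datastore)


if __name__ == "__main__":
    print(manager_for("hbase-standalone").describe())
    print(manager_for("hbase-pseudo-distributed").describe())
```

Under this assumption, a later fully-distributed mode (whether it talks to Sahara or orchestrates natively) would be one more `register("hbase-fully-distributed", ...)` call, and the two single-node handlers would stay unchanged.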
>
> i think starting with standalone mode (and not pseudo-distributed) is a more conservative approach to this. my reason for suggesting limiting this to standalone is that even in pseudo-distributed mode the need for managing hdfs and zookeeper is present, i wanted to highlight some of the overlap and the issues that will start to creep in surrounding this deployment.

The current code (submitted for review) provides both standalone and pseudo-distributed support. You will observe that the standalone and pseudo-distributed implementations do install zookeeper. As you are no doubt aware, one of the recommended ways to force the HBase Master server to always bind to a well-known port rather than to ephemeral ports is to stipulate hbase.cluster.distributed is True (see https://review.openstack.org/#/c/262048/5/scripts/files/elements/ubuntu-hbase-standalone/install.d/20-install-hbase line 121). So, as it turns out, the code to deploy hdfs and zookeeper is already part of the proposed implementation.

> >> as the hbase installation grows beyond the standalone mode there will necessarily need to be hdfs and zookeeper support to allow for a proper production deployment. this also brings up questions of allowing the end-users to supply configurations for the hdfs and zookeeper processes, not to mention enabling support for high availability hdfs.
> >
> > These are things that Trove already addresses, albeit in a different way than Sahara. Users can, as it turns out, specify configuration groups which can then be used to launch new instances, and can also be associated with groups of instances.
>
> i am merely identifying issues that trove will need to reproduce, i'm not deeply familiar with the configuration options that trove exposes but i am guessing that it is currently not generating the configurations specific to hdfs and zookeeper.
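For readers following the configuration points above: HBase reads settings such as hbase.cluster.distributed from hbase-site.xml, which uses the Hadoop-style layout of a configuration element containing property elements, each with a name and a value. A minimal sketch of writing and parsing that layout in Python follows. The function names and the sample settings are illustrative assumptions and are not taken from the Trove code under review.

```python
import xml.etree.ElementTree as ET


def dump_site_xml(options):
    """Serialize a dict of settings into Hadoop-style site XML text."""
    root = ET.Element("configuration")
    for name, value in sorted(options.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        # Booleans are written in lowercase, the usual form in these files.
        if isinstance(value, bool):
            value = "true" if value else "false"
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


def load_site_xml(text):
    """Parse Hadoop-style site XML text back into a dict of strings."""
    root = ET.fromstring(text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


if __name__ == "__main__":
    xml_text = dump_site_xml({"hbase.cluster.distributed": True})
    print(xml_text)
    print(load_site_xml(xml_text))
```

Values come back as strings after a round trip, so a caller that needs typed values would have to convert them according to its own list of supported options.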
It is equally important, I think, to realize that Trove doesn't have to produce a whole lot of new code to handle this as it already has a robust framework that handles a number of databases. With a relatively small code footprint, a prototype that allows much more flexible configuration support has been built (it has not been sent up for review yet). The majority of that code is a codec for XML; the rest is almost completely handled by the framework, with the exception of a file specifying the configuration options that are to be supported. I'd like to reiterate that Trove, by its very design, was intended to support a number of databases and therefore already has much of the framework in place to add support for a new database, so there isn't a lot of new code that must be 'reproduced' to add this support.

> >> i can envision a scenario where trove could use sahara to provision and manage the clusters for hbase/hdfs/zk. this does pose some questions as we'd have to determine how the trove guest agent would be installed on the nodes, if there will need to be custom configurations used by trove, and if sahara will need to provide a plugin for bare (meaning no data processing framework) hbase/hdfs/zk clusters. but, i think these could be solved by either using custom images or a plugin in sahara that would install the necessary agents/configurations.
> >
> > Let us not underestimate the effort for an end user to now deploy one more project. To a user already using Trove for a myriad of databases, requiring Sahara for supporting HBase Standalone sounds (to put it bluntly) like a burden. Requiring it for Fully-Distributed mode may have some development benefits but it remains to be seen whether those benefits are really worth the contortions that Trove would have to go through.
> > And in the Trove architecture, there is flexibility as described above to have multiple possible implementations for fully-distributed, one that would interface with Sahara and another that didn't have to.
>
> i agree about the installation issues when we are talking about standalone versus distributed. as for the contortions that trove may have to go through to integrate with sahara, i think it would be worth it, but i'm probably biased here ;)
>
> > Let's be clear that for a person who wants a fully configurable Hadoop-based deployment with more control, Sahara may be the best option. And to one who wants even more control, maybe doing it themselves with Nova and custom Glance images is the way to go. Similarly, a Database-as-a-Service comes with the understood boundaries imposed by the "as-a-Service" deployment. Not all configuration options may be tweakable with a DBaaS, that's well known and understood, not just in Trove but also, for example, in Amazon RDS, Redshift or any of the other database-as-a-service implementations. The same would be true in fully-distributed as well, in the proposal that is currently under review. I submit to you that this nuance is being lost in your reading.
>
> i'd like to think that for someone who wants a fully configurable hadoop-based deployment, sahara is the best option =)
>
> i think we generally agree here about the deployment of "-aaS" services in openstack, and again i disagree with your characterization of my reading comprehension...
>
> >> of course, this does add a layer of complexity as operators who wish this type of deployment will need to have both trove and sahara, but imo this would be easier than replicating the work that sahara has done with these technologies.
> >
> > I think this is where our opinions differ, as the 'replication' isn't all that much given the fact that Trove already provides capabilities to cluster databases.
> > But, with that said, nothing in the current specification locks us into a specific deployment strategy in the future, nor does it preclude multiple implementations of fully-distributed, one which could leverage Sahara and one which didn't.
>
> respectfully, i think there is more effort involved with the management of the pseudo-distributed mode than standalone, and that is more where my comments are oriented towards. mind you, provisioning might be a simple matter for trove as it stands now, but i think the potential for issues could get deeper with pseudo-distributed.

Here, again, I want to point out that the issues will definitely be more with pseudo-distributed than with standalone. But, Trove is already a multi-database framework and therefore adding support for one more database doesn't require a whole new implementation.

> i'm glad that you are open to the idea of implementations that may involve other projects (namely sahara) in the future. as i said in the beginning, given the comments about sahara in the spec and the review i wanted to make sure we got a few more eyes on this to bring our experience to the table.

Absolutely, that's the intent of the ML conversation.
>
> regards,
> mike

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev