Either via extension point or OSGi service, yes (a rough sketch of the extension-point wiring is at the bottom of this message, below the quoted thread). While I agree we would like user involvement to be minimal (even compared to SVN), the user will be involved in a few ways:
1) Encountering a situation where a connector is needed that is not present. Hopefully, this can be presented to the user in unambiguous and clear terms.
2) Locating the needed connector.
3) Installing it.

We can avoid this most of the time by including a full range of connectors with the original update site. And if it's just a new version of Hadoop, a simple update from the original site can handle it. If we can induce Cloudera to supply the connector for their releases, even better. But the potential will always be there for the need for a connector to some different environment. So to that extent, I think a connector is a user-visible concept.

I hope we can avoid the situation where the user has to make a decision about which connector to use. For that to work reliably, we'd need to be sure the network protocols provide sufficient information to clearly identify the target environment. There may be some issues to resolve there on the Hadoop side.

I think we should consider SSH, local, and grid connectors as well. Not as core offerings (unless people find them very useful, perhaps in administering a cluster). I think of them as outlier cases that would help keep the architecture honest, forcing us to consider, "if this functionality is not available, what should we do?" And I have a suspicion that we'd find that very useful with our products.

Note that I'm not suggesting we provide full functionality, but rather that we provide API and UI that, at the level of "run a job" or "see if the job has finished" or "look at files", can be agnostic about the back end and its specific capabilities. This can lead to an architecture where those specific capabilities are exposed incrementally by the individual connector, rather than by an upgrade of the core.

For example, I think connectors should provide metadata that allows configuration UI fragments to be constructed, so a wizard to submit a job can expose options that are specific to the target. (Metadata would be better than supplying actual UI fragments, because it allows more reuse, for example in a web-based interface, and helps ensure consistency.)

Also, I think we should expose a clean API that gives access to the functionality made available by the connectors, so it can be used by other tools besides our own. For example, at FICO, this would let us more closely integrate with our tools, and would encourage people to contribute additional useful tools.

I think the connector architecture should also support parts of the ecosystem beyond MapReduce and HDFS, for example Hive and HBase. This might be as simple as a design pattern and a shared registry for connectors. We could drive this with separate connectors for MapReduce and HDFS, rather than one monolithic connector. (We would not need to package the MapReduce and HDFS connectors as two bundles -- just register two connectors to service the two functional areas.)

I think all connectors would want to work with a shared description of the target environment. In fact, you'd like to start up by loading a descriptor that describes the target environment, and have that drive selection of the proper connectors from the available set. (To what extent the existing configuration files meet the need is an open question, but a pointer to a directory containing them would do as a starting point.) The point is, configuration data describes the target, and so should be conceptually shared, rather than specified (with partial redundancy) for each type of connector.
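To make the metadata and shared-descriptor ideas a little more concrete, here is a rough Java sketch. Every name in it (Connector, ConnectorParameter, TargetDescriptor, ConnectorRegistry) is something I made up for illustration, not proposed API:

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical description of one configuration option a connector exposes.
// A wizard (or a web UI) could render these generically, instead of the
// connector shipping its own UI fragments.
final class ConnectorParameter {
    final String key;          // e.g. a made-up "mapreduce.queue"
    final String label;        // human-readable label for the UI
    final String type;         // "string", "int", "boolean", ...
    final String defaultValue; // may be null

    ConnectorParameter(String key, String label, String type, String defaultValue) {
        this.key = key;
        this.label = label;
        this.type = type;
        this.defaultValue = defaultValue;
    }
}

// Hypothetical shared description of the target environment, loaded once at
// startup (perhaps from a directory of existing Hadoop config files) and
// handed to every connector, so the configuration is not repeated per
// connector type.
final class TargetDescriptor {
    private final Properties props = new Properties();

    String get(String key) { return props.getProperty(key); }
    void put(String key, String value) { props.setProperty(key, value); }
}

// Hypothetical common contract: one implementation per functional area
// (MapReduce, HDFS, later Hive/HBase) and per supported version range.
interface Connector {
    String functionalArea();                   // "mapreduce", "hdfs", ...
    boolean supports(TargetDescriptor target); // can I talk to this environment?
    List<ConnectorParameter> parameters();     // drives generic config UI
    void connect(TargetDescriptor target) throws Exception;
}

// Hypothetical shared registry. The MapReduce and HDFS connectors could live
// in a single bundle; they would simply register separately here for their
// two functional areas.
final class ConnectorRegistry {
    private final List<Connector> connectors = new ArrayList<Connector>();

    void register(Connector c) { connectors.add(c); }

    // Pick the first registered connector for an area that claims to support
    // the described target environment.
    Connector select(String area, TargetDescriptor target) {
        for (Connector c : connectors) {
            if (c.functionalArea().equals(area) && c.supports(target)) {
                return c;
            }
        }
        return null; // caller decides how to surface "no connector available"
    }
}

The supports() check is where the static-versus-dynamic question below bites: it could trust a version declared in the descriptor, probe the wire, or some combination of the two.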
Exactly how this bootstraps, and how much configuration lives in a static descriptor vs. dynamic discovery (e.g., OK, we have Hadoop, but which version? OK, we have Cassandra, but which version?) is something we'd need to work out. If you check the network protocol to discover the protocol version, you have a chicken-and-egg problem. If you get it from a static specification, you have maintenance and accuracy headaches. We'd like connectors and the core HDT to be as independent of each other as possible.

On Fri, Jan 11, 2013 at 11:27 AM, Adam Berry <[email protected]> wrote:
> I think we will need a discussion on how to work the connector/multiple
> version side of this. There are a couple of different ways, and a few
> subtle issues.
>
> But, I think however we actually supply the implementation for versions of
> hadoop, they should be contributed via extension points. This would make it
> easiest to add newer versions down the line, and would also make it
> possible for connectors that support things like CDH, if anything special
> is required. Not that I'm suggesting that we would do these vendor
> connectors, just that it makes most sense to make these extensible.
>
> On Fri, Jan 11, 2013 at 10:02 AM, Simone Gianni <[email protected]> wrote:
>
> > I'm also +1 for Tycho.
> >
> > When talking about connectors, its IMHO fine to use them as an internal
> > abstraction, but I don't know if it's a good idea (or useful) to expose
> > connectors to the user as SVN plugins do (they need to do that for
> > licence problems). Were you talking about internal abstractions for
> > different versions?
> >
> > Simone
> >
> >
> > 2013/1/10 Adam Berry <[email protected]>
> >
> > > Hi everyone,
> > >
> > > First, I've dropped the code from Hadoop contrib into our git repo,
> > > its on its own branch, hadoop-contrib. The reason I put it on a branch
> > > is because I think that splitting things up a little would be a good
> > > idea, and should make it a little easier to support multiple versions
> > > of Hadoop.
> > >
> > > So, the tools as they stand are just in one plugin. Broadly, the
> > > features right now can be divided into;
> > >
> > > MapReduce project and class code support (wizards etc)
> > > Launch support for Hadoop
> > > HDFS interaction
> > >
> > > So taking a root name space of org.apache.hdt, I suggest something
> > > like the following for the plugin names
> > >
> > > org.apache.hdt.core
> > >
> > > org.apache.hdt.ui
> > >
> > > org.apache.hdt.debug.core
> > > org.apache.hdt.debug.ui
> > >
> > > org.apache.hdt.hdfs.core
> > > org.apache.hdt.hdfs.ui
> > >
> > > org.apache.hdt.help
> > >
> > > These may be a little fluid as we get into the details here, but from
> > > 10000 feet it looks ok.
> > >
> > > Finally, I would also like to suggest Tycho (Maven plugin for doing
> > > Eclipse build stuff) as our build tool. I've done my fair share of
> > > pure Ant PDE build stuff over the years, and Tycho is vastly easier,
> > > and would make it much easier for people to build themselves without
> > > having to do a bunch of local setup first.
> > >
> > > Thoughts? If everyone thinks these are ok, I'll enter some issues and
> > > get cracking.
> > >
> > > Cheers,
> > > Adam
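P.S. For the extension-point flavour I mentioned at the top (and that Adam describes above), the wiring on the core side could look roughly like the sketch below, reusing the hypothetical Connector interface from the earlier sketch. The extension point ID org.apache.hdt.core.connector and the attribute names are invented for illustration, not settled API:

import java.util.ArrayList;
import java.util.List;

import org.eclipse.core.runtime.CoreException;
import org.eclipse.core.runtime.IConfigurationElement;
import org.eclipse.core.runtime.Platform;

// Reads connector contributions from a hypothetical extension point. A
// contributing plugin (ours or a vendor's) would declare something like:
//
//   <extension point="org.apache.hdt.core.connector">
//     <connector area="hdfs"
//                versions="1.0.x"
//                class="com.example.Hadoop1HdfsConnector"/>
//   </extension>
//
public final class ConnectorExtensionLoader {

    private static final String EXTENSION_POINT_ID = "org.apache.hdt.core.connector";

    public List<Connector> loadConnectors() {
        List<Connector> result = new ArrayList<Connector>();
        IConfigurationElement[] elements = Platform.getExtensionRegistry()
                .getConfigurationElementsFor(EXTENSION_POINT_ID);
        for (IConfigurationElement element : elements) {
            try {
                // Instantiates the class named in the "class" attribute using
                // the contributing plugin's classloader, so a connector can
                // ship its own Hadoop client jars.
                Object contributed = element.createExecutableExtension("class");
                if (contributed instanceof Connector) {
                    result.add((Connector) contributed);
                }
            } catch (CoreException e) {
                // A broken contribution shouldn't take down the core;
                // log it and keep going.
                e.printStackTrace();
            }
        }
        return result;
    }
}

An OSGi declarative service could do the same job; the main difference is whether contributions are declared in plugin.xml or as service components, and either way the core never needs a compile-time dependency on any particular Hadoop version.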
