The initial idea is to provide only algebraic independence. We may want to ask engines to support persistence operations to/from HDFS as it stands, since a couple dozen projects dealing with distributed data currently ask for that as well, but in general algebraic expressions are agnostic to how inputs come into existence.
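For illustration, here is a minimal sketch of what that engine-agnostic algebra looks like in the Scala DSL. It assumes an implicit DistributedContext is already in scope (however it came into existence), and the toy matrix is made up:

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.drm._
    import RLikeOps._
    import RLikeDrmOps._

    // Nothing below names Spark or h2o; the engine enters only through
    // the implicit DistributedContext assumed to be in scope.
    val inCore = dense((1, 2), (3, 4), (5, 6))
    val drmA = drmParallelize(inCore, numPartitions = 2)

    // Engine-neutral algebra: compute A'A and bring the result back in core.
    val ata = (drmA.t %*% drmA).collect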
When building an e2e distributed application, naturally, algebra is not enough. Heck, it is not even enough for moderately involved logic inside an algorithm, so quasi-algebraic algorithms are expected. This is an obstinate reality. But engine independence, or even partial portability, is only one side of the story, and it is not the biggest one. So the hope is that (1) the algebraic part is still significant enough that the non-algebraic part of an algorithm could be ported more easily if needed; or (2) for folks like me, one version of an algorithm is quite enough, and the engine independence side of the story becomes a much smaller story, making the other sides (i.e. the convenience and semantics of the algebraic translation itself) much more prominent. What this means is that quasi-portable algorithms are expected to happen, and I wouldn't be overly heartbroken about adding things only to the Spark side -- either as a first port, or even for good. After all, I am all for solving problems that actually exist. I probably have a need for cooccurrence work with a Spark deployment, but I have no need for CF on H2O, so I wouldn't care if a quasi-port exists for H2O. Folks who do are welcome to contribute a quasi-algebraic port.

On Fri, Jul 11, 2014 at 12:53 PM, Pat Ferrel <[email protected]> wrote:

> On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel <[email protected]>
> wrote:
>
> >> Duplicated from a comment on the PR:
> >>
> >> Beyond these details (specific merge issues) I have a bigger problem
> >> with merging this. Now every time the DSL is changed it may break
> >> things in h2o specific code. Merging this would require every
> >> committer who might touch the DSL to sign up for fixing any broken
> >> tests on both engines.
> >>
> >> To solve this the entire data prep pipeline must be virtualized to run
> >> on either engine so the tests for things like CF and ItemSimilarity or
> >> matrix factorization (and the multitude of others to come) pass and
> >> are engine independent. As it stands, any DSL change that breaks the
> >> build will have to rely on a contributor's fix. Even if one of you
> >> guys was made a committer we will have this problem where a needed
> >> change breaks one or the other engine's specific code. Unless 99% of
> >> the entire pipeline is engine neutral the build will be unmaintainable.
> >>
> >> For instance, I am making a small DSL change that is required for
> >> cooccurrence and ItemSimilarity to work. This would break
> >> ItemSimilarity and its tests, which are in the spark module, but since
> >> I’m working on that I can fix everything. If someone working on an h2o
> >> specific thing had to change the DSL in a way that broke spark code
> >> like ItemSimilarity, you might not be able to fix it, and I certainly
> >> do not want to fix stuff in h2o specific code when I change the DSL. I
> >> have a hard enough time keeping mine running :-)
> >
> > The way I interpret the above points, the problem you are trying to
> > highlight is with having multiple backends in general, and not this
> > backend in specific? Hypothetically, even if this backend is abandoned
> > for the above "problems", as more backends get added in the future, the
> > same "problems" will continue to apply to all of them.
>
> Yes, exactly. Adding backends is only maintainable if backend specific
> code (code in the spark module for now) is squeezed down to near zero.
> The more that is there, the more code there will be duplicated in the h2o
> modules. Test breakage illustrates the problem; it does not express the
> breadth or depth of the problem.
>
> >> Crudely speaking this means doing away with all references to a
> >> SparkContext and any use of it. So it's not just a matter of
> >> reproducing the spark module but reducing the need for one. Making it
> >> so small that breakages in one or the other engine's code will be
> >> infrequent and changes to neutral code will only rarely break an
> >> engine that the committer is unfamiliar with.
> >
> > I think things are already very close to this "ideal" situation you
> > describe above. As a pipeline implementor we should just use
> > DistributedContext, and not SparkContext. And we need an engine neutral
> > way to get hold of a DistributedContext from within the math-scala
> > module, like this pseudocode:
> >
> > import org.apache.mahout.math.drm._
> >
> > val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
> >   System.getenv("BACKEND_ID"), opts...)
> >
> > If environment variables are not set, DistributedContextCreate could
> > default to Spark and local. But all of the pipeline code should ideally
> > exist outside any engine specific module.
>
> The Readers and Writers rely on
>
> var columns = mc.textFile(source).map { line => line.split(delimiter) }
>
> This will not run unless the DistributedContext is actually implemented
> by SparkContext.
>
> Running item similarity on the epinions dataset requires Spark executor
> memory to be 5g in the SparkConf, so this has to be passed in to Spark.
> What is it for h2o? Do I as the implementor have to figure out important
> tuning factors for every engine?
>
> I need a serializer for HashBiMap to be registered with Kryo or the Spark
> version will not run. What analogous problems exist for h2o? How much
> time will it take me to figure them out?
>
> Answers are moot. The fact that questions come up so often is the issue.
> It took me a fair amount of time to discover these tuning and setup
> issues with only one engine.
>
> The more duplicated code there is, the bigger this problem is, and the
> greater the impedance mismatch between spark and h2o, the bigger the
> problem is. This directly affects how fast Mahout is moving. If there
> were some clear reason for taking this productivity hit other than some
> idea that engine independence sounds clean or good, then it would be
> easier to accept. Still so many questions, and we are being asked to
> merge this into the mainstream?
>
> I am tired of debating this, so I’ll just say that until the spark and
> h2o modules are tiny and trivial, two engines will be a major
> productivity hit, and so until the “ideal” is met: -1 on merge.
>
> If people want to work on making the spark and h2o modules small --
> increasing engine independence -- great. But ask yourself why? Seems like
> if Anand has a build that works on both, we should be able to run some
> non-trivial standard data through them on identical clusters and compare
> speed.
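To make the DistributedContextCreate pseudocode above concrete, here is a minimal sketch of one way such an engine-neutral factory could work. The registry, the register hook, and the environment-variable defaults are assumptions layered on Anand's pseudocode, not existing Mahout API:

    import org.apache.mahout.math.drm.DistributedContext

    object DistributedContextCreate {
      // Each engine module (spark, h2o, ...) would register its constructor
      // at load time, so math-scala pipeline code never references
      // SparkContext directly.
      private val engines =
        scala.collection.mutable.Map.empty[String, String => DistributedContext]

      def register(name: String, ctor: String => DistributedContext): Unit =
        engines(name.toLowerCase) = ctor

      // Pick backend and master from the environment, defaulting to Spark
      // local as suggested in the pseudocode above.
      def apply(): DistributedContext = {
        val backend = Option(System.getenv("MAHOUT_BACKEND")).getOrElse("spark")
        val master  = Option(System.getenv("BACKEND_ID")).getOrElse("local")
        val ctor = engines.getOrElse(backend.toLowerCase,
          sys.error(s"no engine registered for backend '$backend'"))
        ctor(master)
      }
    }

A pipeline would then call val dc = DistributedContextCreate() and pass dc around implicitly, without ever touching an engine module directly.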

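As a side note on the tuning and setup issues Pat raises, this is roughly what the Spark-only knobs look like. The registrator class and the JavaSerializer fallback for HashBiMap are illustrative assumptions, not a statement of what the spark module actually registers:

    import com.esotericsoftware.kryo.Kryo
    import com.esotericsoftware.kryo.serializers.JavaSerializer
    import com.google.common.collect.HashBiMap
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    // Hypothetical registrator making Guava's HashBiMap Kryo-safe by
    // falling back to Java serialization for that one class.
    class MahoutKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit =
        kryo.register(classOf[HashBiMap[_, _]], new JavaSerializer())
    }

    // The executor memory and Kryo settings from the discussion live in
    // SparkConf -- configuration an h2o deployment never sees.
    val conf = new SparkConf()
      .set("spark.executor.memory", "5g")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[MahoutKryoRegistrator].getName)

The point stands either way: each engine brings its own set of such knobs, and a pipeline author ends up discovering them per engine.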