> 
> On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel <[email protected]> wrote:
> 
>> Duplicated from a comment on the PR:
>> 
>> Beyond these details (specific merge issues)  I have a bigger problem with
>> merging this. Now every time the DSL is changed it may break things in h2o
>> specific code. Merging this would require every committer who might touch
>> the DSL to sign up for fixing any broken tests on both engines.
>> 
>> To solve this the entire data prep pipeline must be virtualized to run on
>> either engine so the tests for things like CF and ItemSimilarity or matrix
>> factorization (and the multitude of others to come) pass and are engine
>> independent. As it stands any DSL change that breaks the build will have to
>> rely on a contributor's fix. Even if one of you guys was made a committer
>> we will have this problem where a needed change breaks one or the other
>> engine specific code. Unless 99% of the entire pipeline is engine neutral
>> the build will be unmaintainable.
>> 
>> For instance I am making a small DSL change that is required for
>> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity
>> and its tests, which are in the spark module but since I’m working on that
>> I can fix everything. If someone working on an h2o specific thing had to
>> change the DSL in a way that broke spark code like ItemSimilarity you might
>> not be able to fix it and I certainly do not want to fix stuff in h2o
>> specific code when I change the DSL. I have a hard enough time keeping mine
>> running :-)
>> 
> 
> The way I interpret the above points, the problem you are trying to
> highlight is with having multiple backends in general, and not with this
> backend in particular? Hypothetically, even if this backend is abandoned for
> the above "problems", as more backends get added in the future, the same
> "problems" will continue to apply to all of them.
> 

Yes, exactly. Adding backends is only maintainable if backend specific code
(code in the spark module for now) is squeezed down to near zero. The more
that is there, the more code will be duplicated in the h2o modules. Test
breakage illustrates the problem but does not convey its breadth or depth.

> 
>> Crudely speaking this means doing away with all references to a
>> SparkContext and any use of it. So it's not just a matter of reproducing
>> the spark module but reducing the need for one. Making it so small that
>> breakages in one or the other engine's code will be infrequent and changes
>> to neutral code will only rarely break an engine that the committer is
>> unfamiliar with.
>> 
> 
> I think things are already very close to this "ideal" situation you
> describe above. As pipeline implementors we should just use
> DistributedContext, and not SparkContext. And we need an engine neutral way
> to get hold of a DistributedContext from within the math-scala module, like
> this pseudocode:
> 
>  import org.apache.mahout.math.drm._
> 
>  val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
> System.getenv("BACKEND_ID"), opts...)
> 
> If environment variables are not set, DistributedContextCreate could
> default to Spark and local. But all of the pipeline code should ideally
> exist outside any engine specific module.
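
For concreteness, a rough Scala sketch of what that factory idea might look
like. DistributedContextCreate is the pseudocode name from above; the backend
strings and the mahoutSparkContext call are illustrative assumptions, not a
settled API:

  import org.apache.mahout.math.drm.DistributedContext

  // Illustrative only: pick a backend from an environment variable and fall
  // back to Spark running locally when nothing is set.
  object DistributedContextCreate {
    def apply(backend: Option[String], masterUrl: Option[String]): DistributedContext =
      backend.getOrElse("spark") match {
        case "spark" =>
          // assumes the spark bindings expose a context factory roughly like this
          org.apache.mahout.sparkbindings.mahoutSparkContext(
            masterUrl.getOrElse("local"), "mahout-pipeline")
        case "h2o" =>
          sys.error("an h2o-specific context factory would plug in here")
        case other =>
          sys.error("unknown backend: " + other)
      }
  }

  // usage, mirroring the pseudocode above:
  //   val dc = DistributedContextCreate(sys.env.get("MAHOUT_BACKEND"),
  //     sys.env.get("BACKEND_ID"))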

The Readers and Writers rely on

var columns = mc.textFile(source).map { line => line.split(delimiter) }

This will not run unless the DistributedContext is actually implemented by 
SparkContext. 
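
Concretely, the reader only works after the neutral context is unwrapped back
to Spark. Roughly something like this, assuming the spark bindings wrap the
context in a SparkDistributedContext that exposes the underlying SparkContext
(treat the names as illustrative):

  import org.apache.spark.SparkContext
  import org.apache.mahout.math.drm.DistributedContext
  import org.apache.mahout.sparkbindings.SparkDistributedContext

  def readColumns(mc: DistributedContext, source: String, delimiter: String) = {
    // textFile() only exists on SparkContext, so the supposedly engine neutral
    // context has to be cast down to the Spark-specific implementation first
    val sc: SparkContext = mc.asInstanceOf[SparkDistributedContext].sc
    sc.textFile(source).map { line => line.split(delimiter) }
  }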

Running item similarity on the epinions dataset requires Spark executor memory
to be set to 5g in the SparkConf, so that has to be passed in to Spark. What is
the equivalent for h2o? Do I, as the implementor, have to figure out the
important tuning factors for every engine?
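
For reference, the Spark side of that tuning looks roughly like this (SparkConf
is Spark's own API; the app name is illustrative, and whatever the h2o
equivalent is would be an entirely different set of knobs):

  import org.apache.spark.SparkConf

  // Spark-specific tuning that has to travel with the job; 5g is what the
  // epinions run needed
  val conf = new SparkConf()
    .setAppName("item-similarity")
    .set("spark.executor.memory", "5g")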

I need a serializer for HashBiMap to be registered with Kryo or the Spark
version will not run. What are the analogous problems for h2o, and how much
time will it take me to figure them out?
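
Again for reference, the Spark-only setup step looks roughly like this. The
registrator class name is illustrative and falling back to Kryo's
JavaSerializer is just one way to register HashBiMap; the point is that some
registration is required or the job fails at runtime:

  import com.esotericsoftware.kryo.Kryo
  import com.esotericsoftware.kryo.serializers.JavaSerializer
  import com.google.common.collect.HashBiMap
  import org.apache.spark.serializer.KryoRegistrator

  class MahoutKryoRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo): Unit = {
      // without a serializer registered for HashBiMap the Spark job blows up
      kryo.register(classOf[HashBiMap[String, Int]], new JavaSerializer())
    }
  }

  // wired in through SparkConf, e.g.:
  //   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  //   .set("spark.kryo.registrator", "org.example.MahoutKryoRegistrator")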

Answers are moot. The fact that questions come up so often is the issue. It 
took me a fair amount of time to discover these tuning and setup
issues with only one engine.

The more duplicated code there is, the bigger this problem becomes, and the
greater the impedance mismatch between spark and h2o, the bigger it becomes
still. This directly affects how fast Mahout can move. If there were some
clear reason for taking this productivity hit, other than the idea that engine
independence sounds clean or good, it would be easier to accept. With so many
open questions, we are being asked to merge this into the mainstream?

I am tired of debating this, so I’ll just say that until the spark and h2o
modules are tiny and trivial, two engines will be a major productivity hit,
and until that “ideal” is met: -1 on merge.

If people want to work on making the spark and h2o modules small, increasing
engine independence, great. But ask yourself why. It seems like, if Anand has
a build that works on both, we should be able to run some non-trivial standard
data through them on identical clusters and compare speed.
