You are speaking of two issues in the same breath.

1. Duplication of test case code. This is already being addressed by
Dmitriy's https://github.com/apache/mahout/pull/28. With that change, all
the algo test code will co-reside with the algos in math-scala.

2. Driver code for various backends. As the code snippet below shows, you
are using Spark-specific calls (mc.textFile()) and bypassing the DSL. That
code is obviously Spark-specific and will not run on any other backend.
Are you asking how to reconcile continued use of Spark-specific
functionality in the driver code with supporting multiple backends? I see
only three options -

a) re-implement the driver to use only the DSL and avoid making
backend-specific calls in the driver (not sure that is possible; see the
sketch after this list.)

b) continue with Spark-specific calls in your driver and keep a per-backend
driver for each algo. This probably makes sense in a way, as not all algos
run best on all backends. Just having the core of the algo be
backend-independent is nice enough by itself, so don't sweat the full
pipeline not working everywhere (i.e., it need not).

c) abandon the pretense/goal that Mahout aims to be backend-independent and
admit/become Spark-specific.
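
To make (a) concrete, here is a minimal sketch of what a DSL-only driver
could look like, assuming a drmDfsRead-style helper in math-scala and an
implicit DistributedContext supplied from outside the driver (names and
exact signatures may differ from the DSL as it stands):

  import org.apache.mahout.math.drm._
  import org.apache.mahout.math.drm.RLikeDrmOps._

  // The driver touches only the DSL; no SparkContext appears anywhere.
  // Whatever backend implements the implicit DistributedContext does
  // the actual I/O and math.
  def run(inPath: String, outPath: String)
         (implicit dc: DistributedContext): Unit = {
    val drmA = drmDfsRead(inPath)    // engine-neutral DRM read
    val drmAtA = drmA.t %*% drmA     // engine-neutral DSL math
    drmAtA.dfsWrite(outPath)        // engine-neutral DRM write
  }

The catch is that such a reader works on serialized DRMs; free-form text
input, as in your snippet, is where the driver currently drops below the
DSL.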

The choice is not mine though.

On Fri, Jul 11, 2014 at 12:53 PM, Pat Ferrel <[email protected]> wrote:

> >
> > On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel <[email protected]>
> > wrote:
> >
> >> Duplicated from a comment on the PR:
> >>
> >> Beyond these details (specific merge issues) I have a bigger problem
> >> with merging this. Now every time the DSL is changed it may break
> >> things in h2o-specific code. Merging this would require every committer
> >> who might touch the DSL to sign up for fixing any broken tests on both
> >> engines.
> >>
> >> To solve this, the entire data prep pipeline must be virtualized to run
> >> on either engine, so that the tests for things like CF and
> >> ItemSimilarity or matrix factorization (and the multitude of others to
> >> come) pass and are engine-independent. As it stands, any DSL change
> >> that breaks the build will have to rely on a contributor for the fix.
> >> Even if one of you guys were made a committer, we would still have this
> >> problem of a needed change breaking one engine's specific code or the
> >> other's. Unless 99% of the entire pipeline is engine-neutral, the build
> >> will be unmaintainable.
> >>
> >> For instance, I am making a small DSL change that is required for
> >> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity
> >> and its tests, which are in the spark module, but since I’m working on
> >> that I can fix everything. If someone working on an h2o-specific thing
> >> had to change the DSL in a way that broke spark code like
> >> ItemSimilarity, you might not be able to fix it, and I certainly do not
> >> want to fix stuff in h2o-specific code when I change the DSL. I have a
> >> hard enough time keeping mine running :-)
> >>
> >
> > The way I interpret the above points, the problem you are trying to
> > highlight is with having multiple backends in general, and not with this
> > backend specifically? Hypothetically, even if this backend is abandoned
> > for the above "problems", as more backends get added in the future, the
> > same "problems" will continue to apply to all of them.
> >
>
> yes, exactly. Adding backends is only maintainable if backend-specific
> code (code in the spark module for now) is squeezed down to near zero. The
> more that is there, the more code will be duplicated in the h2o modules.
> Test breakage illustrates the problem; it does not convey its breadth or
> depth.
>
> >
> >> Crudely speaking, this means doing away with all references to a
> >> SparkContext and any use of it. So it's not just a matter of reproducing
> >> the spark module but of reducing the need for one: making it so small
> >> that breakage in one engine's code or the other's will be infrequent,
> >> and that changes to neutral code will only rarely break an engine the
> >> committer is unfamiliar with.
> >>
> >
> > I think things are already very close to this "ideal" situation you
> > describe above. As pipeline implementors we should just use
> > DistributedContext, and not SparkContext. And we need an engine-neutral
> > way to get hold of a DistributedContext from within the math-scala
> > module, like this pseudocode:
> >
> >  import org.apache.mahout.math.drm._
> >
> >   val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
> >                                     System.getenv("BACKEND_ID"), opts...)
> >
> > If environment variables are not set, DistributedContextCreate could
> > default to Spark and local. But all of the pipeline code should ideally
> > exist outside any engine-specific module.
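> >
> > Fleshed out a little (still pseudocode; DistributedContextCreate, the
> > environment variables, and the wiring are all hypothetical, and
> > mahoutSparkContext stands in for the existing spark bindings helper):
> >
> >   object DistributedContextCreate {
> >     def apply(backend: String, backendId: String): DistributedContext =
> >       backend match {
> >         case null | "spark" =>
> >           // Default: Spark, local. In practice the bindings would be
> >           // loaded reflectively so math-scala stays engine-free.
> >           mahoutSparkContext(
> >             if (backendId != null) backendId else "local",
> >             "mahout-pipeline")
> >         case other =>
> >           throw new IllegalArgumentException("unknown backend: " + other)
> >       }
> >   }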
>
> The Readers and Writers rely on
>
> var columns = mc.textFile(source).map { line => line.split(delimiter) }
>
> This will not run unless the DistributedContext is actually backed by a
> SparkContext.
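>
> To spell out the coupling (a sketch; it assumes the spark bindings wrap
> the SparkContext in something like a SparkDistributedContext, and the
> cast is what the implicit conversion amounts to):
>
>   // Only valid when the neutral context wraps a SparkContext; on any
>   // other backend this fails at runtime.
>   val sc: SparkContext = mc.asInstanceOf[SparkDistributedContext].sc
>   val columns = sc.textFile(source).map(_.split(delimiter))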
>
> Running item similarity on the epinions dataset requires Spark executor
> memory to be set to 5g in the SparkConf, so this has to be passed in to
> Spark. What is the equivalent for h2o? Do I, as the implementor, have to
> figure out important tuning factors for every engine?
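>
> For reference, the Spark side of that tuning is a one-line SparkConf
> setting (a sketch; 5g is just what epinions needed in my runs):
>
>   import org.apache.spark.SparkConf
>
>   val conf = new SparkConf()
>     .setAppName("item-similarity")
>     .set("spark.executor.memory", "5g")  // no engine-neutral equivalent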
>
> I need a serializer for HashBiMap to be registered with Kryo or the Spark
> version will not run. What analogous problems await in h2o? How much time
> will it take me to figure them out?
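>
> For the record, the Kryo side looks roughly like this (a sketch; the
> registrator class name is made up):
>
>   import com.esotericsoftware.kryo.Kryo
>   import com.esotericsoftware.kryo.serializers.JavaSerializer
>   import com.google.common.collect.HashBiMap
>   import org.apache.spark.serializer.KryoRegistrator
>
>   class ItemSimilarityRegistrator extends KryoRegistrator {
>     override def registerClasses(kryo: Kryo): Unit = {
>       // Kryo cannot construct a HashBiMap directly (no public no-arg
>       // constructor), so fall back to Java serialization for it.
>       kryo.register(classOf[HashBiMap[_, _]], new JavaSerializer())
>     }
>   }
>
> wired up through spark.kryo.registrator in the SparkConf. None of that
> means anything to h2o.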
>
> The answers are moot; the fact that such questions come up so often is the
> issue. It took me a fair amount of time to discover these tuning and setup
> issues with only one engine.
>
> The more code is duplicated, the bigger this problem is, and the greater
> the impedance mismatch between spark and h2o, the bigger it gets. This
> directly affects how fast Mahout is moving. If there were some clear
> reason for taking this productivity hit other than the idea that engine
> independence sounds clean or good, it would be easier to accept. There are
> still so many open questions, and we are being asked to merge this into
> the mainstream?
>
> I am tired of debating this, so I’ll just say that until the spark and h2o
> modules are tiny and trivial, two engines will be a major productivity
> hit; until that “ideal” is met, -1 on merge.
>
> If people want to work on making the spark and h2o modules small, thereby
> increasing engine independence, great. But ask yourself why. It seems that
> if Anand has a build that works on both, we should be able to run some
> non-trivial standard data through them on identical clusters and compare
> speed.
>