On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel <[email protected]> wrote:
> Duplicated from a comment on the PR:
>
> Beyond these details (specific merge issues) I have a bigger problem with
> merging this. Now every time the DSL is changed, it may break things in
> h2o-specific code. Merging this would require every committer who might
> touch the DSL to sign up for fixing any broken tests on both engines.
>
> To solve this, the entire data-prep pipeline must be virtualized to run on
> either engine, so that the tests for things like CF and ItemSimilarity or
> matrix factorization (and the multitude of others to come) pass and are
> engine-independent. As it stands, any DSL change that breaks the build will
> have to rely on a contributor's fix. Even if one of you guys were made a
> committer, we would still have this problem where a needed change breaks
> the engine-specific code of one engine or the other. Unless 99% of the
> entire pipeline is engine-neutral, the build will be unmaintainable.
>
> For instance, I am making a small DSL change that is required for
> cooccurrence and ItemSimilarity to work. This would break ItemSimilarity
> and its tests, which are in the spark module, but since I’m working on that
> I can fix everything. If someone working on an h2o-specific thing had to
> change the DSL in a way that broke Spark code like ItemSimilarity, you
> might not be able to fix it, and I certainly do not want to fix stuff in
> h2o-specific code when I change the DSL. I have a hard enough time keeping
> mine running :-)
>
The way I interpret the above points, the problem you are trying to
highlight is with having multiple backends in general, not with this backend
in particular? Hypothetically, even if this backend were abandoned because of
the above "problems", the same "problems" would continue to apply to every
backend added in the future.
> Crudely speaking, this means doing away with all references to a
> SparkContext and any use of it. So it's not just a matter of reproducing
> the spark module but of reducing the need for one: making it so small that
> breakages in either engine's code will be infrequent, and changes to
> neutral code will only rarely break an engine the committer is unfamiliar
> with.
>
I think things are already very close to the "ideal" situation you describe
above. As pipeline implementors we should just use DistributedContext, never
SparkContext. What we still need is an engine-neutral way to get hold of a
DistributedContext from within the math-scala module, along the lines of this
pseudocode:
  import org.apache.mahout.math.drm._

  val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
                                    System.getenv("BACKEND_ID"), opts...)
If the environment variables are not set, DistributedContextCreate could
default to Spark in local mode. But all of the pipeline code should ideally
live outside any engine-specific module.
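
To make the idea concrete, here is a rough sketch of what such a factory
could look like inside math-scala. This is only a sketch under assumptions:
the factory object, the engine class names, and the createContext method are
invented for illustration; DistributedContext is the only existing type used.

  import org.apache.mahout.math.drm.DistributedContext

  object DistributedContextCreate {

    // Illustrative backend-name -> factory-class mapping; these class names
    // are assumptions, not real Mahout classes.
    private val engines = Map(
      "spark" -> "org.apache.mahout.sparkbindings.SparkEngineFactory",
      "h2o"   -> "org.apache.mahout.h2obindings.H2OEngineFactory")

    def apply(backend: String = sys.env.getOrElse("MAHOUT_BACKEND", "spark"),
              backendId: String = sys.env.getOrElse("BACKEND_ID", "local"))
        : DistributedContext = {
      val className = engines.getOrElse(backend.toLowerCase,
        throw new IllegalArgumentException("Unknown backend: " + backend))
      // Each engine factory is assumed to expose createContext(backendId).
      // Loading it reflectively would keep math-scala free of compile-time
      // Spark/h2o dependencies.
      val factory = Class.forName(className).newInstance()
      factory.getClass.getMethod("createContext", classOf[String])
        .invoke(factory, backendId)
        .asInstanceOf[DistributedContext]
    }
  }

That way only a tiny engine-specific factory lives in each bindings module,
and everything else, including the tests, can stay engine-agnostic.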
> I raised this red flag a long time ago but in the heat of other issues it
> got lost. I don't think this can be ignored anymore.
>
The only missing piece, I think, is a DistributedContextCreate() call such as
the one above? I don't think things are in such a dire state, really. Am I
missing something?
> I would propose that we should remain two separate projects with a mostly
> shared DSL until the maintainability issues are resolved. This seems way
> too early to merge.
>
Call me an optimist, but I was hoping for more of a "let's work together now
to make the DSL abstractions easier for future contributors". I will explore
such a DistributedContextCreate() method in math-scala. That might also be
the answer for keeping the test cases in math-scala.
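
For example, a test living in math-scala could then be written roughly like
this (again only a sketch, building on the hypothetical factory above; the
drm and scalabindings calls are the existing DSL ones):

  import org.apache.mahout.math.scalabindings._
  import org.apache.mahout.math.drm._
  import RLikeDrmOps._

  // Backend chosen from the environment, so the same test runs on either engine.
  implicit val dc: DistributedContext = DistributedContextCreate()

  val a = drmParallelize(dense((1, 2), (3, 4)), numPartitions = 2)

  // The same assertion should hold whether dc is Spark- or h2o-backed.
  val ata = (a.t %*% a).collect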
Thanks