A bit more detail on what needs to happen here IMO:

Likely, the Hadoop-related things we still need for Spark etc., like
VectorWritable, need to be factored out into a (new) module, mahout-hadoop
or something. The important notion here is that we only want to depend on
hadoop-common, which in theory should be common to both the new and the
old Hadoop MR APIs. We may find that we need hadoop-hdfs there as well,
e.g. perhaps for reading sequence file headers, not sure; but we
definitely do not need anything from mapreduce.
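
For illustration, a minimal sketch (in Scala; peekHeader is just a
hypothetical helper name) of reading a sequence file header with nothing
but hadoop-common types; hadoop-hdfs would only be needed at runtime to
resolve an hdfs:// path:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.SequenceFile

    // Configuration, Path, FileSystem and SequenceFile all live in
    // hadoop-common; nothing from the mapreduce artifacts is touched.
    def peekHeader(pathStr: String): (String, String) = {
      val conf = new Configuration()
      val path = new Path(pathStr)
      val fs = path.getFileSystem(conf)
      val reader = new SequenceFile.Reader(fs, path, conf)
      try (reader.getKeyClassName, reader.getValueClassName)
      finally reader.close()
    }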

Math still cannot depend on that mahout-hadoop, since math must not depend
on anything Hadoop; that has been the premise since the beginning.
Mahout-math is in-core ops only, a lightweight, self-contained thing.
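
For instance, something like this runs on a bare JVM with only the
mahout-math jar on the classpath, no Hadoop anywhere (a trivial sketch,
just to make the point):

    import org.apache.mahout.math.{DenseVector, Vector}

    // Pure in-core operations; no filesystem, no Hadoop API.
    val v: Vector = new DenseVector(Array(1.0, 2.0, 3.0))
    val w: Vector = v.plus(v).times(0.5)
    println(w.zSum()) // 6.0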

More likely, the spark module (and maybe some others, if they use that
serialization) will have to depend on the Hadoop serialization for vectors
and matrices directly, i.e. on mahout-hadoop.
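
Roughly like the following sketch, assuming VectorWritable keeps its
org.apache.mahout.math package after the move (readVectors is just a
hypothetical helper, not an existing API):

    import org.apache.hadoop.io.Text
    import org.apache.mahout.math.{Vector, VectorWritable}
    import org.apache.spark.SparkContext

    // Load (key, vector) pairs written by the MR-era jobs; only
    // hadoop-common types plus the writables from mahout-hadoop are
    // needed here, nothing from the mapreduce packages.
    def readVectors(sc: SparkContext, path: String) =
      sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])
        // Hadoop reuses Writable instances, so clone before keeping.
        .map { case (k, vw) => (k.toString, vw.get().clone()) }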

The mrlegacy stuff of course needs to be completely isolated (nobody else
depends on it) and made dependent on mahout-hadoop as well.
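
If I got all of that right, the resulting module picture would be
something like this (arrows point at dependencies):

    mahout-math                          (no Hadoop at all)
    mahout-math-scala -> mahout-math
    mahout-hadoop     -> hadoop-common   (maybe hadoop-hdfs)
    mahout-spark      -> mahout-math-scala, mahout-hadoop
    mahout-mrlegacy   -> mahout-hadoop   (nothing depends on it)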

On Fri, Dec 12, 2014 at 9:38 AM, Pat Ferrel <[email protected]> wrote:
>
> The next time someone wants to get into contributing to Mahout, wouldn’t
> it be nice to prune dependencies?
>
> For instance, Spark depends on math-scala, which depends on math, at
> least ideally; in reality the dependencies include mr-legacy. If some
> things were refactored into math we might have a much more streamlined
> dependency tree. Some things in math can also be replaced with newer
> Scala libs and so could be moved out to a java-common or something that
> would not be required by the Scala code.
>
> If people are going to use the V1 version of Mahout, it would be nice if
> the choice didn’t force them to drag along all the legacy code that
> isn’t being used.
