A bit more detail on what needs to happen here IMO: likely, the Hadoop-related things we still need for Spark etc., like VectorWritable, need to be factored out into a new module, mahout-hadoop or something like it. The important point is that we only want to depend on hadoop-common, which in theory is shared by both the new and old Hadoop MR APIs. We may find that we need hdfs there as well, e.g. perhaps for reading sequence file headers, not sure; but we definitely do not need anything mapreduce.
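To make the "hadoop-common only" point concrete, here is roughly what the factored-out serialization amounts to. This is a sketch, not the actual mahout-hadoop layout: the class name and the dense-only encoding are made up for brevity, and the real VectorWritable does quite a bit more. The point is that a Writable implementation touches only java.io and org.apache.hadoop.io types, and org.apache.hadoop.io.Writable lives in hadoop-common:

    // Illustrative sketch only; the real VectorWritable is richer.
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;      // from hadoop-common
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class VectorWritableSketch implements Writable {
      private Vector vector;

      @Override
      public void write(DataOutput out) throws IOException {
        // dense encoding only, for brevity
        out.writeInt(vector.size());
        for (int i = 0; i < vector.size(); i++) {
          out.writeDouble(vector.getQuick(i));
        }
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        int size = in.readInt();
        vector = new DenseVector(size);
        for (int i = 0; i < size; i++) {
          vector.setQuick(i, in.readDouble());
        }
      }
    }

Nothing in there references org.apache.hadoop.mapred or org.apache.hadoop.mapreduce.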
Math still cannot depend on that mahout-hadoop, since math must not depend on anything Hadoop; that has been the premise since the beginning. Mahout-math is in-core ops only, a lightweight, self-contained thing. More likely, the spark module (and maybe some others, if they use it) will have to depend on Hadoop serialization for vectors and matrices directly, i.e. on mahout-hadoop; a sketch of what that consumer side looks like is below Pat's mail. The mrlegacy stuff of course needs to be completely isolated (nothing else depends on it) and made dependent on mahout-hadoop as well.

On Fri, Dec 12, 2014 at 9:38 AM, Pat Ferrel <[email protected]> wrote:

>
> The next time someone wants to get into contributing to Mahout, wouldn’t
> it be nice to prune dependencies?
>
> For instance Spark depends on math-scala, which depends on math, at least
> ideally; in reality the dependencies include mr-legacy. If some things
> were refactored into math we might have a much more streamlined dependency
> tree. Some things in math can also be replaced with newer Scala libs and
> so could be moved out to a java-common or something that would not be
> required by the Scala code.
>
> If people are going to use the V1 version of Mahout it would be nice if
> the choice didn’t force them to drag along all the legacy code if it isn’t
> being used.
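And the consumer-side sketch mentioned above: roughly how the spark module (or anything else depending on mahout-hadoop) could pull vectors back out of a sequence file. The class name and the Text key type are assumptions for illustration. Note that SequenceFile.Reader itself lives in hadoop-common, but actually resolving an hdfs:// path needs the hdfs bits on the classpath at runtime, which is the "maybe we need hdfs as well" caveat from my first paragraph:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;  // also hadoop-common
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class ReadVectorsSketch {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);  // e.g. an hdfs:// or file:// URI
        // classic constructor; works on both old and new Hadoop APIs
        SequenceFile.Reader reader =
            new SequenceFile.Reader(FileSystem.get(path.toUri(), conf), path, conf);
        try {
          Text key = new Text();               // key type is an assumption
          VectorWritable value = new VectorWritable();
          while (reader.next(key, value)) {
            Vector v = value.get();  // a plain mahout-math Vector from here on
            System.out.println(key + " => " + v.size() + " elements");
          }
        } finally {
          reader.close();
        }
      }
    }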
