Hadoop dependencies are a quagmire. It would be far preferable to rewrite the necessary serialization to avoid Hadoop dependencies entirely.
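To make that concrete, here is a minimal sketch of what Hadoop-free serialization could look like for a dense vector. The class name `PlainVectorCodec` and the byte layout (an int length followed by that many doubles) are assumptions made up for illustration. This is not the format VectorWritable writes, and it ignores sparse vectors and matrices entirely.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Mahout: a dense-vector codec that depends
// only on java.io. The layout (int length, then that many doubles) is an
// illustrative assumption and is NOT VectorWritable's wire format.
final class PlainVectorCodec {
    private PlainVectorCodec() {}

    // Write the element count, then each element as an 8-byte double.
    static void write(double[] values, DataOutput out) throws IOException {
        out.writeInt(values.length);
        for (double v : values) {
            out.writeDouble(v);
        }
    }

    // Read back exactly what write() produced.
    static double[] read(DataInput in) throws IOException {
        int n = in.readInt();
        if (n < 0) {
            throw new IOException("negative vector length: " + n);
        }
        double[] values = new double[n];
        for (int i = 0; i < n; i++) {
            values[i] = in.readDouble();
        }
        return values;
    }

    // Convenience wrapper: serialize to an in-memory byte array.
    static byte[] toBytes(double[] values) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            write(values, new DataOutputStream(buffer));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Convenience wrapper: deserialize from an in-memory byte array.
    static double[] fromBytes(byte[] bytes) {
        try {
            return read(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Since `DataInput` and `DataOutput` are plain `java.io` interfaces, and Hadoop's `Writable` is defined in terms of the same two interfaces, a thin adapter living in a separate module could delegate to something like this without math ever seeing a Hadoop jar.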
If we are dropping the MR code, why do we need to reference the VectorWritable class at all? Even in the worst case, we could simply recode the binary layer from scratch without the heinous dependencies.

On Fri, Dec 12, 2014 at 10:06 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> A bit more detail on what needs to happen here IMO:
>
> Likely, hadoop-related things we still need for spark etc. like
> VectorWritable need to be factored out into a (new) module mahout-hadoop or
> something. Important notion here is that we only want to depend on
> hadoop-commons, which in theory should be common for both new and old
> hadoop MR apis. We may face the fact that we need hdfs as well there; e.g.
> perhaps for reading sequence file headers, not sure; but we definitely do
> not need anything mapreduce.
>
> Math still cannot depend on that mahout-hadoop since math must not depend
> on anything hadoop, that was the premise since like the beginning.
> Mahout-math is in-core ops only, lightweight, self-contained thing.
>
> More likely, spark module (and maybe some others if they use that) will
> have to depend on hadoop serialization for vectors and matrices directly,
> i.e. on mahout-hadoop.
>
> mrlegacy stuff of course needs to be completely isolated (nobody else
> depends on it) and made dependent on mahout-hadoop as well.
>
> On Fri, Dec 12, 2014 at 9:38 AM, Pat Ferrel <[email protected]> wrote:
> >
> > The next time someone wants to get into contributing to Mahout, wouldn’t
> > it be nice to prune dependencies?
> >
> > For instance Spark depends on math-scala, which depends on math, at least
> > ideally, but in reality dependencies include mr-legacy. If some things
> > were refactored into math we might have a much more streamlined
> > dependency tree. Some things in Math also can be replaced with newer
> > Scala libs and so could be moved out to a java-common or something that
> > would not be required by the Scala code.
> >
> > If people are going to use the V1 version of Mahout it would be nice if
> > the choice didn’t force them to drag along all the legacy code if it
> > isn’t being used.
