Why aren’t we using linalg.Vector and its siblings? The same could be asked of linalg.Matrix. If we want to prune dependencies, this would help, and it would also significantly increase interoperability.
Case-now: I have a real need to cluster items in a CF-type input matrix. The input matrix A’ has rows of items. I need to drop this into a sequence file and use Mahout’s Hadoop KMeans. Ugh. Or I need to convert A’ into an RDD of linalg.Vectors and use MLlib KMeans. The conversion is not too bad and maybe could be helped with some implicit conversions mahout.Vector <-> linalg.Vector (maybe mahout.DRM <-> linalg.Matrix, though not needed for KMeans); a rough sketch of such a conversion is at the end of this message.

Case-possible: If we adopted linalg.Vector as the native format, and perhaps even linalg.Matrix, this would give immediate interoperability in some areas, including my specific need. It would significantly pare down dependencies not provided by the environment (mahout-math). It would also support creating distributed computation methods that work on both MLlib and Mahout datasets, addressing Gokhan’s question.

I looked at another “Case-now” possibility, which was to go all-MLlib for item similarity. I found that MLlib doesn’t have a transpose (“transpose, why would you want to do that?”), not even in the multiply forms A’A, A’B, and AA’, all of which are used in item and row similarity (these forms are sketched in the DSL below). That stopped me from looking deeper.

The strength and unique value of Mahout is the completeness of its generalized linear algebra DSL. But insistence on using Mahout-specific data types is also a barrier to Spark people adopting the DSL. Not having lower-level interoperability is a barrier both ways to mixing Mahout and MLlib, creating unnecessary either/or choices for devs.

On Feb 5, 2015, at 1:32 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan <gkhn...@gmail.com> wrote:

> What I am saying is that for certain algorithms including both
> engine-specific (such as aggregation) and DSL stuff, what is the best way
> of handling them?
>
> i) should we add the distributed operations to the Mahout codebase as is
> proposed in #62?

IMO this can't go very well or very far (because of the engine specifics), but I'd be willing to see an experiment with simple things like map and reduce (see the mapBlock sketch below). The bigger questions are: where exactly will we have to stop (we can't abstract all capabilities out there because of "common denominator" issues), and what percentage of methods will it truly allow to migrate to full backend portability? And if, after doing all this, we still find ourselves writing engine-specific mixes, why bother? Wouldn't it be better to find a good, easy-to-replicate, incrementally developed pattern to register and apply engine-specific strategies for every method (one possible shape is sketched below)?

> ii) should we have [engine]-ml modules (like spark-bindings and
> h2o-bindings) where we can mix the DSL and engine-specific stuff?

This is not quite what I am proposing. Rather, engine-ml modules holding engine-specific _parts_ of algorithms. However, this really needs a POC on a guinea pig (similarly to how we POC'd algebra in the first place with ssvd and spca).
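To make the conversion point concrete, here is a minimal sketch of what the mahout.Vector <-> linalg.Vector implicits could look like. The object name VectorConversions is invented for illustration; the sketch assumes Mahout's org.apache.mahout.math.Vector and Spark's org.apache.spark.mllib.linalg.Vectors factory:

    import org.apache.mahout.math.{DenseVector, Vector => MahoutVector}
    import org.apache.spark.mllib.linalg.{Vectors, Vector => MLlibVector}
    import scala.collection.JavaConverters._

    // Hypothetical helpers; names and placement are illustrative only.
    object VectorConversions {

      // Mahout -> MLlib: copy the non-zeroes into an MLlib sparse vector.
      // MLlib expects indices in ascending order, so sort first.
      implicit def mahoutToMLlib(v: MahoutVector): MLlibVector = {
        val nz = v.nonZeroes().asScala.toSeq.sortBy(_.index())
        Vectors.sparse(v.size(), nz.map(_.index()).toArray, nz.map(_.get()).toArray)
      }

      // MLlib -> Mahout: densify for simplicity; a real version would
      // preserve sparsity with a RandomAccessSparseVector.
      implicit def mllibToMahout(v: MLlibVector): MahoutVector =
        new DenseVector(v.toArray)
    }

With these in scope, an RDD of Mahout rows could feed MLlib KMeans directly, e.g. rows.map(mahoutToMLlib), or implicitly.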
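For reference, the multiply forms mentioned above as they read in the Mahout DSL. A sketch only, assuming DRM handles drmA and drmB loaded elsewhere:

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // The transpose-multiply forms item and row similarity rely on,
    // written against hypothetical DRM handles drmA and drmB.
    def similarityForms(drmA: DrmLike[Int], drmB: DrmLike[Int]) = {
      val ata = drmA.t %*% drmA // A'A: item co-occurrence
      val atb = drmA.t %*% drmB // A'B: cross-co-occurrence with a second action
      val aat = drmA %*% drmA.t // AA': row similarity
      (ata, atb, aat)
    }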
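On the "experiment with simple things like map and reduce" point, a sketch of what the portable flavor already looks like today via the DSL's mapBlock operator. The scaleRows function is a made-up example, not an existing API:

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    // A backend-portable "map" over a DRM: scale every row of drmA,
    // whatever engine backs it.
    def scaleRows(drmA: DrmLike[Int], factor: Double): DrmLike[Int] =
      drmA.mapBlock(ncol = drmA.ncol) { case (keys, block) =>
        keys -> (block *= factor)
      }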
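And one purely hypothetical shape for the "register and apply engine-specific strategies" idea; every name and type here is invented for illustration:

    import org.apache.mahout.math.drm.DrmLike

    // Hypothetical registry: an algorithm written once against the DSL,
    // with an optional engine-specific override registered per backend.
    trait ClusteringStrategy {
      def cluster(data: DrmLike[Int], k: Int): DrmLike[Int]
    }

    object StrategyRegistry {
      private var strategies = Map.empty[String, ClusteringStrategy]

      def register(engine: String, strategy: ClusteringStrategy): Unit =
        strategies += engine -> strategy

      // Fall back to the portable, DSL-only implementation when no
      // engine-specific strategy is registered.
      def forEngine(engine: String, portable: ClusteringStrategy): ClusteringStrategy =
        strategies.getOrElse(engine, portable)
    }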