A first step in the right direction might be better tooling for creating the appropriate input data for our algorithms.

We should have a job that creates the user-item matrix for the recommenders from CSV data, with support for sampling, normalization, etc. I already wrote something like this for myself, and I have also started work on something similar for creating adjacency matrices in the graph package.
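
To make the idea concrete, here is a minimal single-machine sketch of such a conversion step. It is not the job mentioned above: the function name, the `user,item,value` column layout, the `sample_rate` parameter, and mean-centering as the normalization are all assumptions chosen for illustration.

```python
import csv
import io
import random
from collections import defaultdict


def build_user_item_matrix(csv_text, sample_rate=1.0, normalize=False, seed=42):
    """Parse 'user,item,value' rows into a sparse {user: {item: value}} matrix.

    Hypothetical helper: the column layout and both options are assumptions.
    """
    rng = random.Random(seed)
    matrix = defaultdict(dict)
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        user, item, value = row[0], row[1], float(row[2])
        # Sampling: keep each interaction with probability sample_rate.
        if rng.random() >= sample_rate:
            continue
        matrix[user][item] = value
    if normalize:
        # Mean-center each user's row so rating scales become comparable.
        for items in matrix.values():
            mean = sum(items.values()) / len(items)
            for item in items:
                items[item] -= mean
    return dict(matrix)


if __name__ == "__main__":
    data = "u1,i1,4\nu1,i2,2\nu2,i1,5\n"
    print(build_user_item_matrix(data, normalize=True))
```

The point of the sketch is only that data conversion can live in one small, reusable step whose output any of the algorithms could consume.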

Ideally most of our algorithms should be distributed linear algebra operations on distributed matrices (where possible).

For example, RowSimilarityJob is only a fancy way of computing A'A, ItemSimilarityJob is just a wrapper around that, and RecommenderJob adds another multiplication with A' on the right. In the graph mining package, PageRank and RandomWalkWithRestart are just eigenvector computations on the stochastified adjacency matrix.
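
A small in-memory sketch of these identities, using plain Python lists instead of distributed matrices. The function names are made up for illustration, plain co-occurrence counts stand in for whatever similarity measure the real jobs apply, and the damping factor in the PageRank part is an assumption (the usual teleport term), not something stated above.

```python
def transpose(m):
    """Return the transpose of a dense matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Dense matrix product a * b."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def item_similarity(a):
    """Item-item co-occurrence A'A for a users x items matrix A."""
    return matmul(transpose(a), a)


def recommendation_scores(a):
    """(A'A) A': entry [i][u] is the score of item i for user u."""
    return matmul(item_similarity(a), transpose(a))


def pagerank(adjacency, damping=0.85, iterations=100):
    """Power iteration on the row-normalized adjacency matrix."""
    n = len(adjacency)
    transition = []
    for row in adjacency:
        total = sum(row)
        # Rows without out-links jump uniformly to every node.
        transition.append([v / total for v in row] if total else [1.0 / n] * n)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [
            (1 - damping) / n
            + damping * sum(rank[j] * transition[j][i] for j in range(n))
            for i in range(n)
        ]
    return rank


if __name__ == "__main__":
    a = [[1, 0, 1], [1, 1, 0]]  # 2 users x 3 items
    print(item_similarity(a))
    print(recommendation_scores(a))
    print(pagerank([[0, 1], [1, 0]]))
```

In a distributed setting only `matmul` and the matrix-vector product inside `pagerank` would need a distributed implementation; the rest is composition.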

So I'd say we need not only better job configuration but also a clearer separation between code that executes an algorithm and code that just converts data (wherever possible).

--sebastian

On 02.09.2011 00:34, Grant Ingersoll wrote:
On Sep 1, 2011, at 2:47 PM, Sean Owen wrote:

That's completely right. The use case is more for restarting a failed job
rather than configuring the pipeline. You "really" want to do something
different like piece together your own job.

Yeah, this is the downside to our big monolithic drivers. Oozie or others
might be useful here.

This could be as complex as we want -- it could be its own project, defining
a slightly-higher-level definition language for MR. In fact there are
already one or two like that.

I was just thinking a registerJob to complement prepareJob might be useful and
simple, and could hook into the AbstractJob/CLI params.

I like the idea... somehow I think you'll find it hard to implement across
all the jobs since they're not even all in the same "format" at this point!

+1. Standardizing this stuff is important.


On Thu, Sep 1, 2011 at 7:43 PM, Grant Ingersoll <[email protected]> wrote:

Other than opening the code and looking, is there a way we register our
phases such that one could, via the command line, know what they are?  For
instance, I think, for now, I can skip, in my application, the first two
phases of the RecommenderJob, but it seems a bit awkward to say --startPhase
2 given that at some point in a new release a new phase could be added in
and I would then have to go check the code. Not the end of the world, but
it seems error-prone and not readily maintainable. I suppose as a bonus,
it would be nice if one could also know where each phase expects things to
be and in what format. Would it make sense to have the equivalent of
prepareJob that does registerJob up front and can then be dumped out so that
one could see the phases and their inputs, etc.?
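
One way to picture the registerJob idea is the sketch below. It is purely illustrative and not Mahout's AbstractJob API: `Phase`, `PhaseRegistry`, the field names, and the example phase names and paths are all hypothetical. Phases are registered up front with their inputs and formats, can be dumped before anything runs, and are started by name rather than by index, so adding a phase in a later release does not shift what --startPhase means.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Phase:
    """One step of a multi-phase job (all fields are hypothetical)."""
    name: str
    input_path: str
    input_format: str
    output_path: str
    run: Callable[[], None]


class PhaseRegistry:
    """Collects phases up front so they can be listed before anything runs."""

    def __init__(self) -> None:
        self._phases: List[Phase] = []

    def register(self, phase: Phase) -> None:
        if any(p.name == phase.name for p in self._phases):
            raise ValueError(f"duplicate phase name: {phase.name}")
        self._phases.append(phase)

    def describe(self) -> str:
        """Dump every phase with its position, input, format, and output."""
        return "\n".join(
            f"{i}: {p.name}  in={p.input_path} ({p.input_format})  out={p.output_path}"
            for i, p in enumerate(self._phases)
        )

    def run(self, start_phase: Optional[str] = None) -> List[str]:
        """Run all phases, or only those from start_phase onward."""
        names = [p.name for p in self._phases]
        # names.index raises ValueError for an unknown phase name.
        start = 0 if start_phase is None else names.index(start_phase)
        executed = []
        for p in self._phases[start:]:
            p.run()
            executed.append(p.name)
        return executed


if __name__ == "__main__":
    registry = PhaseRegistry()
    # Phase names and paths below are invented for the example.
    registry.register(Phase("prepare", "input.csv", "text", "tmp/prepared", lambda: None))
    registry.register(Phase("similarity", "tmp/prepared", "vectors", "tmp/sim", lambda: None))
    registry.register(Phase("recommend", "tmp/sim", "vectors", "output", lambda: None))
    print(registry.describe())
    print(registry.run(start_phase="similarity"))
```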

-Grant

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com




