A first step in the right direction might be better tooling for
creating the appropriate input data for our algorithms.
We should have a job that creates the user-item-matrix for the
recommendation stuff from CSV data with support for sampling,
normalization, etc. I already wrote something like this for myself. I
also started work on something like this for creating adjacency matrices
in the graph package.
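
To make the idea of such a preparation job concrete, here is a minimal single-machine sketch in Python. It is not the job mentioned above and not Mahout code: the function name, the "user,item,rating" CSV layout, and the sample_rate/normalize parameters are assumptions made up for illustration only.

```python
import csv
import io
import random
from collections import defaultdict


def build_user_item_matrix(csv_text, sample_rate=1.0, normalize=False, seed=42):
    """Parse 'user,item,rating' lines into a sparse {user: {item: value}} matrix.

    Hypothetical helper: the CSV layout and all parameters are assumptions.
    """
    rng = random.Random(seed)
    matrix = defaultdict(dict)
    for fields in csv.reader(io.StringIO(csv_text)):
        if not fields:
            continue  # skip blank lines
        user, item, rating = fields
        # Down-sample interactions; sample_rate=1.0 keeps every line.
        if rng.random() >= sample_rate:
            continue
        matrix[user][item] = float(rating)
    if normalize:
        # Scale each user row to unit Euclidean length.
        for row in matrix.values():
            norm = sum(v * v for v in row.values()) ** 0.5
            if norm > 0:
                for item in row:
                    row[item] /= norm
    return dict(matrix)
```

A distributed version would presumably do the parsing and sampling in a map phase and the per-row normalization in a reduce phase, but that split is a guess rather than a description of existing code.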
Ideally most of our algorithms should be distributed linear algebra
operations on distributed matrices (where possible).
For example, RowSimilarityJob is only a fancy way of computing A'A,
ItemSimilarityJob is just a wrapper around that, and RecommenderJob adds
another multiplication with A' on the right. In the graph mining package,
PageRank and RandomWalkWithRestart are just eigenvector computations of
the stochastified adjacency matrix.
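
As a small in-memory illustration of the two observations above (a sketch only, not how the Mahout jobs are implemented): gram() computes A'A for a dense matrix, and pagerank() runs power iteration on the row-normalized adjacency matrix. The function names, the damping factor of 0.85, and the uniform treatment of dangling nodes are assumptions chosen for the example.

```python
def gram(a):
    """Compute A'A for a dense matrix given as a list of equal-length rows."""
    n_cols = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n_cols)]
            for i in range(n_cols)]


def pagerank(adj, damping=0.85, iterations=100):
    """Approximate the dominant eigenvector by power iteration.

    adj[i][j] is the weight of the edge i -> j. Each row is divided by its
    sum ("stochastified"); rows without out-links are treated as linking to
    every node uniformly (an assumption of this sketch).
    """
    n = len(adj)
    p = []
    for row in adj:
        degree = sum(row)
        p.append([x / degree for x in row] if degree else [1.0 / n] * n)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # One step: rank <- (1 - d)/n + d * (rank' P)
        rank = [(1 - damping) / n
                + damping * sum(rank[i] * p[i][j] for i in range(n))
                for j in range(n)]
    return rank
```

With users as rows and items as columns, entry (i, j) of gram(a) counts (or weights) the co-occurrence of items i and j, which is the raw material for an item-item similarity.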
So I'd say we need not only better job configuration but also a
clearer separation between code that executes an algorithm and code that
just converts data (wherever possible).
--sebastian
On 02.09.2011 00:34, Grant Ingersoll wrote:
On Sep 1, 2011, at 2:47 PM, Sean Owen wrote:
That's completely right. The use case is more for restarting a failed job
rather than configuring the pipeline. You "really" want to do something
different like piece together your own job.
Yeah, this is the downside to our big monolithic drivers. Oozie or others
might be useful here.
This could be as complex as we want -- it could be its own project, defining
a slightly-higher-level definition language for MR. In fact there are
already one or two like that.
I was just thinking a registerJob to complement prepareJob might be useful and
simple, and could hook into the AbstractJob/CLI params.
I like the idea... somehow I think you'll find it hard to implement across
all the jobs since they're not even all in the same "format" at this point!
+1. Standardizing this stuff is important.
On Thu, Sep 1, 2011 at 7:43 PM, Grant Ingersoll<[email protected]> wrote:
Other than opening the code and looking, is there a way we register our
phases such that one could, via the command line, know what they are? For
instance, I think, for now, I can skip, in my application, the first two
phases of the RecommenderJob, but it seems a bit awkward to say --startPhase
2 given that at some point in a new release a new phase could be added in
and I would then have to go check the code. Not the end of the world, but
it seems error-prone and not readily maintainable. I suppose as a bonus,
it would be nice if one could also know where each phase expects things to
be and in what format. Would it make sense to have the equivalent of
prepareJob that does registerJob up front and can then be dumped out so that
one could see the phases and their inputs, etc?
-Grant
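
One way the registerJob idea could look, sketched in Python purely for illustration: phases are registered up front under stable names together with their input, output and format, so they can be dumped and a restart can refer to a name rather than a number. Nothing here is an existing Mahout API; PhaseRegistry, register_job, describe, phases_from and the phase names in the usage note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    input_path: str
    output_path: str
    data_format: str


class PhaseRegistry:
    """Hypothetical registry of pipeline phases, kept in execution order."""

    def __init__(self):
        self._phases = []

    def register_job(self, name, input_path, output_path, data_format):
        # Names must be unique so they can serve as stable restart points.
        if any(p.name == name for p in self._phases):
            raise ValueError(f"phase already registered: {name}")
        self._phases.append(Phase(name, input_path, output_path, data_format))

    def describe(self):
        """Return one line per phase: index, name, input -> output [format]."""
        return "\n".join(
            f"{i}: {p.name}  {p.input_path} -> {p.output_path} [{p.data_format}]"
            for i, p in enumerate(self._phases)
        )

    def phases_from(self, start_name):
        """Return the phases to run when restarting at the named phase."""
        names = [p.name for p in self._phases]
        if start_name not in names:
            raise KeyError(f"unknown phase: {start_name}")
        return self._phases[names.index(start_name):]
```

Under these assumptions, a driver would call register_job once per phase before running anything, print describe() when asked to list its phases, and accept something like a start-phase name (for example a made-up "similarity" phase) that keeps working when a release inserts a new phase earlier in the pipeline.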
--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com