Spitting out a Hamake file or an Oozie workflow should be straightforward. As a first step I would standardize all of the arguments. Then, pick a list of N Writables as "first-class" sequence file types: if a job gets one of these, it should know what to do with it.
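
To make that concrete, here is a rough sketch of the shape I have in mind, tying together the fixed list of first-class types and the registerJob idea Grant raises below. None of this exists in Mahout today; FirstClassType, PhaseRegistry, registerPhase and a --listPhases flag are hypothetical names, only --startPhase/--endPhase and the Writable classes are real:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.VarLongWritable;
import org.apache.mahout.math.VectorWritable;

/** Hypothetical sketch: the "first-class" sequence file value types a job may declare. */
enum FirstClassType {
  TEXT(Text.class),                  // raw lines, e.g. CSV before conversion
  USER_VECTOR(VectorWritable.class), // rows of the user-item matrix, keyed by user id
  ITEM_VECTOR(VectorWritable.class), // rows of A', keyed by item id
  ID(VarLongWritable.class),         // bare user/item ids
  INDEX(IntWritable.class);          // row/column indices

  final Class<? extends Writable> writableClass;

  FirstClassType(Class<? extends Writable> writableClass) {
    this.writableClass = writableClass;
  }
}

/** Hypothetical sketch: a registry a driver fills in up front, so a --listPhases option could dump it. */
final class PhaseRegistry {

  static final class Phase {
    final String name;
    final FirstClassType input;
    final FirstClassType output;
    Phase(String name, FirstClassType input, FirstClassType output) {
      this.name = name;
      this.input = input;
      this.output = output;
    }
  }

  private final List<Phase> phases = new ArrayList<Phase>();

  void registerPhase(String name, FirstClassType input, FirstClassType output) {
    phases.add(new Phase(name, input, output));
  }

  /** Print one line per phase: the index you would pass to --startPhase, plus its input/output types. */
  void dump() {
    for (int i = 0; i < phases.size(); i++) {
      Phase p = phases.get(i);
      System.out.println(i + "\t" + p.name + "\t" + p.input + " -> " + p.output);
    }
  }
}

Given a registry like that, emitting a Hamake file or an Oozie workflow.xml is mostly a formatting exercise over the same table of phases, inputs and outputs.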
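On Sebastian's linear-algebra framing in the quoted thread below, my reading of it (a sketch of the relationships he describes, not of what the current jobs literally compute) is, with A the user-item matrix:

S = A^{\top} A                 % RowSimilarityJob / ItemSimilarityJob, up to the similarity measure and normalization
R = (A^{\top} A)\, A^{\top}    % RecommenderJob in this framing; column u holds the item scores for user u
M^{\top} \pi = \pi             % PageRank / RandomWalkWithRestart: eigenvector of the stochastified adjacency matrix M

Once the inputs are standardized sequence files of VectorWritables, each of these really is just a matrix multiplication or eigenvector job.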
On Thu, Sep 1, 2011 at 4:37 PM, Sebastian Schelter <[email protected]> wrote:
> A first step in the right direction might be better tooling for creating
> the appropriate input data for our algorithms.
>
> We should have a job that creates the user-item matrix for the
> recommendation stuff from CSV data, with support for sampling,
> normalization, etc. I already wrote something like this for myself. I also
> started work on something like this for creating adjacency matrices in the
> graph package.
>
> Ideally most of our algorithms should be distributed linear algebra
> operations on distributed matrices (where possible).
>
> For example, RowSimilarityJob is only a fancy way of computing A'A,
> ItemSimilarityJob is just a wrapper around that, and RecommenderJob adds
> another multiplication with A' on the right. In the graph mining package,
> PageRank and RandomWalkWithRestart are just eigenvector computations of
> the stochastified adjacency matrix.
>
> So I'd say we don't only need better job configuration but also a clearer
> separation between code that executes an algorithm and code that just
> converts data (wherever possible).
>
> --sebastian
>
>
> On 02.09.2011 00:34, Grant Ingersoll wrote:
>
>> On Sep 1, 2011, at 2:47 PM, Sean Owen wrote:
>>
>>> That's completely right. The use case is more for restarting a failed
>>> job rather than configuring the pipeline. You "really" want to do
>>> something different, like piece together your own job.
>>>
>> Yeah, this is the downside to our big monolithic drivers. Oozie or others
>> might be useful here.
>>
>>> This could be as complex as we want -- it could be its own project,
>>> defining a slightly-higher-level definition language for MR. In fact
>>> there are already one or two like that.
>>>
>> I was just thinking a registerJob to complement prepareJob might be
>> useful and simple, and hook into the AbstractJob/CLI params.
>>
>>> I like the idea... somehow I think you'll find it hard to implement
>>> across all the jobs since they're not even all in the same "format" at
>>> this point!
>>>
>> +1. Standardizing this stuff is important.
>>
>>
>>> On Thu, Sep 1, 2011 at 7:43 PM, Grant Ingersoll <[email protected]>
>>> wrote:
>>>
>>>> Other than opening the code and looking, is there a way we register
>>>> our phases such that one could, via the command line, know what they
>>>> are? For instance, I think, for now, I can skip, in my application,
>>>> the first two phases of the RecommenderJob, but it seems a bit awkward
>>>> to say --startPhase 2 given that at some point in a new release a new
>>>> phase could be added in and I would then have to go check the code.
>>>> Not the end of the world, but it seems error-prone and not readily
>>>> maintainable. I suppose as a bonus, it would be nice if one could also
>>>> know where each phase expects things to be and in what format. Would
>>>> it make sense to have the equivalent of prepareJob that does
>>>> registerJob up front and can then be dumped out, so that one could see
>>>> the phases and their inputs, etc.?
>>>>
>>>> -Grant
>>>>
>>>> --------------------------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com
>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>
>

--
Lance Norskog
[email protected]
