A perfect example of the common file formats problem: TopKStringPatterns.java. The FPGrowth jobs leave behind a SequenceFile of TopKStringPatterns, a multi-level data format, and nothing reads it.
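
To make the "1st class" idea from my earlier mail (quoted below) a bit more concrete, here is a rough sketch in plain Java, with no Hadoop or Mahout dependency. Everything in it is hypothetical: `FormatRegistry`, `register`, `isFirstClass`, and `describe` are names invented for illustration, and value classes are keyed by a plain string rather than by a real Writable class. The only point is that a tool handed a SequenceFile could look up the value class and either know how to print it or say clearly that it cannot.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: a lookup of "1st class" value types that every tool agrees to understand.
// None of these names exist in Mahout; real code would key on Writable classes, not strings.
final class FormatRegistry {
    // Maps a value class name to a function that renders one value as text.
    private final Map<String, Function<Object, String>> printers = new LinkedHashMap<>();

    // Declare a value class as "1st class" by supplying a way to print it.
    void register(String valueClassName, Function<Object, String> printer) {
        printers.put(valueClassName, printer);
    }

    // True if some tool has promised to understand this value class.
    boolean isFirstClass(String valueClassName) {
        return printers.containsKey(valueClassName);
    }

    // Render a value, or report plainly that nothing knows how to read it.
    String describe(String valueClassName, Object value) {
        Function<Object, String> printer = printers.get(valueClassName);
        if (printer == null) {
            return "unsupported value class: " + valueClassName;
        }
        return printer.apply(value);
    }

    public static void main(String[] args) {
        FormatRegistry registry = new FormatRegistry();
        // The class name is used here only as a label (assumption for illustration).
        registry.register("VectorWritable", v -> "vector " + v);

        System.out.println(registry.describe("VectorWritable", "[1.0, 2.0]"));
        // Nothing registered for this one, which is the situation described above.
        System.out.println(registry.describe("TopKStringPatterns", "..."));
    }
}
```

With something like this, an unreadable output format shows up as an explicit "unsupported" message instead of a file that nobody can open.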
On Fri, Sep 2, 2011 at 8:09 PM, Lance Norskog <[email protected]> wrote:

> Spitting out a Hamake file or Oozie file should be straightforward.
>
> As a first step I would standardize all of the arguments. And, pick a list
> of N Writables as "1st class" sequence files: if a job gets one of these, it
> should know what to do.
>
> On Thu, Sep 1, 2011 at 4:37 PM, Sebastian Schelter <[email protected]> wrote:
>
>> A first step in the right direction might be better tooling for creating
>> the appropriate input data for our algorithms.
>>
>> We should have a job that creates the user-item matrix for the
>> recommendation stuff from CSV data with support for sampling,
>> normalization, etc. I already wrote something like this for myself. I also
>> started work on something like this for creating adjacency matrices in the
>> graph package.
>>
>> Ideally most of our algorithms should be distributed linear algebra
>> operations on distributed matrices (where possible).
>>
>> For example RowSimilarityJob is only a fancy way of computing A'A,
>> ItemSimilarityJob is just a wrapper around that and RecommenderJob adds
>> another multiplication with A' on the right. In the graph mining package
>> PageRank and RandomWalkWithRestart are just eigenvector computations of the
>> stochastified adjacency matrix.
>>
>> So I'd say we don't only need better job configuration but also a clearer
>> separation between code that executes an algorithm and code that just
>> converts data (wherever possible).
>>
>> --sebastian
>>
>> On 02.09.2011 00:34, Grant Ingersoll wrote:
>>
>>> On Sep 1, 2011, at 2:47 PM, Sean Owen wrote:
>>>
>>>> That's completely right. The use case is more for restarting a failed
>>>> job rather than configuring the pipeline. You "really" want to do
>>>> something different like piece together your own job.
>>>
>>> Yeah, this is the downside to our big monolithic drivers. Oozie or
>>> others might be useful here.
>>>
>>>> This could be as complex as we want -- it could be its own project,
>>>> defining a slightly-higher-level definition language for MR. In fact
>>>> there are already one or two like that.
>>>
>>> I was just thinking a registerJob to complement prepareJob might be
>>> useful and simple and hook into the AbstractJob/CLI params.
>>>
>>>> I like the idea... somehow I think you'll find it hard to implement
>>>> across all the jobs since they're not even all in the same "format" at
>>>> this point!
>>>
>>> +1. Standardizing this stuff is important.
>>>
>>>> On Thu, Sep 1, 2011 at 7:43 PM, Grant Ingersoll <[email protected]> wrote:
>>>>
>>>>> Other than opening the code and looking, is there a way we register our
>>>>> phases such that one could, via the command line, know what they are?
>>>>> For instance, I think, for now, I can skip, in my application, the
>>>>> first two phases of the RecommenderJob, but it seems a bit awkward to
>>>>> say --startPhase 2 given that at some point in a new release a new
>>>>> phase could be added in and I would then have to go check the code.
>>>>> Not the end of the world, but it seems error-prone and not readily
>>>>> maintainable. I suppose as a bonus, it would be nice if one could also
>>>>> know where each phase expects things to be and in what format. Would
>>>>> it make sense to have the equivalent of prepareJob that does
>>>>> registerJob up front and can then be dumped out so that one could see
>>>>> the phases and their inputs, etc?
>>>>>
>>>>> -Grant
>>>>>
>>>>> --------------------------------------------
>>>>> Grant Ingersoll
>>>>> http://www.lucidimagination.com
>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com

--
Lance Norskog
[email protected]

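For what it's worth, Grant's registerJob idea quoted above might look roughly like the sketch below. This is plain Java with no Mahout dependency, and all of it is assumed for illustration: the `PhaseRegistry` class, its method signatures, the phase names, and the paths are made up, and nothing here reflects how AbstractJob or RecommenderJob actually work. It only shows the shape of the idea: phases are declared by name up front, can be dumped from the command line, and a start point is given by name rather than by a number that shifts when a phase is added.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "registerJob to complement prepareJob". Not Mahout API.
final class PhaseRegistry {
    // One declared phase: a stable name plus where it reads, where it writes, and in what format.
    static final class Phase {
        final String name;
        final String input;
        final String output;
        final String format;

        Phase(String name, String input, String output, String format) {
            this.name = name;
            this.input = input;
            this.output = output;
            this.format = format;
        }
    }

    private final List<Phase> phases = new ArrayList<>();

    // Declare a phase up front, in execution order. Names must be unique.
    void registerJob(String name, String input, String output, String format) {
        if (indexOf(name) >= 0) {
            throw new IllegalArgumentException("duplicate phase: " + name);
        }
        phases.add(new Phase(name, input, output, format));
    }

    // Position of a phase by name, or -1 if no such phase was registered.
    int indexOf(String name) {
        for (int i = 0; i < phases.size(); i++) {
            if (phases.get(i).name.equals(name)) {
                return i;
            }
        }
        return -1;
    }

    // Names of the phases that would run when starting at the named phase.
    List<String> phasesFrom(String startPhase) {
        int start = indexOf(startPhase);
        if (start < 0) {
            throw new IllegalArgumentException("unknown phase: " + startPhase);
        }
        List<String> names = new ArrayList<>();
        for (int i = start; i < phases.size(); i++) {
            names.add(phases.get(i).name);
        }
        return names;
    }

    // Human-readable listing, one line per phase, suitable for a command-line dump.
    String dump() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < phases.size(); i++) {
            Phase p = phases.get(i);
            sb.append(String.format("%d %s: %s -> %s (%s)%n", i, p.name, p.input, p.output, p.format));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PhaseRegistry registry = new PhaseRegistry();
        // Phase names and paths below are invented for this example.
        registry.registerJob("buildIndex", "input", "tmp/index", "SequenceFile");
        registry.registerJob("buildVectors", "input", "tmp/vectors", "SequenceFile");
        registry.registerJob("similarity", "tmp/vectors", "tmp/similarity", "SequenceFile");
        registry.registerJob("recommend", "tmp/similarity", "output", "text");

        System.out.print(registry.dump());
        // Start by name instead of "--startPhase 2".
        System.out.println(registry.phasesFrom("similarity"));
    }
}
```

Starting by name means that adding a new phase in a later release does not silently change what an existing command line skips, which is the maintenance worry raised in the original question.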