A perfect example of the common file formats problem:
TopKStringPatterns.java. The FPGrowth jobs leave a SequenceFile of
TopKStringPatterns, a multi-level data format. Nothing reads it.

On Fri, Sep 2, 2011 at 8:09 PM, Lance Norskog <[email protected]> wrote:

> Spitting out a Hamake file or Oozie file should be straightforward.
>
> As a first step I would standardize all of the arguments, and pick a list
> of N Writables as "1st class" sequence files: if a job gets one of these, it
> should know what to do.
>
>
> On Thu, Sep 1, 2011 at 4:37 PM, Sebastian Schelter <[email protected]> wrote:
>
>> A first step in the right direction might be better tooling for creating
>> the appropriate input data for our algorithms.
>>
>> We should have a job that creates the user-item-matrix for the
>> recommendation stuff from CSV data with support for sampling, normalization,
>> etc. I already wrote something like this for myself. I also started work on
>> something like this for creating adjacency matrices in the graph package.
>>
>> Ideally most of our algorithms should be distributed linear algebra
>> operations on distributed matrices (where possible).
>>
>> For example RowSimilarityJob is only a fancy way of computing A'A,
>> ItemSimilarityJob is just a wrapper around that and RecommenderJob adds
>> another multiplication with A' on the right. In the graph mining package
>> PageRank and RandomWalkWithRestart are just eigenvector computations of the
>> stochastified adjacency matrix.
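
To make the linear algebra view above concrete, here is a minimal in-memory sketch of the two products mentioned: A'A, and then one more multiplication with A' on the right. It assumes A is a tiny dense user x item matrix with made-up values. The class name CooccurrenceSketch and its helpers are hypothetical and are not Mahout code; the real jobs run as distributed MapReduce passes and apply similarity measures that this toy leaves out.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of Mahout.
final class CooccurrenceSketch {

    // Returns the transpose of a dense, rectangular matrix.
    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[0].length; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    // Plain triple-loop product of an (n x k) and a (k x p) matrix.
    static double[][] multiply(double[][] a, double[][] b) {
        double[][] c = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int k = 0; k < b.length; k++) {
                for (int j = 0; j < b[0].length; j++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        // Toy data (made up): 2 users x 3 items.
        double[][] a = {{1, 1, 0}, {0, 1, 1}};
        double[][] at = transpose(a);
        double[][] similarity = multiply(at, a);      // A'A, items x items
        double[][] scores = multiply(similarity, at); // (A'A)A', items x users
        System.out.println(Arrays.deepToString(similarity));
        System.out.println(Arrays.deepToString(scores));
    }
}
```

For this toy input, A'A is [[1, 1, 0], [1, 2, 1], [0, 1, 1]] and (A'A)A' is [[2, 1], [3, 3], [1, 2]], with one column of item scores per user.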
>>
>> So I'd say we need not only better job configuration but also a clearer
>> separation between code that executes an algorithm and code that just
>> converts data (wherever possible).
>>
>> --sebastian
>>
>>
>> On 02.09.2011 00:34, Grant Ingersoll wrote:
>>
>>> On Sep 1, 2011, at 2:47 PM, Sean Owen wrote:
>>>
>>>> That's completely right. The use case is more for restarting a failed
>>>> job rather than configuring the pipeline. You "really" want to do
>>>> something different like piece together your own job.
>>>>
>>> yeah, this is the downside to our big monolithic drivers.  Oozie or
>>> others might be useful here.
>>>
>>>> This could be as complex as we want -- it could be its own project,
>>>> defining a slightly-higher-level definition language for MR. In fact
>>>> there are already one or two like that.
>>>>
>>> I was just thinking a registerJob to complement prepareJob might be
>>> useful and simple, and could hook into the AbstractJob/CLI params.
>>>
>>>> I like the idea... somehow I think you'll find it hard to implement
>>>> across all the jobs since they're not even all in the same "format" at
>>>> this point!
>>>>
>>> +1.  Standardizing this stuff is important.
>>>
>>>
>>>> On Thu, Sep 1, 2011 at 7:43 PM, Grant Ingersoll <[email protected]> wrote:
>>>>
>>>>> Other than opening the code and looking, is there a way we register our
>>>>> phases such that one could, via the command line, know what they are?
>>>>> For instance, I think, for now, I can skip, in my application, the first
>>>>> two phases of the RecommenderJob, but it seems a bit awkward to say
>>>>> --startPhase 2 given that at some point in a new release a new phase
>>>>> could be added in and I would then have to go check the code.  Not the
>>>>> end of the world, but it seems error prone and not readily maintainable.
>>>>> I suppose as a bonus, it would be nice if one could also know where each
>>>>> phase expects things to be and in what format.  Would it make sense to
>>>>> have the equivalent of prepareJob that does registerJob up front and can
>>>>> then be dumped out so that one could see the phases and their inputs,
>>>>> etc?
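
One way to picture the registerJob idea is a small registry that records each phase up front under a stable name, together with where it expects its input and in what format, so the list can be dumped and a start phase chosen by name instead of by number. The sketch below is only an illustration under that assumption. PhaseRegistry, its methods, the phase names and the paths are all hypothetical, and none of this is existing Mahout API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; nothing like this is claimed to exist in Mahout.
final class PhaseRegistry {

    // One registered phase: a stable name plus its expected input location and format.
    static final class Phase {
        final String name;
        final String inputPath;
        final String inputFormat;

        Phase(String name, String inputPath, String inputFormat) {
            this.name = name;
            this.inputPath = inputPath;
            this.inputFormat = inputFormat;
        }
    }

    private final List<Phase> phases = new ArrayList<Phase>();

    // Records a phase in execution order; returns this so calls can be chained.
    PhaseRegistry register(String name, String inputPath, String inputFormat) {
        phases.add(new Phase(name, inputPath, inputFormat));
        return this;
    }

    // Resolves a phase name to its current position, so callers need not hard-code a number.
    int indexOf(String name) {
        for (int i = 0; i < phases.size(); i++) {
            if (phases.get(i).name.equals(name)) {
                return i;
            }
        }
        throw new IllegalArgumentException("Unknown phase: " + name);
    }

    // Dumps one tab-separated line per phase: index, name, input path, input format.
    String describe() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < phases.size(); i++) {
            Phase p = phases.get(i);
            sb.append(i).append('\t').append(p.name).append('\t')
              .append(p.inputPath).append('\t').append(p.inputFormat).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Phase names and paths below are invented for the example.
        PhaseRegistry registry = new PhaseRegistry()
            .register("prepare", "input/ratings.csv", "text")
            .register("similarity", "temp/matrix", "SequenceFile")
            .register("recommend", "temp/similarity", "SequenceFile");
        System.out.print(registry.describe());
        System.out.println("start at phase " + registry.indexOf("similarity"));
    }
}
```

Looking a phase up by name means a phase added in a later release shifts the numbers without breaking the caller, which is the maintainability concern raised above.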
>>>>>
>>>>> -Grant
>>>>>
>>>>> --------------------------------------------
>>>>> Grant Ingersoll
>>>>> http://www.lucidimagination.com
>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>
>>>>>
>>>
>>>
>>>
>>
>
>
> --
> Lance Norskog
> [email protected]
>
>


-- 
Lance Norskog
[email protected]
