On Sat, Sep 13, 2014 at 12:18 PM, Andrew Palumbo <[email protected]> wrote:
>> All the discussion about engine neutral and specific bits is only going
>> to come up more and more. Dmitriy speaks for the neutrality of "math", by
>> which I take it to mean "math-scala" and stuff in the DSL. Maybe engine
>> neutral bits that don't fit in that can be put in another module, to save
>> fighting over it. I once proposed "core-scala". For that matter,
>> cooccurrence isn't really math or DSL (maybe that's what D means by quasi)
>> and so might be better put in core-scala too. Inclusion means the code
>> uses but does not extend the DSL, and the pom doesn't include an engine.
>
> I think that this makes sense if we want to put the engine neutral
> sections of our underlying classifier/clustering/recommender algorithms
> and maybe some I/O traits into a separate module, and keep the DSL and the
> more purely algebraic algos separate in mahout-math. Then we can just
> mirror the packages and extend them (and their test suites, as Dmitriy did
> with the math-scala tests) into the h2o/spark/flink modules. Then we can
> do the as-needed engine specific optimization and engine specific I/O
> there.

+1. That's what i meant by separating dependent and independent parts with
a Strategies pattern. Put independent strategies into an engine-agnostic
module (math-scala is fine, i suppose, in order not to multiply maven
artifacts).

>> Same as what is being done now with mahout-math.
>
> I would think that it would be important to have complete algorithms (with
> empty I/O traits?) in the engine-neutral packages, which is why i've been
> trying to implement the clunky extractLabelsAndAggregateObservations
> method in naive bayes for math-scala in an engine agnostic way.
>
> On Sep 12, 2014, at 6:44 PM, ap.dev <[email protected]> wrote:
>
>> Oh thx- I thought indexedDatasets were spark specific.
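As a minimal sketch of the Strategies split described above (all names here are hypothetical, not actual Mahout API): the engine-neutral module codes the algorithm against an abstract I/O trait, and each engine module supplies a concrete implementation.

```scala
// Hypothetical sketch of the Strategies split, not actual Mahout code:
// the engine-neutral module depends only on an abstract I/O trait, and
// each engine module (spark, h2o, flink) supplies its own implementation.

// Engine-neutral I/O trait (would live in math-scala / core-scala).
trait LabelSource {
  def readLabels(path: String): Seq[String]
}

// Engine-neutral algorithm: depends only on the trait, not on any engine.
class LabelStats(source: LabelSource) {
  def distinctLabelCount(path: String): Int =
    source.readLabels(path).distinct.size
}

// An engine-specific module would extend the trait; here an in-memory
// stand-in plays that role for illustration.
object InMemoryLabelSource extends LabelSource {
  def readLabels(path: String): Seq[String] = Seq("spam", "ham", "spam")
}

val stats = new LabelStats(InMemoryLabelSource)
println(stats.distinctLabelCount("ignored-path"))  // prints 2
```

The point of the split is that `LabelStats` never mentions Spark or H2O, so it can live in the engine-agnostic module and be exercised by engine-agnostic tests.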
> Sent from my Verizon Wireless 4G LTE smartphone
>
> -------- Original message --------
> From: Pat Ferrel <[email protected]>
> Date: 09/12/2014 7:52 PM (GMT-05:00)
> To: [email protected]
> Subject: Re: drmFromHDFS rowLabelBindings question
>
> The serialization can be in engine specific modules, as with cooccurrence
> and ItemSimilarity: cooccurrence is in math-scala, ItemSimilarity is the
> engine specific driver. There is nothing engine specific about
> IndexedDatasets, and an optimization that has not been made yet is to
> allow one or no dictionaries where the keys suffice.
>
> Not sure what you want for initial input, but you could start with a
> driver in the engine specific spark module, read in the IndexedDataset,
> then pass it to your math code, work with the CheckpointedDrm using the
> DSL and dictionary, and then, when done, return an IndexedDataset to the
> driver for serialization.
>
> There's also no reason that the serialization couldn't also be implemented
> in H2O; in fact I think it would be easier, since they have richer text
> file types than Spark.
>
> Anand's point about reducers is going to require either divergence or more
> engine neutral abstractions. I think serialization is in the same boat.
>
> On Sep 12, 2014, at 4:31 PM, Anand Avati <[email protected]> wrote:
>
>> On Fri, Sep 12, 2014 at 4:12 PM, Andrew Palumbo <[email protected]> wrote:
>>
>>> Thanks- I've been looking at that a bit. It probably would make things a
>>> whole lot easier, but I'm working on Naive Bayes and trying to keep it
>>> in the math-scala package (I don't know how well this is going to work,
>>> because I haven't made my way to model serialization yet).
>>>
>>> Thinking more of it, though, using an IndexedDataset might make online
>>> training/updating of the weights a whole lot easier if we end up
>>> implementing that.
>>> Also I think that an IndexedDataset will probably be useful for
>>> classifying new documents, where we do need to keep the dictionary in
>>> memory.
>>>
>>> Right now, I just need the labels up front in a vector so that I can
>>> extract the category and broadcast a categoryByRowindex Vector out to a
>>> combiner using something like:
>>>
>>>     IntKeyedTFIDFDrm.t.mapBlock(ncols = numcategories) {
>>>       // aggregate cols by category
>>>     }.t
>>>
>>> After that we only need a relatively small Vector or Map of rows
>>> (categories) and don't need column labels, as long as we're using
>>> seq2sparse. It may make sense, though, to use something like an
>>> IndexedDataset here in the future if we want to move away from
>>> seq2sparse in its current implementation.
>>>
>>> I'm honestly not sure how well this label extraction and aggregation is
>>> going to turn out performance-wise. But my thinking was that we can put
>>> an implementation in math-scala and then extend and optimize it in spark
>>> if we want, i.e. rather than writing a combiner using mapBlock, use
>>> Spark's reduceByKey.
>>
>> Note that there is no way (yet) to perform an aggregate or reduce-like
>> operation through the DSL. Though the backends (both Spark and H2O)
>> support reduce-like operations, there is no DSL operator for that yet. We
>> could either introduce a reduce/aggregate operator in as engine-neutral /
>> close-to-algebraic a way as possible, or keep any kind of
>> reduction/aggregation phase of the operation backend specific (which kind
>> of sucks).
>>
>> Thanks
>>
>>> Subject: Re: drmFromHDFS rowLabelBindings question
>>> From: [email protected]
>>> Date: Fri, 12 Sep 2014 14:41:35 -0700
>>> To: [email protected]
>>>
>>> Not sure if this helps, but we (Sebastian and I) created an
>>> IndexedDataset which maintains row and column HashBiMaps that use the
>>> Int key to map to/from Strings.
>>> There are Reader and Writer traits for file I/O (text files for now).
>>> The flow is to read an IndexedDataset using the Reader trait. Inside the
>>> IndexedDataset you have a CheckpointedDrm and two label BiMaps for rows
>>> and columns. This method is used in the row and item similarity jobs,
>>> where you do math things like B.t %*% A. After you do the math using the
>>> drm contained in the IndexedDataset, you assign the correct dictionaries
>>> to the resulting IndexedDataset to maintain your labels for writing or
>>> further math. It might make sense to implement some of the math ops that
>>> would work with this simple approach, but in any case you can do it
>>> explicitly as those jobs do. The idea was to support other file formats,
>>> like sequence files, as the need comes up.
>>>
>>> On Sep 12, 2014, at 1:14 PM, Andrew Palumbo <[email protected]> wrote:
>>>
>>> It doesn't look like it has anything to do with the conversion.
>>>
>>> After:
>>>
>>>     val rowBindings = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
>>>
>>> rowBindings.size is one.
>>>
>>> From: [email protected]
>>> To: [email protected]
>>> Subject: RE: drmFromHDFS rowLabelBindings question
>>> Date: Fri, 12 Sep 2014 15:53:48 -0400
>>>
>>> Thanks guys, I was wondering about the java.util.Map conversion too.
>>> I'll try copying everything into a java.util.HashMap and passing that to
>>> setRowBindings. I'll play around with it, and if I can't get it to work,
>>> I'll file a jira.
>>>
>>> I'm just using it in the NB implementation, so it's not a pressing
>>> issue.
>>>
>>> Appreciate it.
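The row-by-category aggregation Andrew describes (summing the rows that share a label, whether via a mapBlock combiner or, engine-specifically, Spark's reduceByKey) can be sketched in plain Scala. The data and names below are made up for illustration, not the Mahout API.

```scala
// Plain-Scala sketch of aggregating rows by category label, i.e. what a
// mapBlock combiner (or reduceByKey) would compute over a block of rows.
// Data and names are illustrative only.
val categoryByRow = Array(0, 1, 0)        // category index for each row
val rows = Array(
  Array(1.0, 2.0),
  Array(3.0, 4.0),
  Array(5.0, 6.0)
)

val numCategories = 2
val aggregated = Array.fill(numCategories, rows.head.length)(0.0)
for ((row, i) <- rows.zipWithIndex; j <- row.indices)
  aggregated(categoryByRow(i))(j) += row(j)

// aggregated(0) sums rows 0 and 2; aggregated(1) is just row 1
```

The output is a small numCategories x ncol matrix, which matches the point in the thread that after aggregation only a relatively small structure per category is needed, not the full row dictionary.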
>>>> Date: Fri, 12 Sep 2014 12:35:21 -0700
>>>> Subject: Re: drmFromHDFS rowLabelBindings question
>>>> From: [email protected]
>>>> To: [email protected]
>>>>
>>>> On Fri, Sep 12, 2014 at 12:17 PM, Anand Avati <[email protected]> wrote:
>>>>
>>>>> On Fri, Sep 12, 2014 at 12:00 PM, Anand Avati <[email protected]> wrote:
>>>>>
>>>>>> On Fri, Sep 12, 2014 at 11:57 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>
>>>>>>> But if you are really compelled that it is something that might be
>>>>>>> needed, the best way probably would indeed be to create an optional
>>>>>>> parameter to collect (something like
>>>>>>> drmLike.collect(extractLabels: Boolean = false)) which you can flip
>>>>>>> to true if needed, and the thing does toString on the keys and
>>>>>>> assigns them to the in-core matrix's row labels. (Requires a patch,
>>>>>>> of course.)
>>>>>>
>>>>>> As I mentioned in the other mail, this is already the case. The code
>>>>>> seems to assume .toMap internally does collect. My (somewhat wild)
>>>>>> suspicion is that this line is somehow fooling the eye:
>>>>>>
>>>>>>     val rowBindings = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
>>>>>
>>>>> Argh, for a moment I was thinking `d` is still an RDD. It is actually
>>>>> all in-core, as the entirety of the RDD is collected up front into
>>>>> `data`. In any case, I suspect the non-int key collecting code might
>>>>> be doing something funny.
>>>>
>>>> One problem I see is that toMap() returns a scala.collection.Map,
>>>> whereas the next line, m.setRowLabelBindings, accepts a java.util.Map.
>>>> Since the code compiles fine, there is probably an implicit conversion
>>>> happening somewhere, and I don't know if the conversion is doing the
>>>> right thing. Other than this, the rest of the code seems to look fine.
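A small sketch of the two suspects discussed in this thread, with made-up data: Scala's `toMap` silently keeps only the last value per duplicate key, so if the extracted key strings collide, the map collapses to size one (matching the "rowBindings.size is one" symptom); and the Scala-to-Java map conversion can be done explicitly rather than via an implicit. (`scala.jdk.CollectionConverters` is the modern name; in the 2014-era Scala of this thread the equivalent was `scala.collection.JavaConverters`.)

```scala
import scala.jdk.CollectionConverters._

// Made-up stand-in for the collected (key, index) data in the thread.
val d = Seq(((1, "rowA"), 0), ((1, "rowB"), 1), ((1, "rowC"), 2))

// If the key extraction picks a non-unique field, toMap silently keeps
// only the last value per duplicate key -- the map collapses to size 1.
val collapsed = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
// collapsed.size == 1

// Extracting a unique field keeps all bindings.
val rowBindings = d.map(t => (t._1._2, t._2: java.lang.Integer)).toMap
// rowBindings.size == 3

// Convert explicitly for an API that wants a java.util.Map, instead of
// relying on an implicit conversion being in scope:
val javaBindings: java.util.Map[String, java.lang.Integer] = rowBindings.asJava
```

Converting explicitly with `.asJava` (or copying into a `java.util.HashMap`, as Andrew suggests) makes the Scala-to-Java boundary visible, which is one way to rule the implicit conversion in or out as the culprit.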
