> All the discussion about engine neutral and specific bits is only going to
> come up more and more. Dmitriy speaks for the neutrality of “math” by which I
> take it to mean “math-scala” and stuff in the DSL. Maybe engine neutral bits
> that don’t fit in that can be put in another module to save fighting over it.
> I once proposed “core-scala”. For that matter cooccurrence isn’t really math
> or DSL (maybe that’s what D means by quasi) and so might be better put in
> core-scala too. Inclusion means the code uses but does not extend the DSL and
> the pom doesn’t include an engine dependency.
>
I think that this makes sense if we want to put the engine neutral sections of
our underlying classifier/clustering/recommender algorithms and maybe some I/O
traits into a separate module and keep the DSL and the more purely algebraic
algos separate in mahout-math. Then we can just mirror the packages and extend
them (and their test suites as Dmitriy did with the math-scala tests) into
h2o/spark/flink modules. Then we can do the engine-specific optimization and
engine-specific I/O there as needed.
Same as what is being done now with mahout-math.
I would think that it would be important to have complete algorithms (with
empty I/O traits?) in the engine-neutral packages, which is why I've been trying
to implement the clunky extractLabelsAndAggregateObservations method in naive
bayes for math-scala in an engine-agnostic way.
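To make the layering concrete, here is a rough plain-Scala sketch of the shape I have in mind (every name here is made up for illustration, not proposed API):

```scala
// Stand-in for a distributed matrix handle (in Mahout this would be a DrmLike[K]).
case class Drm(rows: Array[Array[Double]])

// "Empty" I/O trait: engine modules (spark/h2o/flink) supply the real reader.
trait ModelIO {
  def readDrm(path: String): Drm
}

// Engine-neutral algorithm: uses only the abstract handle, never an engine type.
abstract class NBTrainer extends ModelIO {
  def train(path: String): Int = {
    val drm = readDrm(path)
    drm.rows.length  // placeholder for the real math
  }
}

// An engine module would extend it with concrete, optimized I/O;
// here an in-core stub stands in for a Spark/H2O reader.
object InCoreTrainer extends NBTrainer {
  def readDrm(path: String): Drm = Drm(Array(Array(1.0, 2.0), Array(3.0, 4.0)))
}
```

The point is just that the algorithm body compiles without any engine on the classpath, and the engine modules only fill in the I/O.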
> On Sep 12, 2014, at 6:44 PM, ap.dev <[email protected]> wrote:
>
> Oh thx- I thought indexedDatasets were spark specific.
>
>
> -------- Original message --------
> From: Pat Ferrel <[email protected]>
> Date: 09/12/2014 7:52 PM (GMT-05:00)
> To: [email protected]
> Subject: Re: drmFromHDFS rowLabelBindings question
> The serialization can be in engine-specific modules, as with cooccurrence and
> ItemSimilarity: cooccurrence is in math-scala, ItemSimilarity is the
> engine-specific driver. There is nothing engine specific about IndexedDatasets,
> and an optimization not yet made is to allow one or no dictionaries where
> the keys suffice.
>
> Not sure what you want for initial input but you could start with a driver in
> the engine specific spark module, read in the IndexedDataset then pass it to
> your math code, work with the CheckpointedDrm using the DSL and dictionary
> then when done return an IndexedDataset to the driver for serialization.
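That flow, sketched with plain-Scala stand-ins (this toy IndexedDataset is not the Mahout class, just the shape of the hand-off):

```scala
// Toy stand-in for Mahout's IndexedDataset: a matrix plus two dictionaries.
case class IndexedDataset(
    matrix: Array[Array[Double]],  // stand-in for the CheckpointedDrm
    rowIDs: Map[String, Int],      // row dictionary
    colIDs: Map[String, Int])      // column dictionary

// 1. The engine-specific driver reads the IndexedDataset (stubbed here).
def read(path: String): IndexedDataset =
  IndexedDataset(Array(Array(1.0, 0.0), Array(5.0, 2.0)),
    Map("userA" -> 0, "userB" -> 1), Map("item1" -> 0, "item2" -> 1))

// 2. Engine-neutral math works on the matrix alone (here: a transpose).
def mathStep(m: Array[Array[Double]]): Array[Array[Double]] = m.transpose

// 3. Reattach the right dictionaries to the result before serialization.
val in  = read("some/input/path")
val out = in.copy(matrix = mathStep(in.matrix),
                  rowIDs = in.colIDs, colIDs = in.rowIDs)
```

The dictionaries ride along untouched while the math happens on the bare matrix, which is the whole trick.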
>
> There’s also no reason that the serialization couldn’t also be implemented in
> H2O; in fact I think it would be easier since they have richer text file
> types than Spark.
>
> Anand’s point about reducers is going to require either divergence or more
> engine neutral abstractions. I think serialization is in the same boat.
>
> On Sep 12, 2014, at 4:31 PM, Anand Avati <[email protected]> wrote:
>
> On Fri, Sep 12, 2014 at 4:12 PM, Andrew Palumbo <[email protected]> wrote:
>
> >
> >
> >
> > Thanks- I've been looking at that a bit. It probably would make things
> > a whole lot easier but I'm working on Naive Bayes, and trying to keep
> > it in the math-scala package (I don't know how well this is going to
> > work because I haven't made my way to model serialization yet).
> >
> > Thinking
> > about it more, though, using an IndexedDataset might make online
> > training/updating of the weights a whole lot easier if we end up
> > implementing that.
> >
> > Also I think that an IndexedDataset will
> > probably be useful for classifying new documents where we do need to
> > keep the dictionary in memory.
> >
> > Right now, I just need the
> > labels up front in a vector so that I can extract the category and
> > broadcast a categoryByRowIndex Vector out to a combiner using something
> > like:
> >
> > IntKeyedTFIDFDrm.t.mapBlock(ncol = numCategories) {
> >   // aggregate columns by category
> > }.t
> >
> > After
> > that we only need a relatively small Vector or Map of rows (categories)
> > and don't need column labels as long as we're using seq2sparse. It may
> > make sense though to use something like an IndexedDataset here in the
> > future if we want to move away from seq2sparse in its current
> > implementation.
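The aggregation itself, shown in-core with plain Scala collections (a sketch of the intent, not the DSL code): sum all document rows that share a category label.

```scala
// Per-row category labels and the corresponding TF-IDF row vectors.
val labels = Array("sport", "news", "sport")
val tfidf  = Array(Array(1.0, 2.0), Array(3.0, 0.0), Array(0.0, 4.0))

// Group rows by label and sum them element-wise, one vector per category.
val aggregated: Map[String, Array[Double]] =
  labels.zip(tfidf).groupBy(_._1).map { case (cat, rows) =>
    cat -> rows.map(_._2).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  }
```

A combiner via mapBlock (or reduceByKey on Spark) would do the same thing block-wise on the distributed matrix.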
> >
> > I'm honestly not sure how well this label
> > extraction and aggregation is going to turn out performance-wise, but
> > my thinking was that we can put an implementation in math-scala and then
> > extend and optimize it in Spark if we want, i.e. rather than writing a
> > combiner using mapBlock, use Spark's reduceByKey.
> >
>
> Note that there is no way (yet) to perform an aggregate or reduce-like
> operation through the DSL. Though the backends (both Spark and H2O) support
> reduce-like operations, there is no DSL operator for that yet. We could
> either introduce a reduce/aggregate operator in as engine-neutral and
> close-to-algebraic a way as possible, or keep any reduction/aggregate phase
> of an operation backend specific (which kind of sucks).
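One hypothetical shape for such an operator, in plain Scala (reduceBlocks is invented here for illustration and is not a Mahout operator): each engine hands over its in-core blocks, and the DSL folds them with a user-supplied associative function.

```scala
// Hypothetical engine-neutral reduce: fold per-partition in-core blocks
// with a user function; only the block hand-off is engine specific.
def reduceBlocks(blocks: Seq[Array[Double]])
                (op: (Array[Double], Array[Double]) => Array[Double]): Array[Double] =
  blocks.reduce(op)

// e.g. element-wise sums of per-partition partial results:
val partials = Seq(Array(1.0, 2.0), Array(3.0, 4.0))
val sums = reduceBlocks(partials)((a, b) => a.zip(b).map { case (x, y) => x + y })
```

Each backend would implement only the "collect the blocks and fold" part, so the operator stays close to algebraic from the user's point of view.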
>
> Thanks
>
>
>
> >> Subject: Re: drmFromHDFS rowLabelBindings question
> >> From: [email protected]
> >> Date: Fri, 12 Sep 2014 14:41:35 -0700
> >> To: [email protected]
> >>
> >> Not sure if this helps, but we (Sebastian and I) created an
> > IndexedDataset which maintains row and column HashBiMaps that use the Int
> > key to map to/from Strings. There are Reader and Writer traits for file I/O
> > (text files for now). The flow is to read an IndexedDataset using the
> > Reader trait. Inside the IndexedDataset you have a CheckpointedDrm and two
> > label BiMaps for rows and columns. This method is used in the row and item
> > similarity jobs where you do math things like B.t %*% A. After you do the
> > math using the drm contained in the IndexedDataset, you assign the correct
> > dictionaries to the resulting IndexedDataset to maintain your labels for
> > writing or further math. It might make sense to implement some of the math
> > ops that would work with this simple approach, but in any case you can do it
> > explicitly as those jobs do. The idea was to support other file formats
> > like sequence files as the need comes up.
> >>
> >> On Sep 12, 2014, at 1:14 PM, Andrew Palumbo <[email protected]> wrote:
> >>
> >> It doesn't look like it has anything to do with the conversion.
> >>
> >> after:
> >>
> >> val rowBindings = d.map(t => (t._1._1.toString, t._2:
> > java.lang.Integer)).toMap
> >>
> >> rowBindings.size is one
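For what it's worth, a size of one is exactly what .toMap produces when every pair ends up with the same key, since later entries silently overwrite earlier ones:

```scala
// If t._1._1.toString came out identical for every row, .toMap would
// collapse the whole collection down to a single entry (last value wins).
val pairs = Seq(("sameKey", 1), ("sameKey", 2), ("sameKey", 3))
val m = pairs.toMap
// m has a single entry, holding the last value
```

So checking whether the stringified keys are actually distinct would be the first thing I'd look at.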
> >>
> >> From: [email protected]
> >> To: [email protected]
> >> Subject: RE: drmFromHDFS rowLabelBindings question
> >> Date: Fri, 12 Sep 2014 15:53:48 -0400
> >>
> >>
> >>
> >>
> >> Thanks guys, I was wondering about the java.util.Map conversion too.
> > I'll try copying everything into a java.util.HashMap and passing that to
> > setRowLabelBindings. I'll play around with it and if I can't get it to
> > work, I'll file a JIRA.
> >>
> >> I'm just using it in the NB implementation, so it's not a pressing issue.
> >>
> >> Appreciate it.
> >>
> >>> Date: Fri, 12 Sep 2014 12:35:21 -0700
> >>> Subject: Re: drmFromHDFS rowLabelBindings question
> >>> From: [email protected]
> >>> To: [email protected]
> >>>
> >>> On Fri, Sep 12, 2014 at 12:17 PM, Anand Avati <[email protected]>
> > wrote:
> >>>
> >>>>
> >>>>
> >>>> On Fri, Sep 12, 2014 at 12:00 PM, Anand Avati <[email protected]>
> > wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>> On Fri, Sep 12, 2014 at 11:57 AM, Dmitriy Lyubimov <
> > [email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> But if you are really convinced that it is something that might be
> >>>>>> needed, the best way would probably indeed be to create an optional
> >>>>>> parameter to collect (something like
> >>>>>> drmLike.collect(extractLabels: Boolean = false)) which you can flip
> >>>>>> to true if needed, and the thing does toString on keys and assigns
> >>>>>> them to the in-core matrix's row labels. (Requires a patch of
> >>>>>> course.)
> >>>>>>
> >>>>>>
> >>>>> As I mentioned in the other mail, this is already the case. The code
> >>>>> seems to assume .toMap internally does collect. My (somewhat wild)
> >>>>> suspicion is that this line is somehow fooling the eye:
> >>>>>
> >>>>> val rowBindings = d.map(t => (t._1._1.toString, t._2:
> > java.lang.Integer)).toMap
> >>>>>
> >>>>>
> >>>>>
> >>>> Argh, for a moment I was thinking `d` is still an RDD. It is actually
> >>>> all in-core, as the entirety of the RDD is collected up front into
> >>>> `data`. In any case I suspect the non-Int key collecting code might be
> >>>> doing something funny.
> >>>>
> >>>
> >>> One problem I see is that toMap() returns a scala.collection.Map,
> >>> whereas the next line, m.setRowLabelBindings, accepts a java.util.Map.
> >>> Since the code compiles fine there is probably an implicit conversion
> >>> happening somewhere, and I don't know if the conversion is doing the
> >>> right thing. Other than this, the rest of the code seems to look fine.
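One way to sidestep the implicit conversion entirely is the explicit copy Andrew mentioned: build a java.util.HashMap by hand before passing it along (bindings below are made up for the sketch):

```scala
// Example Scala-side bindings (values as java.lang.Integer, as the Java API wants).
val rowBindings: Map[String, java.lang.Integer] =
  Map("row0" -> Integer.valueOf(0), "row1" -> Integer.valueOf(1))

// Copy into a plain java.util.HashMap: no implicit conversion in the picture,
// and the result can be handed straight to setRowLabelBindings.
val javaBindings = new java.util.HashMap[String, java.lang.Integer]()
rowBindings.foreach { case (k, v) => javaBindings.put(k, v) }
```

If the size is still wrong after this, the problem is upstream in how the keys are produced, not in the Scala-to-Java map conversion.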
> >>
> >>
> >
> >
> >
>
>