On Sat, Sep 13, 2014 at 12:18 PM, Andrew Palumbo <[email protected]> wrote:
>> All the discussion about engine neutral and specific bits is only going
>> to come up more and more. Dmitriy speaks for the neutrality of "math", by
>> which I take it to mean "math-scala" and stuff in the DSL. Maybe engine
>> neutral bits that don't fit in that can be put in another module, to save
>> fighting over it. I once proposed "core-scala". For that matter,
>> cooccurrence isn't really math or DSL (maybe that's what D means by quasi)
>> and so might be better put in core-scala too. Inclusion means the code
>> uses but does not extend the DSL, and the pom doesn't include an engine.
>
> I think that this makes sense if we want to put the engine neutral
> sections of our underlying classifier/clustering/recommender algorithms
> and maybe some I/O traits into a separate module, and keep the DSL and the
> more purely algebraic algos separate in mahout-math. Then we can just
> mirror the packages and extend them (and their test suites, as Dmitriy did
> with the math-scala tests) into the h2o/spark/flink modules. Then we can
> do the as-needed engine specific optimization and engine specific I/O
> there.

+1. That's what i meant by separating dependent and independent parts with
a Strategies pattern. Put independent strategies into an engine-agnostic
module (math-scala is fine, i suppose, in order not to multiply maven
artifacts).

>> Same as what is being done now with mahout-math.
>
> I would think that it would be important to have complete algorithms (with
> empty I/O traits?) in the engine-neutral packages, which is why i've been
> trying to implement the clunky extractLabelsAndAggregateObservations
> method in naive bayes for math-scala in an engine agnostic way.
>
> On Sep 12, 2014, at 6:44 PM, ap.dev <[email protected]> wrote:
>
>> Oh thx- I thought indexedDatasets were spark specific.
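As a minimal sketch of the Strategies split described above (all names here are hypothetical, not actual Mahout API): the engine-neutral module codes the algorithm against an abstract I/O trait, and each engine module supplies a concrete implementation.

```scala
// Hypothetical sketch of the Strategies split, not actual Mahout code:
// the engine-neutral module depends only on an abstract I/O trait, and
// each engine module (spark, h2o, flink) supplies its own implementation.

// Engine-neutral I/O trait (would live in math-scala / core-scala).
trait LabelSource {
  def readLabels(path: String): Seq[String]
}

// Engine-neutral algorithm: depends only on the trait, not on any engine.
class LabelStats(source: LabelSource) {
  def distinctLabelCount(path: String): Int =
    source.readLabels(path).distinct.size
}

// An engine-specific module would extend the trait; here an in-memory
// stand-in plays that role for illustration.
object InMemoryLabelSource extends LabelSource {
  def readLabels(path: String): Seq[String] = Seq("spam", "ham", "spam")
}

val stats = new LabelStats(InMemoryLabelSource)
println(stats.distinctLabelCount("ignored-path"))  // prints 2
```

The point of the split is that `LabelStats` never mentions Spark or H2O, so it can live in the engine-agnostic module and be exercised by engine-agnostic tests.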
> Sent from my Verizon Wireless 4G LTE smartphone
>
> -------- Original message --------
> From: Pat Ferrel <[email protected]>
> Date: 09/12/2014 7:52 PM (GMT-05:00)
> To: [email protected]
> Subject: Re: drmFromHDFS rowLabelBindings question
>
> The serialization can be in engine specific modules, as with cooccurrence
> and ItemSimilarity: cooccurrence is in math-scala, ItemSimilarity is the
> engine specific driver. There is nothing engine specific about
> IndexedDatasets, and an optimization that has not been made yet is to
> allow one or no dictionaries where the keys suffice.
>
> Not sure what you want for initial input, but you could start with a
> driver in the engine specific spark module, read in the IndexedDataset,
> then pass it to your math code, work with the CheckpointedDrm using the
> DSL and dictionary, and then, when done, return an IndexedDataset to the
> driver for serialization.
>
> There's also no reason that the serialization couldn't also be implemented
> in H2O; in fact I think it would be easier, since they have richer text
> file types than Spark.
>
> Anand's point about reducers is going to require either divergence or more
> engine neutral abstractions. I think serialization is in the same boat.
>
> On Sep 12, 2014, at 4:31 PM, Anand Avati <[email protected]> wrote:
>
>> On Fri, Sep 12, 2014 at 4:12 PM, Andrew Palumbo <[email protected]> wrote:
>>
>>> Thanks- I've been looking at that a bit. It probably would make things a
>>> whole lot easier, but I'm working on Naive Bayes and trying to keep it
>>> in the math-scala package (I don't know how well this is going to work,
>>> because I haven't made my way to model serialization yet).
>>>
>>> Thinking more of it, though, using an IndexedDataset might make online
>>> training/updating of the weights a whole lot easier if we end up
>>> implementing that.
>>> Also I think that an IndexedDataset will probably be useful for
>>> classifying new documents, where we do need to keep the dictionary in
>>> memory.
>>>
>>> Right now, I just need the labels up front in a vector so that I can
>>> extract the category and broadcast a categoryByRowindex Vector out to a
>>> combiner using something like:
>>>
>>>     IntKeyedTFIDFDrm.t.mapBlock(ncols = numcategories) {
>>>       // aggregate cols by category
>>>     }.t
>>>
>>> After that we only need a relatively small Vector or Map of rows
>>> (categories) and don't need column labels, as long as we're using
>>> seq2sparse. It may make sense, though, to use something like an
>>> IndexedDataset here in the future if we want to move away from
>>> seq2sparse in its current implementation.
>>>
>>> I'm honestly not sure how well this label extraction and aggregation is
>>> going to turn out performance-wise. But my thinking was that we can put
>>> an implementation in math-scala and then extend and optimize it in spark
>>> if we want, i.e. rather than writing a combiner using mapBlock, use
>>> Spark's reduceByKey.
>>
>> Note that there is no way (yet) to perform an aggregate or reduce-like
>> operation through the DSL. Though the backends (both Spark and H2O)
>> support reduce-like operations, there is no DSL operator for that yet. We
>> could either introduce a reduce/aggregate operator in as engine-neutral /
>> close-to-algebraic a way as possible, or keep any kind of
>> reduction/aggregation phase of the operation backend specific (which kind
>> of sucks).
>>
>> Thanks
>>
>>> Subject: Re: drmFromHDFS rowLabelBindings question
>>> From: [email protected]
>>> Date: Fri, 12 Sep 2014 14:41:35 -0700
>>> To: [email protected]
>>>
>>> Not sure if this helps, but we (Sebastian and I) created an
>>> IndexedDataset which maintains row and column HashBiMaps that use the
>>> Int key to map to/from Strings.
>>> There are Reader and Writer traits for file I/O (text files for now).
>>> The flow is to read an IndexedDataset using the Reader trait. Inside the
>>> IndexedDataset you have a CheckpointedDrm and two label BiMaps for rows
>>> and columns. This method is used in the row and item similarity jobs,
>>> where you do math things like B.t %*% A. After you do the math using the
>>> drm contained in the IndexedDataset, you assign the correct dictionaries
>>> to the resulting IndexedDataset to maintain your labels for writing or
>>> further math. It might make sense to implement some of the math ops that
>>> would work with this simple approach, but in any case you can do it
>>> explicitly as those jobs do. The idea was to support other file formats,
>>> like sequence files, as the need comes up.
>>>
>>> On Sep 12, 2014, at 1:14 PM, Andrew Palumbo <[email protected]> wrote:
>>>
>>> It doesn't look like it has anything to do with the conversion.
>>>
>>> After:
>>>
>>>     val rowBindings = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
>>>
>>> rowBindings.size is one.
>>>
>>> From: [email protected]
>>> To: [email protected]
>>> Subject: RE: drmFromHDFS rowLabelBindings question
>>> Date: Fri, 12 Sep 2014 15:53:48 -0400
>>>
>>> Thanks guys, I was wondering about the java.util.Map conversion too.
>>> I'll try copying everything into a java.util.HashMap and passing that to
>>> setRowBindings. I'll play around with it, and if I can't get it to work,
>>> I'll file a jira.
>>>
>>> I'm just using it in the NB implementation, so it's not a pressing
>>> issue.
>>>
>>> Appreciate it.
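The row-by-category aggregation Andrew describes (summing the rows that share a label, whether via a mapBlock combiner or, engine-specifically, Spark's reduceByKey) can be sketched in plain Scala. The data and names below are made up for illustration, not the Mahout API.

```scala
// Plain-Scala sketch of aggregating rows by category label, i.e. what a
// mapBlock combiner (or reduceByKey) would compute over a block of rows.
// Data and names are illustrative only.
val categoryByRow = Array(0, 1, 0)        // category index for each row
val rows = Array(
  Array(1.0, 2.0),
  Array(3.0, 4.0),
  Array(5.0, 6.0)
)

val numCategories = 2
val aggregated = Array.fill(numCategories, rows.head.length)(0.0)
for ((row, i) <- rows.zipWithIndex; j <- row.indices)
  aggregated(categoryByRow(i))(j) += row(j)

// aggregated(0) sums rows 0 and 2; aggregated(1) is just row 1
```

The output is a small numCategories x ncol matrix, which matches the point in the thread that after aggregation only a relatively small structure per category is needed, not the full row dictionary.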
>>>> Date: Fri, 12 Sep 2014 12:35:21 -0700
>>>> Subject: Re: drmFromHDFS rowLabelBindings question
>>>> From: [email protected]
>>>> To: [email protected]
>>>>
>>>> On Fri, Sep 12, 2014 at 12:17 PM, Anand Avati <[email protected]> wrote:
>>>>
>>>>> On Fri, Sep 12, 2014 at 12:00 PM, Anand Avati <[email protected]> wrote:
>>>>>
>>>>>> On Fri, Sep 12, 2014 at 11:57 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>
>>>>>>> But if you are really compelled that it is something that might be
>>>>>>> needed, the best way probably would indeed be to create an optional
>>>>>>> parameter to collect (something like
>>>>>>> drmLike.collect(extractLabels: Boolean = false)) which you can flip
>>>>>>> to true if needed, and the thing does toString on the keys and
>>>>>>> assigns them to the in-core matrix's row labels. (Requires a patch,
>>>>>>> of course.)
>>>>>>
>>>>>> As I mentioned in the other mail, this is already the case. The code
>>>>>> seems to assume .toMap internally does collect. My (somewhat wild)
>>>>>> suspicion is that this line is somehow fooling the eye:
>>>>>>
>>>>>>     val rowBindings = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
>>>>>
>>>>> Argh, for a moment I was thinking `d` is still an RDD. It is actually
>>>>> all in-core, as the entirety of the RDD is collected up front into
>>>>> `data`. In any case, I suspect the non-int key collecting code might
>>>>> be doing something funny.
>>>>
>>>> One problem I see is that toMap() returns a scala.collection.Map,
>>>> whereas the next line, m.setRowLabelBindings, accepts a java.util.Map.
>>>> Since the code compiles fine, there is probably an implicit conversion
>>>> happening somewhere, and I don't know if the conversion is doing the
>>>> right thing. Other than this, the rest of the code seems to look fine.
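A small sketch of the two suspects discussed in this thread, with made-up data: Scala's `toMap` silently keeps only the last value per duplicate key, so if the extracted key strings collide, the map collapses to size one (matching the "rowBindings.size is one" symptom); and the Scala-to-Java map conversion can be done explicitly rather than via an implicit. (`scala.jdk.CollectionConverters` is the modern name; in the 2014-era Scala of this thread the equivalent was `scala.collection.JavaConverters`.)

```scala
import scala.jdk.CollectionConverters._

// Made-up stand-in for the collected (key, index) data in the thread.
val d = Seq(((1, "rowA"), 0), ((1, "rowB"), 1), ((1, "rowC"), 2))

// If the key extraction picks a non-unique field, toMap silently keeps
// only the last value per duplicate key -- the map collapses to size 1.
val collapsed = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
// collapsed.size == 1

// Extracting a unique field keeps all bindings.
val rowBindings = d.map(t => (t._1._2, t._2: java.lang.Integer)).toMap
// rowBindings.size == 3

// Convert explicitly for an API that wants a java.util.Map, instead of
// relying on an implicit conversion being in scope:
val javaBindings: java.util.Map[String, java.lang.Integer] = rowBindings.asJava
```

Converting explicitly with `.asJava` (or copying into a `java.util.HashMap`, as Andrew suggests) makes the Scala-to-Java boundary visible, which is one way to rule the implicit conversion in or out as the culprit.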
