One question based on this discussion, is there anything we can provide on top of spark ddf that would be useful in working within mahout DSL, maybe what we really need to do is to build a thin layer with mahout nice-ties that links in spark ddf and nicely serves as a translation layer between the mahout data types and data types within a spark-ddf. Would love to hear if this has any merits.
> Date: Sat, 13 Sep 2014 09:57:17 -0700 > Subject: Re: drmFromHDFS rowLabelBindings question > From: [email protected] > To: [email protected] > > On the DF: > > I can see how data frames and aggregations in dplyr-like fashion could be a > good (and mappable to almost any of engines) abstraction. It could be a > strictly separate module, or even the same module, as long as it is simply > a different data type (trait) with its own set of operations. Note that > every popular computational platform i know (R, pandas, etc.) make sure to > make a very clear distinction between data frame operations and tensor > operations. And I believe there's a very good reason for that. > > While this engine-agnostic DF support whould be something extremely cool to > see in Mahout, in reality people don't really care so much about engine > independence per se -- they work with just one concrete back. So am I. And > as long as i work with Spark, there are numerous engine-speicific > implementations to do those transformations in a dplyr fashion -- MLI, > language-integrated Spark QL, and, to a lesser degree, DDF project. Since > these things can be easily run in context of Mahout (e.g. Spark QL is > already enabled in context of Mahout since it is a part of Spark release > now), then there's very little incentive to justify funding engine-indepent > distributed data frame support for somebody as pragmatical as myself. > > For folks that are looking for a nice thesis project this idea might be > indefinitely more attractive though. Even then though, question comes if > they'd be able to match the amount of effort poured into Spark QL, and > therefore, at least match its capabilities in the engine-independent way. > So Occam principle as a guiding light of pragmatism bodes that this effort > is therefore is quite unlikely to succeed. > > On "quasi-algebraic" term: > > What i mean here is that there's algebra, or associated set of conditional > forking, that does not necessarily can be implemented by existing R-like or > Matlab-like set of primitves acting on the tensor as a whole. Note that > even 5+3 is algebra (since those are tensors with single element). So > pretty much any numeric manipulation can be though of as algebra. Not any > numeric manipulation can be implemented on tensors with current set of > R-like operations though. > > Good example is ALS vs. implicit ALS. the ALS (even regularized one) is > easily expressed with operators acting on the entire tensor(s) which is > essentially just two lines in a loop : > > while (!stop && i < maxIterations) { > drmV = (drmAt %*% drmU %*% solve(drmU.t %*% drmU -: diag(lambda, > k))).checkpoint() > drmU = (drmA %*% drmV %*% solve(drmV.t %*% drmV -: diag(lambda, > k))).checkpoint() > ... > } > > This is obviously 100% engine independent. > > Whereas the implicit flavor unfortunately requires very speific way of > working on elements in both distributed and non-distributed > implementations. In distributed version it also implies using very > engine-speicifc way of shuffling and downsampling the data in order to stay > efficient. > > > > On Sat, Sep 13, 2014 at 8:52 AM, Pat Ferrel <[email protected]> wrote: > > > IndexedDatasets were a holding place for what was, at the time, going to > > be dataframes. Now no one seems interested in dataframes and I don’t mind > > since they solve the problems I had. > > > > All the discussion about engine neutral and specific bits is only going to > > come up more and more. Dmitriy speaks for the neutrality of “math” by which > > I take it to mean “math-scala” and stuff in the DSL. Maybe engine neutral > > bits that don’t fit in that can be put in another module to save fighting > > over it. I once proposed “core-scala”. For that matter cooccurrence isn’t > > really math or DSL (maybe that’s what D means by quasi) and so might be > > better put in core-scala too. Inclusion means the code uses but does not > > extend the DSL and the pom doesn’t include an engine > > > > On Sep 12, 2014, at 6:44 PM, ap.dev <[email protected]> wrote: > > > > Oh thx- I thought indexedDatasets were spark specific. > > > > > > Sent from my Verizon Wireless 4G LTE smartphone > > > > <div>-------- Original message --------</div><div>From: Pat Ferrel < > > [email protected]> </div><div>Date:09/12/2014 7:52 PM (GMT-05:00) > > </div><div>To: [email protected] </div><div>Subject: Re: drmFromHDFS > > rowLabelBindings question </div><div> > > </div> > > The serialization can be in engine specific modules as with cooccurrence > > and ItemSimiarity. cooccurrence is in math-scala, ItemSmilarity is the > > engine specific driver. There is nothing engine specific about > > IndexedDatasets and an optimization that is not made yet is to allow one or > > no dictionaries where the keys suffice. > > > > Not sure what you want for initial input but you could start with a driver > > in the engine specific spark module, read in the IndexedDataset then pass > > it to your math code, work with the CheckpointedDrm using the DSL and > > dictionary then when done return an IndexedDataset to the driver for > > serialization. > > > > There’s also no reason that the serialization couldn’t also be implemented > > in H20, in fact I think it would be easier since they have richer text > > files types than Spark. > > > > Anand’s point about reducers is going to require either divergence or more > > engine neutral abstractions. I think serialization is in the same boat. > > > > On Sep 12, 2014, at 4:31 PM, Anand Avati <[email protected]> wrote: > > > > On Fri, Sep 12, 2014 at 4:12 PM, Andrew Palumbo <[email protected]> > > wrote: > > > > > > > > > > > > > > Thanks- I've been looking at that a bit .. It probably would make things > > > a whole lot easier but I'm working on Naive Bayes, and trying to keep > > > it in the math-scala package (I don't know how well this is going to > > > work because I haven't made my way to model serialization yet). > > > > > > Thinking > > > more of it though using an indexed dataset might make online > > > training/updating the of the weights a whole lot easier if we end up > > > implementing that. > > > > > > Also I think that an IndexedDataset will > > > probably be useful for classifying new documents where we do need to > > > keep the dictionary in memory. > > > > > > Right now, I just need the > > > labels up front in a vector so that i can extract the category and > > > broadcast a categoryByRowindex Vector out to a combiner using something > > > like: > > > > > > IntKeyedTFIDFDrm.t.mapBlock(ncols=numcategories){ > > > // aggregate cols by category}.t > > > > > > After > > > that we only need a relatively small Vector or Map of rows(Categories) > > > and don't need column labels as long as we're using seq2sparse. It may > > > make sense though to use something like an IndexedDataset here in the > > > future if we want to move away from seq2sparse in its current > > > implementation. > > > > > > I'm honestly not sure how well this label > > > extraction and aggregation is going to turn out performance-wise.. But > > > my thinking was that we can put an implementation in math-scala and then > > > extend and optimize it in spark if we want ie. rather than writing a > > > combiner using mapBlock- use spark's reduceByKey. > > > > > > > Note that there is no way (yet) to perform aggregate or reduce like > > operation through the DSL. Though the backends (both spark and h2o) support > > reduce-like operations, there is no DSL operator for that yet. We could > > either introduce a reduce/aggregate operator in as engine neutral/close to > > algebraic way as possible, or keep any kind of reduction/aggregate phase of > > operation backend specific (which kind of sucks) > > > > Thanks > > > > > > > > >> Subject: Re: drmFromHDFS rowLabelBindings question > > >> From: [email protected] > > >> Date: Fri, 12 Sep 2014 14:41:35 -0700 > > >> To: [email protected] > > >> > > >> Not sure if this helps but we (Sebastian and I) created an > > > IndexedDataset which maintains row and column HashBiMaps that use the Int > > > key to map to/from Strings. There are Reader and Writer traits for file > > IO > > > (text files for now). The flow is to read an IndexedDataset using the > > > Reader trait. Inside the IndexedDataset you have a CheckpointedDrm and > > two > > > label BiMaps for rows and columns. This method is used in the row and > > item > > > similarity jobs where you do math things like B.t %*% A After you do the > > > math using the drm contained in the IndexedDataset you assign the correct > > > dictionaries to the resulting IndexedDataset to maintain your labels for > > > writing or further math. It might make sense to implement some of the > > math > > > ops that would work with this simple approach but in any case you can do > > it > > > explicitly as those jobs do. The idea was to support other file formats > > > like sequence files as the need comes up. > > >> > > >> On Sep 12, 2014, at 1:14 PM, Andrew Palumbo <[email protected]> wrote: > > >> > > >> It doesn't look like it has anything to do with the conversion. > > >> > > >> after: > > >> > > >> val rowBindings = d.map(t => (t._1._1.toString, t._2: > > > java.lang.Integer)).toMap > > >> > > >> rowBindings.size is one > > >> > > >> From: [email protected] > > >> To: [email protected] > > >> Subject: RE: drmFromHDFS rowLabelBindings question > > >> Date: Fri, 12 Sep 2014 15:53:48 -0400 > > >> > > >> > > >> > > >> > > >> Thanks guys, I was wondering about the java.util.Map conversion too. > > > I'll try copying everything into a java.util.HashMap and passing that to > > > setRowBindings. I'll play around with it and if i cant get it to work, > > > I'll file a jira. > > >> > > >> I'm just using it in the NB implementation so its not a pressing issue. > > >> > > >> Appreciate it. > > >> > > >>> Date: Fri, 12 Sep 2014 12:35:21 -0700 > > >>> Subject: Re: drmFromHDFS rowLabelBindings question > > >>> From: [email protected] > > >>> To: [email protected] > > >>> > > >>> On Fri, Sep 12, 2014 at 12:17 PM, Anand Avati <[email protected]> > > > wrote: > > >>> > > >>>> > > >>>> > > >>>> On Fri, Sep 12, 2014 at 12:00 PM, Anand Avati <[email protected]> > > > wrote: > > >>>> > > >>>>> > > >>>>> > > >>>>> On Fri, Sep 12, 2014 at 11:57 AM, Dmitriy Lyubimov < > > > [email protected]> > > >>>>> wrote: > > >>>>> > > >>>>>> bit i you are really compelled that it is something that might be > > > needed, > > >>>>>> the best way probably would be indeed create an optional parameter > > > to > > >>>>>> collect (something like > > > drmLike.collect(extractLabels:Boolean=false)) > > >>>>>> which > > >>>>>> you can flip to true if needed and the thing does toString on keys > > > and > > >>>>>> assinging them to in-core matrix' row labels. (requires a patch of > > >>>>>> course) > > >>>>>> > > >>>>>> > > >>>>> As I mentioned in the other mail, this is already the case. The code > > >>>>> seems to assume .toMap internally does collect. My (somewhat wild) > > >>>>> suspicion is that this line is somehow fooling the eye: > > >>>>> > > >>>>> val rowBindings = d.map(t => (t._1._1.toString, t._2: > > > java.lang.Integer)).toMap > > >>>>> > > >>>>> > > >>>>> > > >>>> Argh, for a moment I was thinking `d` is still an rdd. It is actually > > > all > > >>>> in-core, as the entirety of the rdd is collected up front into > > > `data`. In > > >>>> any case I suspect the non-int key collecting code might be doing > > > something > > >>>> funny. > > >>>> > > >>> > > >>> One problem I see is that toMap() returns scala.collections.Map, > > > whereas > > >>> the next line, m.setRowLabelBindings accepts a java.util.Map. Since the > > >>> code compiles fine there is probably an implicit conversion happening > > >>> somewhere, and I dont know if the conversion is doing the right thing. > > >>> Other than this, rest of the code seems to look fine. > > >> > > >> > > > > > > > > > > > > > > >
