Let me be more concrete then, earlier there was mention of just working with the concept of a spark distributed data frame and bringing that into the mahout-dsl world rather than creating a mahout data frame, what I am proposing is to build a thin layer that abstracts away concepts in the mahout-dsl world (iterators/vector/matrix/slicing/retrieving individual or groups of data within a matrix) and the spark ddf world (which includes a dataframe and its underpinnings including a sql database, a set of data in one or more buckets in hdfs, groups of data inside a nosql database). I was hoping to do this in MAHOUT-1490 till I saw the input that we should just bring in and work with spark ddf. What I would like to understand is how mahout-dsl will work directly with spark ddf, is this realistic given the current data types within mahout-dsl world? Just wanting a deeper understanding because I still feel like an adaptation layer is needed.
> Date: Sat, 13 Sep 2014 10:44:05 -0700 > Subject: Re: drmFromHDFS rowLabelBindings question > From: [email protected] > To: [email protected] > > sorry. doesn't make sense to me. too abstract. > > On Sat, Sep 13, 2014 at 10:28 AM, Saikat Kanjilal <[email protected]> > wrote: > > > > Since there's (as it stands) no Mahout concept of df manipulation, > > there's > > > nothing to bridge to (in Bridge pattern sense) if that's what you mean. > > I don't think its just a bridge pattern as the design pattern but more of > > an adapter that contains mahout-dsl specific adapters that take a sparkddf > > and apply operations to make it more meaningful in the mahout-dsl world. > > So I would say that if mahout directly links and works with spark ddf that > > could be messy, I would think that linking in spark ddf would mean that we > > would need to bring in the sparkcontext, now I can imagine when building a > > particular algorithm that could leverage the concept of a dataframe (maybe > > what Andrew is doing with naive-bayes) wouldnt it be messy to have both > > SparkContext and MahoutContext in the same context of an algorithm. > > Another idea I was thinking about was embedding the concept of a dataframe > > directly into the engine specific code, in general I think there may be > > some complexity in directly incorporating with spark ddf. > > > > > > Thoughts? > > > Date: Sat, 13 Sep 2014 10:21:18 -0700 > > > Subject: Re: drmFromHDFS rowLabelBindings question > > > From: [email protected] > > > To: [email protected] > > > > > > On Sat, Sep 13, 2014 at 10:01 AM, Saikat Kanjilal <[email protected]> > > > wrote: > > > > > > > One question based on this discussion, is there anything we can > > provide on > > > > top of spark ddf that would be useful in working within mahout DSL, > > maybe > > > > what we really need to do is to build a thin layer with mahout > > nice-ties > > > > that links in spark ddf and nicely serves as a translation layer > > between > > > > the mahout data types and data types within a spark-ddf. Would love to > > > > hear if this has any merits. > > > > > > > > > > Since there's (as it stands) no Mahout concept of df manipulation, > > there's > > > nothing to bridge to (in Bridge pattern sense) if that's what you mean. > > > > > > However DF data types play important role in data cleansing and > > > featurization/vectorization, and featurization methods that Mahout > > > implements on Hadoop MR side, were not ported to Spark . We need to find > > a > > > way to re-implement those for spark -- and if we want to leave a way to > > do > > > an easy port to other engines, perhaps we should keep clean > > > engine-independent code form engine dependent using (as i mentioned > > before) > > > Strategies and Visitors. We also probably should move all the logic there > > > to Scala and scala collections (at least i'd try to do that). > > > > > > here i mean stuff similar to what currently seqDirectory, hasing trick > > and > > > feature normalization does. This requires a whole new architecture IMO in > > > light of engine portability requirements (if there are such > > requirements). > > > > > > > > > > > Date: Sat, 13 Sep 2014 09:57:17 -0700 > > > > > Subject: Re: drmFromHDFS rowLabelBindings question > > > > > From: [email protected] > > > > > To: [email protected] > > > > > > > > > > On the DF: > > > > > > > > > > I can see how data frames and aggregations in dplyr-like fashion > > could > > > > be a > > > > > good (and mappable to almost any of engines) abstraction. It could > > be a > > > > > strictly separate module, or even the same module, as long as it is > > > > simply > > > > > a different data type (trait) with its own set of operations. Note > > that > > > > > every popular computational platform i know (R, pandas, etc.) make > > sure > > > > to > > > > > make a very clear distinction between data frame operations and > > tensor > > > > > operations. And I believe there's a very good reason for that. > > > > > > > > > > While this engine-agnostic DF support whould be something extremely > > cool > > > > to > > > > > see in Mahout, in reality people don't really care so much about > > engine > > > > > independence per se -- they work with just one concrete back. So am > > I. > > > > And > > > > > as long as i work with Spark, there are numerous engine-speicific > > > > > implementations to do those transformations in a dplyr fashion -- > > MLI, > > > > > language-integrated Spark QL, and, to a lesser degree, DDF project. > > Since > > > > > these things can be easily run in context of Mahout (e.g. Spark QL is > > > > > already enabled in context of Mahout since it is a part of Spark > > release > > > > > now), then there's very little incentive to justify funding > > > > engine-indepent > > > > > distributed data frame support for somebody as pragmatical as myself. > > > > > > > > > > For folks that are looking for a nice thesis project this idea might > > be > > > > > indefinitely more attractive though. Even then though, question > > comes if > > > > > they'd be able to match the amount of effort poured into Spark QL, > > and > > > > > therefore, at least match its capabilities in the engine-independent > > way. > > > > > So Occam principle as a guiding light of pragmatism bodes that this > > > > effort > > > > > is therefore is quite unlikely to succeed. > > > > > > > > > > On "quasi-algebraic" term: > > > > > > > > > > What i mean here is that there's algebra, or associated set of > > > > conditional > > > > > forking, that does not necessarily can be implemented by existing > > R-like > > > > or > > > > > Matlab-like set of primitves acting on the tensor as a whole. Note > > that > > > > > even 5+3 is algebra (since those are tensors with single element). So > > > > > pretty much any numeric manipulation can be though of as algebra. > > Not any > > > > > numeric manipulation can be implemented on tensors with current set > > of > > > > > R-like operations though. > > > > > > > > > > Good example is ALS vs. implicit ALS. the ALS (even regularized one) > > is > > > > > easily expressed with operators acting on the entire tensor(s) which > > is > > > > > essentially just two lines in a loop : > > > > > > > > > > while (!stop && i < maxIterations) { > > > > > drmV = (drmAt %*% drmU %*% solve(drmU.t %*% drmU -: diag(lambda, > > > > > k))).checkpoint() > > > > > drmU = (drmA %*% drmV %*% solve(drmV.t %*% drmV -: diag(lambda, > > > > > k))).checkpoint() > > > > > ... > > > > > } > > > > > > > > > > This is obviously 100% engine independent. > > > > > > > > > > Whereas the implicit flavor unfortunately requires very speific way > > of > > > > > working on elements in both distributed and non-distributed > > > > > implementations. In distributed version it also implies using very > > > > > engine-speicifc way of shuffling and downsampling the data in order > > to > > > > stay > > > > > efficient. > > > > > > > > > > > > > > > > > > > > On Sat, Sep 13, 2014 at 8:52 AM, Pat Ferrel <[email protected]> > > > > wrote: > > > > > > > > > > > IndexedDatasets were a holding place for what was, at the time, > > going > > > > to > > > > > > be dataframes. Now no one seems interested in dataframes and I > > don’t > > > > mind > > > > > > since they solve the problems I had. > > > > > > > > > > > > All the discussion about engine neutral and specific bits is only > > > > going to > > > > > > come up more and more. Dmitriy speaks for the neutrality of “math” > > by > > > > which > > > > > > I take it to mean “math-scala” and stuff in the DSL. Maybe engine > > > > neutral > > > > > > bits that don’t fit in that can be put in another module to save > > > > fighting > > > > > > over it. I once proposed “core-scala”. For that matter cooccurrence > > > > isn’t > > > > > > really math or DSL (maybe that’s what D means by quasi) and so > > might be > > > > > > better put in core-scala too. Inclusion means the code uses but > > does > > > > not > > > > > > extend the DSL and the pom doesn’t include an engine > > > > > > > > > > > > On Sep 12, 2014, at 6:44 PM, ap.dev <[email protected]> wrote: > > > > > > > > > > > > Oh thx- I thought indexedDatasets were spark specific. > > > > > > > > > > > > > > > > > > Sent from my Verizon Wireless 4G LTE smartphone > > > > > > > > > > > > <div>-------- Original message --------</div><div>From: Pat Ferrel > > < > > > > > > [email protected]> </div><div>Date:09/12/2014 7:52 PM > > > > (GMT-05:00) > > > > > > </div><div>To: [email protected] </div><div>Subject: Re: > > > > drmFromHDFS > > > > > > rowLabelBindings question </div><div> > > > > > > </div> > > > > > > The serialization can be in engine specific modules as with > > > > cooccurrence > > > > > > and ItemSimiarity. cooccurrence is in math-scala, ItemSmilarity is > > the > > > > > > engine specific driver. There is nothing engine specific about > > > > > > IndexedDatasets and an optimization that is not made yet is to > > allow > > > > one or > > > > > > no dictionaries where the keys suffice. > > > > > > > > > > > > Not sure what you want for initial input but you could start with a > > > > driver > > > > > > in the engine specific spark module, read in the IndexedDataset > > then > > > > pass > > > > > > it to your math code, work with the CheckpointedDrm using the DSL > > and > > > > > > dictionary then when done return an IndexedDataset to the driver > > for > > > > > > serialization. > > > > > > > > > > > > There’s also no reason that the serialization couldn’t also be > > > > implemented > > > > > > in H20, in fact I think it would be easier since they have richer > > text > > > > > > files types than Spark. > > > > > > > > > > > > Anand’s point about reducers is going to require either divergence > > or > > > > more > > > > > > engine neutral abstractions. I think serialization is in the same > > boat. > > > > > > > > > > > > On Sep 12, 2014, at 4:31 PM, Anand Avati <[email protected]> > > wrote: > > > > > > > > > > > > On Fri, Sep 12, 2014 at 4:12 PM, Andrew Palumbo < > > [email protected]> > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks- I've been looking at that a bit .. It probably would make > > > > things > > > > > > > a whole lot easier but I'm working on Naive Bayes, and trying to > > > > keep > > > > > > > it in the math-scala package (I don't know how well this is > > going to > > > > > > > work because I haven't made my way to model serialization yet). > > > > > > > > > > > > > > Thinking > > > > > > > more of it though using an indexed dataset might make online > > > > > > > training/updating the of the weights a whole lot easier if we > > end up > > > > > > > implementing that. > > > > > > > > > > > > > > Also I think that an IndexedDataset will > > > > > > > probably be useful for classifying new documents where we do > > need to > > > > > > > keep the dictionary in memory. > > > > > > > > > > > > > > Right now, I just need the > > > > > > > labels up front in a vector so that i can extract the category > > and > > > > > > > broadcast a categoryByRowindex Vector out to a combiner using > > > > something > > > > > > > like: > > > > > > > > > > > > > > IntKeyedTFIDFDrm.t.mapBlock(ncols=numcategories){ > > > > > > > // aggregate cols by category}.t > > > > > > > > > > > > > > After > > > > > > > that we only need a relatively small Vector or Map of > > > > rows(Categories) > > > > > > > and don't need column labels as long as we're using seq2sparse. > > It > > > > may > > > > > > > make sense though to use something like an IndexedDataset here > > in the > > > > > > > future if we want to move away from seq2sparse in its current > > > > > > > implementation. > > > > > > > > > > > > > > I'm honestly not sure how well this label > > > > > > > extraction and aggregation is going to turn out > > performance-wise.. > > > > But > > > > > > > my thinking was that we can put an implementation in math-scala > > and > > > > then > > > > > > > extend and optimize it in spark if we want ie. rather than > > writing a > > > > > > > combiner using mapBlock- use spark's reduceByKey. > > > > > > > > > > > > > > > > > > > Note that there is no way (yet) to perform aggregate or reduce like > > > > > > operation through the DSL. Though the backends (both spark and h2o) > > > > support > > > > > > reduce-like operations, there is no DSL operator for that yet. We > > could > > > > > > either introduce a reduce/aggregate operator in as engine > > > > neutral/close to > > > > > > algebraic way as possible, or keep any kind of reduction/aggregate > > > > phase of > > > > > > operation backend specific (which kind of sucks) > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > > > > > > >> Subject: Re: drmFromHDFS rowLabelBindings question > > > > > > >> From: [email protected] > > > > > > >> Date: Fri, 12 Sep 2014 14:41:35 -0700 > > > > > > >> To: [email protected] > > > > > > >> > > > > > > >> Not sure if this helps but we (Sebastian and I) created an > > > > > > > IndexedDataset which maintains row and column HashBiMaps that use > > > > the Int > > > > > > > key to map to/from Strings. There are Reader and Writer traits > > for > > > > file > > > > > > IO > > > > > > > (text files for now). The flow is to read an IndexedDataset > > using the > > > > > > > Reader trait. Inside the IndexedDataset you have a > > CheckpointedDrm > > > > and > > > > > > two > > > > > > > label BiMaps for rows and columns. This method is used in the > > row and > > > > > > item > > > > > > > similarity jobs where you do math things like B.t %*% A After > > you do > > > > the > > > > > > > math using the drm contained in the IndexedDataset you assign the > > > > correct > > > > > > > dictionaries to the resulting IndexedDataset to maintain your > > labels > > > > for > > > > > > > writing or further math. It might make sense to implement some > > of the > > > > > > math > > > > > > > ops that would work with this simple approach but in any case you > > > > can do > > > > > > it > > > > > > > explicitly as those jobs do. The idea was to support other file > > > > formats > > > > > > > like sequence files as the need comes up. > > > > > > >> > > > > > > >> On Sep 12, 2014, at 1:14 PM, Andrew Palumbo <[email protected] > > > > > > > wrote: > > > > > > >> > > > > > > >> It doesn't look like it has anything to do with the conversion. > > > > > > >> > > > > > > >> after: > > > > > > >> > > > > > > >> val rowBindings = d.map(t => (t._1._1.toString, t._2: > > > > > > > java.lang.Integer)).toMap > > > > > > >> > > > > > > >> rowBindings.size is one > > > > > > >> > > > > > > >> From: [email protected] > > > > > > >> To: [email protected] > > > > > > >> Subject: RE: drmFromHDFS rowLabelBindings question > > > > > > >> Date: Fri, 12 Sep 2014 15:53:48 -0400 > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> Thanks guys, I was wondering about the java.util.Map conversion > > > > too. > > > > > > > I'll try copying everything into a java.util.HashMap and passing > > > > that to > > > > > > > setRowBindings. I'll play around with it and if i cant get it to > > > > work, > > > > > > > I'll file a jira. > > > > > > >> > > > > > > >> I'm just using it in the NB implementation so its not a pressing > > > > issue. > > > > > > >> > > > > > > >> Appreciate it. > > > > > > >> > > > > > > >>> Date: Fri, 12 Sep 2014 12:35:21 -0700 > > > > > > >>> Subject: Re: drmFromHDFS rowLabelBindings question > > > > > > >>> From: [email protected] > > > > > > >>> To: [email protected] > > > > > > >>> > > > > > > >>> On Fri, Sep 12, 2014 at 12:17 PM, Anand Avati < > > [email protected]> > > > > > > > wrote: > > > > > > >>> > > > > > > >>>> > > > > > > >>>> > > > > > > >>>> On Fri, Sep 12, 2014 at 12:00 PM, Anand Avati < > > [email protected]> > > > > > > > wrote: > > > > > > >>>> > > > > > > >>>>> > > > > > > >>>>> > > > > > > >>>>> On Fri, Sep 12, 2014 at 11:57 AM, Dmitriy Lyubimov < > > > > > > > [email protected]> > > > > > > >>>>> wrote: > > > > > > >>>>> > > > > > > >>>>>> bit i you are really compelled that it is something that > > might > > > > be > > > > > > > needed, > > > > > > >>>>>> the best way probably would be indeed create an optional > > > > parameter > > > > > > > to > > > > > > >>>>>> collect (something like > > > > > > > drmLike.collect(extractLabels:Boolean=false)) > > > > > > >>>>>> which > > > > > > >>>>>> you can flip to true if needed and the thing does toString > > on > > > > keys > > > > > > > and > > > > > > >>>>>> assinging them to in-core matrix' row labels. (requires a > > patch > > > > of > > > > > > >>>>>> course) > > > > > > >>>>>> > > > > > > >>>>>> > > > > > > >>>>> As I mentioned in the other mail, this is already the case. > > The > > > > code > > > > > > >>>>> seems to assume .toMap internally does collect. My (somewhat > > > > wild) > > > > > > >>>>> suspicion is that this line is somehow fooling the eye: > > > > > > >>>>> > > > > > > >>>>> val rowBindings = d.map(t => (t._1._1.toString, t._2: > > > > > > > java.lang.Integer)).toMap > > > > > > >>>>> > > > > > > >>>>> > > > > > > >>>>> > > > > > > >>>> Argh, for a moment I was thinking `d` is still an rdd. It is > > > > actually > > > > > > > all > > > > > > >>>> in-core, as the entirety of the rdd is collected up front into > > > > > > > `data`. In > > > > > > >>>> any case I suspect the non-int key collecting code might be > > doing > > > > > > > something > > > > > > >>>> funny. > > > > > > >>>> > > > > > > >>> > > > > > > >>> One problem I see is that toMap() returns > > scala.collections.Map, > > > > > > > whereas > > > > > > >>> the next line, m.setRowLabelBindings accepts a java.util.Map. > > > > Since the > > > > > > >>> code compiles fine there is probably an implicit conversion > > > > happening > > > > > > >>> somewhere, and I dont know if the conversion is doing the right > > > > thing. > > > > > > >>> Other than this, rest of the code seems to look fine. > > > > > > >> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
