Re: drmFromHDFS rowLabelBindings question

Dmitriy Lyubimov Sat, 13 Sep 2014 10:45:07 -0700

sorry. doesn't make sense to me. too abstract.

On Sat, Sep 13, 2014 at 10:28 AM, Saikat Kanjilal <[email protected]>
wrote:


> > Since there's (as it stands) no Mahout concept of df manipulation,
> there's
> > nothing to bridge to (in Bridge pattern sense) if that's what you mean.
> I don't think its just a bridge pattern as the design pattern but more of
> an adapter that contains mahout-dsl specific adapters that take a sparkddf
> and apply operations to make it more meaningful in the mahout-dsl world.
> So I would say that if mahout directly links and works with spark ddf that
> could be messy, I would think that linking in spark ddf would mean that we
> would need to bring in the sparkcontext, now I can imagine when building a
> particular algorithm that could leverage the concept of a dataframe (maybe
> what Andrew is doing with naive-bayes) wouldnt it be messy to have both
> SparkContext and MahoutContext in the same context of an algorithm.
> Another idea I was thinking about was embedding the concept of a dataframe
> directly into the engine specific code, in general I think there may be
> some complexity in directly incorporating with spark ddf.
>
>
> Thoughts?
> > Date: Sat, 13 Sep 2014 10:21:18 -0700
> > Subject: Re: drmFromHDFS rowLabelBindings question
> > From: [email protected]
> > To: [email protected]
> >
> > On Sat, Sep 13, 2014 at 10:01 AM, Saikat Kanjilal <[email protected]>
> > wrote:
> >
> > > One question based on this discussion, is there anything we can
> provide on
> > > top of spark ddf that would be useful in working within mahout DSL,
> maybe
> > > what we really need to do is to build a thin layer with mahout
> nice-ties
> > > that links in spark ddf and nicely serves as a translation layer
> between
> > > the mahout data types and data types within a spark-ddf.  Would love to
> > > hear if this has any merits.
> > >
> >
> > Since there's (as it stands) no Mahout concept of df manipulation,
> there's
> > nothing to bridge to (in Bridge pattern sense) if that's what you mean.
> >
> > However DF data types play important role in data cleansing and
> > featurization/vectorization, and featurization methods that Mahout
> > implements on Hadoop MR side, were not ported to Spark . We need to find
> a
> > way to re-implement those for spark -- and if we want to leave a way to
> do
> > an easy port to  other engines, perhaps we should keep clean
> > engine-independent code form engine dependent using (as i mentioned
> before)
> > Strategies and Visitors. We also probably should move all the logic there
> > to Scala and scala collections (at least i'd try to do that).
> >
> > here i mean stuff similar to what  currently seqDirectory, hasing trick
> and
> > feature normalization does. This requires a whole new architecture IMO in
> > light of engine portability requirements (if there are such
> requirements).
> >
> >
> > > > Date: Sat, 13 Sep 2014 09:57:17 -0700
> > > > Subject: Re: drmFromHDFS rowLabelBindings question
> > > > From: [email protected]
> > > > To: [email protected]
> > > >
> > > > On the DF:
> > > >
> > > > I can see how data frames and aggregations in dplyr-like fashion
> could
> > > be a
> > > > good (and mappable to almost any of engines) abstraction. It could
> be a
> > > > strictly separate module, or even the same module, as long as it is
> > > simply
> > > > a different data type (trait) with its own set of operations. Note
> that
> > > > every popular computational platform i know (R, pandas, etc.) make
> sure
> > > to
> > > > make a very clear distinction between data frame operations and
> tensor
> > > > operations. And I believe there's a very good reason for that.
> > > >
> > > > While this engine-agnostic DF support whould be something extremely
> cool
> > > to
> > > > see in Mahout, in reality people don't really care so much about
> engine
> > > > independence per se -- they work with just one concrete back. So am
> I.
> > > And
> > > > as long as i work with Spark, there are numerous engine-speicific
> > > > implementations to do those transformations in a dplyr fashion --
> MLI,
> > > > language-integrated Spark QL, and, to a lesser degree, DDF project.
> Since
> > > > these things can be easily run in context of Mahout (e.g. Spark QL is
> > > > already enabled in context of Mahout since it is a part of Spark
> release
> > > > now), then there's very little incentive to justify funding
> > > engine-indepent
> > > > distributed data frame support for somebody as pragmatical as myself.
> > > >
> > > > For folks that are looking for a nice thesis project this idea might
> be
> > > > indefinitely more attractive though. Even then though, question
> comes if
> > > > they'd be able to match the amount of effort poured into Spark QL,
> and
> > > > therefore, at least match its capabilities in the engine-independent
> way.
> > > > So Occam principle as a guiding light of pragmatism bodes that this
> > > effort
> > > > is therefore is quite unlikely to succeed.
> > > >
> > > > On "quasi-algebraic" term:
> > > >
> > > > What i mean here is that there's algebra, or associated set of
> > > conditional
> > > > forking, that does not necessarily can be implemented by existing
> R-like
> > > or
> > > > Matlab-like set of primitves acting on the tensor as a whole. Note
> that
> > > > even 5+3 is algebra (since those are tensors with single element). So
> > > > pretty much any numeric manipulation can be though of as algebra.
> Not any
> > > > numeric manipulation can be implemented on tensors with current set
> of
> > > > R-like operations though.
> > > >
> > > > Good example is ALS vs. implicit ALS. the ALS (even regularized one)
> is
> > > > easily expressed with operators acting on the entire tensor(s) which
> is
> > > > essentially just two lines in a loop :
> > > >
> > > > while (!stop && i < maxIterations) {
> > > >      drmV = (drmAt %*% drmU %*% solve(drmU.t %*% drmU -: diag(lambda,
> > > > k))).checkpoint()
> > > >      drmU = (drmA %*% drmV %*% solve(drmV.t %*% drmV -: diag(lambda,
> > > > k))).checkpoint()
> > > >     ...
> > > > }
> > > >
> > > > This is obviously 100% engine independent.
> > > >
> > > > Whereas the implicit flavor unfortunately requires very speific way
> of
> > > > working on elements in both distributed and non-distributed
> > > > implementations. In distributed version it also implies using very
> > > > engine-speicifc way of shuffling and downsampling the data in order
> to
> > > stay
> > > > efficient.
> > > >
> > > >
> > > >
> > > > On Sat, Sep 13, 2014 at 8:52 AM, Pat Ferrel <[email protected]>
> > > wrote:
> > > >
> > > > > IndexedDatasets were a holding place for what was, at the time,
> going
> > > to
> > > > > be dataframes. Now no one seems interested in dataframes and I
> don’t
> > > mind
> > > > > since they solve the problems I had.
> > > > >
> > > > > All the discussion about engine neutral and specific bits is only
> > > going to
> > > > > come up more and more. Dmitriy speaks for the neutrality of “math”
> by
> > > which
> > > > > I take it to mean “math-scala” and stuff in the DSL. Maybe engine
> > > neutral
> > > > > bits that don’t fit in that can be put in another module to save
> > > fighting
> > > > > over it. I once proposed “core-scala”. For that matter cooccurrence
> > > isn’t
> > > > > really math or DSL (maybe that’s what D means by quasi) and so
> might be
> > > > > better put in core-scala too. Inclusion means the code uses but
> does
> > > not
> > > > > extend the DSL and the pom doesn’t include an engine
> > > > >
> > > > > On Sep 12, 2014, at 6:44 PM, ap.dev <[email protected]> wrote:
> > > > >
> > > > > Oh thx-  I thought indexedDatasets were spark specific.
> > > > >
> > > > >
> > > > > Sent from my Verizon Wireless 4G LTE smartphone
> > > > >
> > > > > <div>-------- Original message --------</div><div>From: Pat Ferrel
> <
> > > > > [email protected]> </div><div>Date:09/12/2014  7:52 PM
> > > (GMT-05:00)
> > > > > </div><div>To: [email protected] </div><div>Subject: Re:
> > > drmFromHDFS
> > > > > rowLabelBindings question </div><div>
> > > > > </div>
> > > > > The serialization can be in engine specific modules as with
> > > cooccurrence
> > > > > and ItemSimiarity. cooccurrence is in math-scala, ItemSmilarity is
> the
> > > > > engine specific driver. There is nothing engine specific about
> > > > > IndexedDatasets and an optimization that is not made yet is to
> allow
> > > one or
> > > > > no dictionaries where the keys suffice.
> > > > >
> > > > > Not sure what you want for initial input but you could start with a
> > > driver
> > > > > in the engine specific spark module, read in the IndexedDataset
> then
> > > pass
> > > > > it to your math code, work with the CheckpointedDrm using the DSL
> and
> > > > > dictionary then when done return an IndexedDataset to the driver
> for
> > > > > serialization.
> > > > >
> > > > > There’s also no reason that the serialization couldn’t also be
> > > implemented
> > > > > in H20, in fact I think it would be easier since they have richer
> text
> > > > > files types than Spark.
> > > > >
> > > > > Anand’s point about reducers is going to require either divergence
> or
> > > more
> > > > > engine neutral abstractions. I think serialization is in the same
> boat.
> > > > >
> > > > > On Sep 12, 2014, at 4:31 PM, Anand Avati <[email protected]>
> wrote:
> > > > >
> > > > > On Fri, Sep 12, 2014 at 4:12 PM, Andrew Palumbo <
> [email protected]>
> > > > > wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks- I've been looking at that a bit .. It probably would make
> > > things
> > > > > > a whole lot easier but I'm working on Naive Bayes, and  trying to
> > > keep
> > > > > > it in the math-scala package (I don't know how well this is
> going to
> > > > > > work because I haven't made my way to model serialization yet).
> > > > > >
> > > > > > Thinking
> > > > > > more of it though using an indexed dataset might make online
> > > > > > training/updating the of the weights a whole lot easier if we
> end up
> > > > > > implementing that.
> > > > > >
> > > > > > Also I think that an IndexedDataset will
> > > > > > probably be useful for classifying new documents where we do
> need to
> > > > > > keep the dictionary in memory.
> > > > > >
> > > > > > Right now, I just need  the
> > > > > > labels up front in a vector so that i can extract the category
> and
> > > > > > broadcast a categoryByRowindex Vector out to a combiner using
> > > something
> > > > > > like:
> > > > > >
> > > > > >  IntKeyedTFIDFDrm.t.mapBlock(ncols=numcategories){
> > > > > >        // aggregate cols by category}.t
> > > > > >
> > > > > > After
> > > > > > that we only need a relatively small Vector or Map of
> > > rows(Categories)
> > > > > > and don't need column labels as long as we're using seq2sparse.
> It
> > > may
> > > > > > make sense though to use something like an IndexedDataset here
> in the
> > > > > > future if we want to move away from seq2sparse in its current
> > > > > > implementation.
> > > > > >
> > > > > > I'm honestly not sure how well this label
> > > > > > extraction and aggregation is going to turn out
> performance-wise..
> > > But
> > > > > > my thinking was that we can put an implementation in math-scala
> and
> > > then
> > > > > > extend and optimize it in spark if we want ie. rather than
> writing a
> > > > > > combiner using mapBlock- use spark's reduceByKey.
> > > > > >
> > > > >
> > > > > Note that there is no way (yet) to perform aggregate or reduce like
> > > > > operation through the DSL. Though the backends (both spark and h2o)
> > > support
> > > > > reduce-like operations, there is no DSL operator for that yet. We
> could
> > > > > either introduce a reduce/aggregate operator in as engine
> > > neutral/close to
> > > > > algebraic way as possible, or keep any kind of reduction/aggregate
> > > phase of
> > > > > operation backend specific (which kind of sucks)
> > > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > >
> > > > > >> Subject: Re: drmFromHDFS rowLabelBindings question
> > > > > >> From: [email protected]
> > > > > >> Date: Fri, 12 Sep 2014 14:41:35 -0700
> > > > > >> To: [email protected]
> > > > > >>
> > > > > >> Not sure if this helps but we (Sebastian and I) created an
> > > > > > IndexedDataset which maintains row and column HashBiMaps that use
> > > the Int
> > > > > > key to map to/from Strings. There are Reader and Writer traits
> for
> > > file
> > > > > IO
> > > > > > (text files for now). The flow is to read an IndexedDataset
> using the
> > > > > > Reader trait. Inside the IndexedDataset you have a
> CheckpointedDrm
> > > and
> > > > > two
> > > > > > label BiMaps for rows and columns. This method is used in the
> row and
> > > > > item
> > > > > > similarity jobs where you do math things like B.t %*% A After
> you do
> > > the
> > > > > > math using the drm contained in the IndexedDataset you assign the
> > > correct
> > > > > > dictionaries to the resulting IndexedDataset to maintain your
> labels
> > > for
> > > > > > writing or further math. It might make sense to implement some
> of the
> > > > > math
> > > > > > ops that would work with this simple approach but in any case you
> > > can do
> > > > > it
> > > > > > explicitly as those jobs do. The idea was to support other file
> > > formats
> > > > > > like sequence files as the need comes up.
> > > > > >>
> > > > > >> On Sep 12, 2014, at 1:14 PM, Andrew Palumbo <[email protected]
> >
> > > wrote:
> > > > > >>
> > > > > >> It doesn't look like it has anything to do with the conversion.
> > > > > >>
> > > > > >> after:
> > > > > >>
> > > > > >>  val rowBindings = d.map(t => (t._1._1.toString, t._2:
> > > > > > java.lang.Integer)).toMap
> > > > > >>
> > > > > >> rowBindings.size  is one
> > > > > >>
> > > > > >> From: [email protected]
> > > > > >> To: [email protected]
> > > > > >> Subject: RE: drmFromHDFS rowLabelBindings question
> > > > > >> Date: Fri, 12 Sep 2014 15:53:48 -0400
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> Thanks guys,  I was wondering about the java.util.Map conversion
> > > too.
> > > > > > I'll try copying everything into a java.util.HashMap and passing
> > > that to
> > > > > > setRowBindings.  I'll play around with it and if i cant get it to
> > > work,
> > > > > > I'll file a jira.
> > > > > >>
> > > > > >> I'm just using it in the NB implementation so its not a pressing
> > > issue.
> > > > > >>
> > > > > >> Appreciate it.
> > > > > >>
> > > > > >>> Date: Fri, 12 Sep 2014 12:35:21 -0700
> > > > > >>> Subject: Re: drmFromHDFS rowLabelBindings question
> > > > > >>> From: [email protected]
> > > > > >>> To: [email protected]
> > > > > >>>
> > > > > >>> On Fri, Sep 12, 2014 at 12:17 PM, Anand Avati <
> [email protected]>
> > > > > > wrote:
> > > > > >>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On Fri, Sep 12, 2014 at 12:00 PM, Anand Avati <
> [email protected]>
> > > > > > wrote:
> > > > > >>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> On Fri, Sep 12, 2014 at 11:57 AM, Dmitriy Lyubimov <
> > > > > > [email protected]>
> > > > > >>>>> wrote:
> > > > > >>>>>
> > > > > >>>>>> bit i you are really compelled that it is something that
> might
> > > be
> > > > > > needed,
> > > > > >>>>>> the best way probably would be indeed create an optional
> > > parameter
> > > > > > to
> > > > > >>>>>> collect (something like
> > > > > > drmLike.collect(extractLabels:Boolean=false))
> > > > > >>>>>> which
> > > > > >>>>>> you can flip to true if needed and the thing does toString
> on
> > > keys
> > > > > > and
> > > > > >>>>>> assinging them to in-core matrix' row labels. (requires a
> patch
> > > of
> > > > > >>>>>> course)
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>> As I mentioned in the other mail, this is already the case.
> The
> > > code
> > > > > >>>>> seems to assume .toMap internally does collect. My (somewhat
> > > wild)
> > > > > >>>>> suspicion is that this line is somehow fooling the eye:
> > > > > >>>>>
> > > > > >>>>> val rowBindings = d.map(t => (t._1._1.toString, t._2:
> > > > > > java.lang.Integer)).toMap
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>> Argh, for a moment I was thinking `d` is still an rdd. It is
> > > actually
> > > > > > all
> > > > > >>>> in-core, as the entirety of the rdd is collected up front into
> > > > > > `data`. In
> > > > > >>>> any case I suspect the non-int key collecting code might be
> doing
> > > > > > something
> > > > > >>>> funny.
> > > > > >>>>
> > > > > >>>
> > > > > >>> One problem I see is that toMap() returns
> scala.collections.Map,
> > > > > > whereas
> > > > > >>> the next line, m.setRowLabelBindings accepts a java.util.Map.
> > > Since the
> > > > > >>> code compiles fine there is probably an implicit conversion
> > > happening
> > > > > >>> somewhere, and I dont know if the conversion is doing the right
> > > thing.
> > > > > >>> Other than this, rest of the code seems to look fine.
> > > > > >>
> > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > >
> > >
>
>

Re: drmFromHDFS rowLabelBindings question

Reply via email to