On Fri, Sep 12, 2014 at 4:12 PM, Andrew Palumbo <[email protected]> wrote:
>
>
>
> Thanks - I've been looking at that a bit... It probably would make
> things a whole lot easier, but I'm working on Naive Bayes and trying
> to keep it in the math-scala package (I don't know how well this is
> going to work because I haven't made my way to model serialization
> yet).
>
> Thinking more about it though, using an indexed dataset might make
> online training/updating of the weights a whole lot easier if we end
> up implementing that.
>
> Also I think that an IndexedDataset will probably be useful for
> classifying new documents, where we do need to keep the dictionary in
> memory.
>
> Right now, I just need the labels up front in a vector so that I can
> extract the category and broadcast a categoryByRowIndex Vector out to
> a combiner using something like:
>
> IntKeyedTFIDFDrm.t.mapBlock(ncols = numCategories) {
>   // aggregate cols by category
> }.t
>
> After that we only need a relatively small Vector or Map of rows
> (categories) and don't need column labels, as long as we're using
> seq2sparse. It may make sense, though, to use something like an
> IndexedDataset here in the future if we want to move away from
> seq2sparse in its current implementation.
>
> I'm honestly not sure how well this label extraction and aggregation
> is going to turn out performance-wise, but my thinking was that we can
> put an implementation in math-scala and then extend and optimize it in
> Spark if we want, i.e. rather than writing a combiner using mapBlock,
> use Spark's reduceByKey.
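Stripped of the DRM machinery, the combiner step described above is just a group-and-sum of document rows by category label. Here is a toy plain-Scala sketch of that aggregation; the data, the `Vector[Vector[Double]]` matrix standing in for the DRM block, and the label vector standing in for the broadcast are all invented for illustration:

```scala
// Toy stand-in for the per-block category aggregation: rows of a small
// term matrix are element-wise summed into one row per category label.
val labels = Vector(0, 1, 0, 1) // categoryByRowIndex stand-in
val tfidf = Vector(             // 4 docs x 3 terms, invented values
  Vector(1.0, 0.0, 2.0),
  Vector(0.0, 3.0, 0.0),
  Vector(2.0, 1.0, 0.0),
  Vector(0.0, 0.0, 4.0)
)

// Group rows by their category label and element-wise sum each group.
val aggregated: Map[Int, Vector[Double]] =
  labels.zip(tfidf)
    .groupBy { case (label, _) => label }
    .map { case (label, rows) =>
      label -> rows.map(_._2).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    }

println(aggregated(0)) // Vector(3.0, 1.0, 2.0)
println(aggregated(1)) // Vector(0.0, 3.0, 4.0)
```

A Spark `reduceByKey` version would do the same pairwise element-wise sum, keyed by category, without materializing the groups first.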
>
Note that there is no way (yet) to perform an aggregate- or reduce-like
operation through the DSL. Though the backends (both Spark and h2o) support
reduce-like operations, there is no DSL operator for that yet. We could
either introduce a reduce/aggregate operator in as engine-neutral and
close-to-algebraic a way as possible, or keep any kind of
reduction/aggregation phase backend-specific (which kind of sucks).
Thanks
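To make the "engine-neutral reduce" idea concrete, here is a hypothetical sketch of what such a contract could look like: the DSL caller supplies only a per-block fold and an associative combine, and each backend maps those onto its native reduce. Everything here (trait name, `ToyDrm`, the method shape) is invented for illustration and is not existing Mahout API:

```scala
// Hypothetical, engine-neutral reduce-over-blocks contract (NOT Mahout API).
// A backend supplies the partitioned blocks; the caller supplies fold/combine.
trait BlockReducible[B] {
  def blocks: Seq[B] // backend-partitioned blocks
  def aggregate[A](zero: A)(fold: (A, B) => A)(combine: (A, A) => A): A =
    blocks.map(b => fold(zero, b)).reduceOption(combine).getOrElse(zero)
}

// Toy "backend" whose blocks are just arrays of doubles.
class ToyDrm(val blocks: Seq[Array[Double]]) extends BlockReducible[Array[Double]]

val drm = new ToyDrm(Seq(Array(1.0, 2.0), Array(3.0), Array(4.0, 5.0)))

// Fold within each block, then combine across blocks.
val total = drm.aggregate(0.0)((acc, blk) => acc + blk.sum)(_ + _)
println(total) // 15.0
```

A Spark backend could implement the same contract with `rdd.aggregate`, and h2o with its own MapReduce primitive, keeping the caller's code identical.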
> > Subject: Re: drmFromHDFS rowLabelBindings question
> > From: [email protected]
> > Date: Fri, 12 Sep 2014 14:41:35 -0700
> > To: [email protected]
> >
> > Not sure if this helps but we (Sebastian and I) created an
> > IndexedDataset which maintains row and column HashBiMaps that use the
> > Int key to map to/from Strings. There are Reader and Writer traits for
> > file IO (text files for now). The flow is to read an IndexedDataset
> > using the Reader trait. Inside the IndexedDataset you have a
> > CheckpointedDrm and two label BiMaps for rows and columns. This method
> > is used in the row and item similarity jobs where you do math things
> > like B.t %*% A. After you do the math using the drm contained in the
> > IndexedDataset, you assign the correct dictionaries to the resulting
> > IndexedDataset to maintain your labels for writing or further math. It
> > might make sense to implement some of the math ops that would work
> > with this simple approach, but in any case you can do it explicitly as
> > those jobs do. The idea was to support other file formats like
> > sequence files as the need comes up.
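The row/column dictionary idea can be sketched without Guava's HashBiMap by keeping a forward map and its inverse together. The class and method names below are invented for illustration and are not the actual IndexedDataset API:

```scala
// Toy bidirectional dictionary: Int row/column keys <-> String labels.
// A plain-Scala stand-in for the HashBiMap pair an IndexedDataset carries.
case class Dictionary(forward: Map[Int, String]) {
  // Inverse view; assumes labels are unique, as a BiMap would enforce.
  val inverse: Map[String, Int] = forward.map(_.swap)
  def label(id: Int): String = forward(id)
  def id(label: String): Int = inverse(label)
}

val rowDict = Dictionary(Map(0 -> "user_a", 1 -> "user_b"))
println(rowDict.label(1))     // user_b
println(rowDict.id("user_a")) // 0
```

The point of the bidirectional view is exactly the flow described above: drop to Int keys for the math, then translate back to String labels for writing results.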
> >
> > On Sep 12, 2014, at 1:14 PM, Andrew Palumbo <[email protected]> wrote:
> >
> > It doesn't look like it has anything to do with the conversion.
> >
> > after:
> >
> > val rowBindings = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
> >
> > rowBindings.size is one
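One way `rowBindings.size` can come out as one is if every `t._1._1.toString` yields the same string: `toMap` keeps only the last value for each duplicate key, so identical keys collapse to a single entry. A minimal illustration with made-up data:

```scala
// toMap keeps only the last value seen for each duplicate key, so a
// collection whose pairs all share one key collapses to a single entry.
val pairs = Seq(("doc", 0), ("doc", 1), ("doc", 2))
val rowBindings = pairs.toMap
println(rowBindings)      // Map(doc -> 2)
println(rowBindings.size) // 1
```

Printing a few of the raw keys before calling `toMap` would confirm or rule this out.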
> >
> > From: [email protected]
> > To: [email protected]
> > Subject: RE: drmFromHDFS rowLabelBindings question
> > Date: Fri, 12 Sep 2014 15:53:48 -0400
> >
> >
> >
> >
> > Thanks guys, I was wondering about the java.util.Map conversion too.
> I'll try copying everything into a java.util.HashMap and passing that to
> setRowBindings. I'll play around with it and if I can't get it to work,
> I'll file a JIRA.
> >
> > I'm just using it in the NB implementation so it's not a pressing issue.
> >
> > Appreciate it.
> >
> > > Date: Fri, 12 Sep 2014 12:35:21 -0700
> > > Subject: Re: drmFromHDFS rowLabelBindings question
> > > From: [email protected]
> > > To: [email protected]
> > >
> > > On Fri, Sep 12, 2014 at 12:17 PM, Anand Avati <[email protected]>
> wrote:
> > >
> > >>
> > >>
> > >> On Fri, Sep 12, 2014 at 12:00 PM, Anand Avati <[email protected]>
> wrote:
> > >>
> > >>>
> > >>>
> > >>> On Fri, Sep 12, 2014 at 11:57 AM, Dmitriy Lyubimov <
> [email protected]>
> > >>> wrote:
> > >>>
> > >>>> But if you are really convinced that it is something that might
> > >>>> be needed, the best way would probably be to create an optional
> > >>>> parameter to collect (something like
> > >>>> drmLike.collect(extractLabels: Boolean = false)) which you can
> > >>>> flip to true if needed, and the thing does toString on keys and
> > >>>> assigns them to the in-core matrix's row labels. (Requires a
> > >>>> patch, of course.)
> > >>>>
> > >>>>
> > >>> As I mentioned in the other mail, this is already the case. The code
> > >>> seems to assume .toMap internally does collect. My (somewhat wild)
> > >>> suspicion is that this line is somehow fooling the eye:
> > >>>
> > >>> val rowBindings = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
> > >>>
> > >>>
> > >>>
> > >> Argh, for a moment I was thinking `d` is still an rdd. It is
> > >> actually all in-core, as the entirety of the rdd is collected up
> > >> front into `data`. In any case I suspect the non-int key collecting
> > >> code might be doing something funny.
> > >>
> > >
> > > One problem I see is that toMap() returns a scala.collection.Map,
> > > whereas the next line, m.setRowLabelBindings, accepts a
> > > java.util.Map. Since the code compiles fine there is probably an
> > > implicit conversion happening somewhere, and I don't know if the
> > > conversion is doing the right thing. Other than this, the rest of
> > > the code seems fine.
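One way to rule the implicit conversion in or out is to make it explicit with `asJava` from the standard library's `JavaConverters`; if the explicit version still shows the bad size, the conversion is not the culprit. A small sketch with invented data (the types match the rowBindings snippet above):

```scala
import scala.collection.JavaConverters._

// Invented labels; same key/value types as the rowBindings snippet.
val scalaMap: Map[String, java.lang.Integer] =
  Map("row0" -> Int.box(0), "row1" -> Int.box(1))

// Explicit conversion: a java.util.Map view backed by the Scala map.
val javaMap: java.util.Map[String, java.lang.Integer] = scalaMap.asJava

println(javaMap.size())      // 2
println(javaMap.get("row1")) // 1
```

Passing `javaMap` straight to setRowLabelBindings removes any reliance on whichever implicit the compiler picked.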
> >
> >
>
>
>