Thanks, I've been looking at that a bit. It would probably make things
a whole lot easier, but I'm working on Naive Bayes and trying to keep
it in the math-scala package (I don't know how well this is going to
work because I haven't made my way to model serialization yet).

Thinking more about it, though, using an indexed dataset might make
online training/updating of the weights a whole lot easier if we end up
implementing that.

Also, I think that an IndexedDataset will probably be useful for
classifying new documents, where we do need to keep the dictionary in
memory.

Right now, I just need the labels up front in a vector so that I can
extract the category and broadcast a categoryByRowIndex Vector out to a
combiner using something like:
    IntKeyedTFIDFDrm.t.mapBlock(ncols=numcategories){
      // aggregate cols by category
    }.t
After that we only need a relatively small Vector or Map of rows
(categories) and don't need column labels as long as we're using
seq2sparse. It may make sense, though, to use something like an
IndexedDataset here in the future if we want to move away from
seq2sparse in its current implementation.

I'm honestly not sure how well this label extraction and aggregation is
going to turn out performance-wise. But my thinking was that we can put
an implementation in math-scala and then extend and optimize it in spark
if we want, i.e. rather than writing a combiner using mapBlock, use
Spark's reduceByKey.
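
To make the aggregation step above concrete, here is a minimal in-core
sketch in plain Scala. It is not Mahout DSL code: the object name, the
aggregateByCategory helper, and the use of dense Array[Double] rows are
all assumptions made for illustration. It only shows the idea of summing
TF-IDF rows into one row per category using a categoryByRowIndex lookup.

```scala
object CategoryAggregationSketch {
  // Hypothetical helper, not part of Mahout.
  // rows: one dense TF-IDF vector per document (a stand-in for DRM rows).
  // categoryByRowIndex(i): category index of row i.
  // Returns one summed feature vector per category.
  def aggregateByCategory(
      rows: IndexedSeq[Array[Double]],
      categoryByRowIndex: IndexedSeq[Int],
      numCategories: Int): Array[Array[Double]] = {
    require(rows.length == categoryByRowIndex.length, "one category per row")
    val numFeatures = if (rows.isEmpty) 0 else rows.head.length
    val sums = Array.fill(numCategories, numFeatures)(0.0)
    for (i <- rows.indices) {
      val target = sums(categoryByRowIndex(i))
      val row = rows(i)
      var j = 0
      while (j < numFeatures) {
        target(j) += row(j) // accumulate feature j into the row's category
        j += 1
      }
    }
    sums
  }
}
```

For example, with rows (1.0, 0.0), (0.0, 2.0), (3.0, 1.0) and categories
0, 1, 0, category 0 sums to (4.0, 1.0) and category 1 to (0.0, 2.0). In a
distributed setting the same sum could presumably be expressed by keying
each row by its category and reducing by key.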
> Subject: Re: drmFromHDFS rowLabelBindings question
> From: [email protected]
> Date: Fri, 12 Sep 2014 14:41:35 -0700
> To: [email protected]
>
> Not sure if this helps but we (Sebastian and I) created an IndexedDataset
> which maintains row and column HashBiMaps that use the Int key to map to/from
> Strings. There are Reader and Writer traits for file IO (text files for now).
> The flow is to read an IndexedDataset using the Reader trait. Inside the
> IndexedDataset you have a CheckpointedDrm and two label BiMaps for rows and
> columns. This method is used in the row and item similarity jobs where you do
> math things like B.t %*% A. After you do the math using the drm contained in
> the IndexedDataset you assign the correct dictionaries to the resulting
> IndexedDataset to maintain your labels for writing or further math. It might
> make sense to implement some of the math ops that would work with this simple
> approach but in any case you can do it explicitly as those jobs do. The idea
> was to support other file formats like sequence files as the need comes up.
>
> On Sep 12, 2014, at 1:14 PM, Andrew Palumbo <[email protected]> wrote:
>
> It doesn't look like it has anything to do with the conversion.
>
> after:
>
> val rowBindings = d.map(t => (t._1._1.toString, t._2:
> java.lang.Integer)).toMap
>
> rowBindings.size is one
>
> From: [email protected]
> To: [email protected]
> Subject: RE: drmFromHDFS rowLabelBindings question
> Date: Fri, 12 Sep 2014 15:53:48 -0400
>
>
>
>
> Thanks guys, I was wondering about the java.util.Map conversion too. I'll
> try copying everything into a java.util.HashMap and passing that to
> setRowLabelBindings. I'll play around with it and if I can't get it to work,
> I'll file a JIRA.
>
> I'm just using it in the NB implementation so it's not a pressing issue.
>
> Appreciate it.
>
> > Date: Fri, 12 Sep 2014 12:35:21 -0700
> > Subject: Re: drmFromHDFS rowLabelBindings question
> > From: [email protected]
> > To: [email protected]
> >
> > On Fri, Sep 12, 2014 at 12:17 PM, Anand Avati <[email protected]> wrote:
> >
> >>
> >>
> >> On Fri, Sep 12, 2014 at 12:00 PM, Anand Avati <[email protected]> wrote:
> >>
> >>>
> >>>
> >>> On Fri, Sep 12, 2014 at 11:57 AM, Dmitriy Lyubimov <[email protected]>
> >>> wrote:
> >>>
> >>>> but if you are really convinced that it is something that might be
> >>>> needed, the best way would probably be to create an optional parameter
> >>>> to collect (something like
> >>>> drmLike.collect(extractLabels: Boolean = false)) which you can flip to
> >>>> true if needed, so that it calls toString on the keys and assigns them
> >>>> to the in-core matrix's row labels (requires a patch, of course).
> >>>>
> >>>>
> >>> As I mentioned in the other mail, this is already the case. The code
> >>> seems to assume .toMap internally does collect. My (somewhat wild)
> >>> suspicion is that this line is somehow fooling the eye:
> >>>
> >>> val rowBindings = d.map(t => (t._1._1.toString, t._2:
> >>> java.lang.Integer)).toMap
> >>>
> >>>
> >>>
> >> Argh, for a moment I was thinking `d` is still an rdd. It is actually all
> >> in-core, as the entirety of the rdd is collected up front into `data`. In
> >> any case I suspect the non-int key collecting code might be doing something
> >> funny.
> >>
> >
> > One problem I see is that toMap returns a scala.collection.immutable.Map,
> > whereas the next line, m.setRowLabelBindings, accepts a java.util.Map.
> > Since the code compiles fine there is probably an implicit conversion
> > happening somewhere, and I don't know if the conversion is doing the right
> > thing. Other than this, the rest of the code looks fine.
>
>
>
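
On the Scala Map versus java.util.Map question raised at the bottom of the
thread, one way to take implicit conversions out of the picture is to
build the java.util.HashMap by hand. The sketch below is only an
illustration under that assumption: the object and method names are
hypothetical, and it uses nothing beyond the Scala and Java standard
libraries.

```scala
import java.util.{HashMap => JHashMap, Map => JMap}

object RowBindingsSketch {
  // Hypothetical helper: copies (label, rowIndex) pairs into a
  // java.util.Map with boxed Integer values, with no implicit conversion.
  def toJavaBindings(labels: Seq[(String, Int)]): JMap[String, Integer] = {
    val out = new JHashMap[String, Integer]()
    labels.foreach { case (label, idx) =>
      out.put(label, Int.box(idx)) // box explicitly to java.lang.Integer
    }
    out
  }
}
```

Note that any map keyed by label keeps only one entry per distinct label,
so if every row produced the same label string the result would have size
one regardless of how the conversion is done.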