> All the discussion about engine neutral and specific bits is only going to
> come up more and more. Dmitriy speaks for the neutrality of “math” by which I
> take it to mean “math-scala” and stuff in the DSL. Maybe engine neutral bits
> that don’t fit in that can be put in another module to save fighting over it.
> I once proposed “core-scala”. For that matter cooccurrence isn’t really math
> or DSL (maybe that’s what D means by quasi) and so might be better put in
> core-scala too. Inclusion means the code uses but does not extend the DSL and
> the pom doesn’t include an engine dependency.
>
I think that this makes sense if we want to put the engine neutral sections of
our underlying classifier/clustering/recommender algorithms and maybe some I/O
traits into a separate module and keep the DSL and the more purely algebraic
algos separate in mahout-math. Then we can just mirror the packages and extend
them (and their test suites as Dmitriy did with the math-scala tests) into
h2o/spark/flink modules. Then we can do the engine-specific optimization and
engine-specific I/O there as needed.
Same as what is being done now with mahout-math.
I would think that it would be important to have complete algorithms (with
empty I/O traits?) in the engine-neutral packages, which is why I've been trying
to implement the clunky extractLabelsAndAggregateObservations method in naive
bayes for math-scala in an engine-agnostic way.
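To make the layering concrete, here is a rough plain-Scala sketch of the shape I have in mind (every name here is made up for illustration, not proposed API):

```scala
// Stand-in for a distributed matrix handle (in Mahout this would be a DrmLike[K]).
case class Drm(rows: Array[Array[Double]])

// "Empty" I/O trait: engine modules (spark/h2o/flink) supply the real reader.
trait ModelIO {
  def readDrm(path: String): Drm
}

// Engine-neutral algorithm: uses only the abstract handle, never an engine type.
abstract class NBTrainer extends ModelIO {
  def train(path: String): Int = {
    val drm = readDrm(path)
    drm.rows.length  // placeholder for the real math
  }
}

// An engine module would extend it with concrete, optimized I/O;
// here an in-core stub stands in for a Spark/H2O reader.
object InCoreTrainer extends NBTrainer {
  def readDrm(path: String): Drm = Drm(Array(Array(1.0, 2.0), Array(3.0, 4.0)))
}
```

The point is just that the algorithm body compiles without any engine on the classpath, and the engine modules only fill in the I/O.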
> On Sep 12, 2014, at 6:44 PM, ap.dev <[email protected]> wrote:
>
> Oh thx- I thought indexedDatasets were spark specific.
>
>
> -------- Original message --------
> From: Pat Ferrel <[email protected]>
> Date: 09/12/2014 7:52 PM (GMT-05:00)
> To: [email protected]
> Subject: Re: drmFromHDFS rowLabelBindings question
> The serialization can be in engine-specific modules, as with cooccurrence and
> ItemSimilarity: cooccurrence is in math-scala, ItemSimilarity is the
> engine-specific driver. There is nothing engine specific about IndexedDatasets,
> and an optimization not yet made is to allow one or no dictionaries where
> the keys suffice.
>
> Not sure what you want for initial input but you could start with a driver in
> the engine specific spark module, read in the IndexedDataset then pass it to
> your math code, work with the CheckpointedDrm using the DSL and dictionary
> then when done return an IndexedDataset to the driver for serialization.
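That flow, sketched with plain-Scala stand-ins (this toy IndexedDataset is not the Mahout class, just the shape of the hand-off):

```scala
// Toy stand-in for Mahout's IndexedDataset: a matrix plus two dictionaries.
case class IndexedDataset(
    matrix: Array[Array[Double]],  // stand-in for the CheckpointedDrm
    rowIDs: Map[String, Int],      // row dictionary
    colIDs: Map[String, Int])      // column dictionary

// 1. The engine-specific driver reads the IndexedDataset (stubbed here).
def read(path: String): IndexedDataset =
  IndexedDataset(Array(Array(1.0, 0.0), Array(5.0, 2.0)),
    Map("userA" -> 0, "userB" -> 1), Map("item1" -> 0, "item2" -> 1))

// 2. Engine-neutral math works on the matrix alone (here: a transpose).
def mathStep(m: Array[Array[Double]]): Array[Array[Double]] = m.transpose

// 3. Reattach the right dictionaries to the result before serialization.
val in  = read("some/input/path")
val out = in.copy(matrix = mathStep(in.matrix),
                  rowIDs = in.colIDs, colIDs = in.rowIDs)
```

The dictionaries ride along untouched while the math happens on the bare matrix, which is the whole trick.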
>
> There’s also no reason that the serialization couldn’t also be implemented in
> H2O; in fact I think it would be easier since they have richer text file
> types than Spark.
>
> Anand’s point about reducers is going to require either divergence or more
> engine neutral abstractions. I think serialization is in the same boat.
>
> On Sep 12, 2014, at 4:31 PM, Anand Avati <[email protected]> wrote:
>
> On Fri, Sep 12, 2014 at 4:12 PM, Andrew Palumbo <[email protected]> wrote:
>
> >
> >
> >
> > Thanks- I've been looking at that a bit. It probably would make things
> > a whole lot easier but I'm working on Naive Bayes, and trying to keep
> > it in the math-scala package (I don't know how well this is going to
> > work because I haven't made my way to model serialization yet).
> >
> > Thinking
> > about it more, though, using an IndexedDataset might make online
> > training/updating of the weights a whole lot easier if we end up
> > implementing that.
> >
> > Also I think that an IndexedDataset will
> > probably be useful for classifying new documents where we do need to
> > keep the dictionary in memory.
> >
> > Right now, I just need the
> > labels up front in a vector so that I can extract the category and
> > broadcast a categoryByRowIndex Vector out to a combiner using something
> > like:
> >
> > IntKeyedTFIDFDrm.t.mapBlock(ncol = numCategories) {
> >   // aggregate columns by category
> > }.t
> >
> > After
> > that we only need a relatively small Vector or Map of rows (categories)
> > and don't need column labels as long as we're using seq2sparse. It may
> > make sense though to use something like an IndexedDataset here in the
> > future if we want to move away from seq2sparse in its current
> > implementation.
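The aggregation itself, shown in-core with plain Scala collections (a sketch of the intent, not the DSL code): sum all document rows that share a category label.

```scala
// Per-row category labels and the corresponding TF-IDF row vectors.
val labels = Array("sport", "news", "sport")
val tfidf  = Array(Array(1.0, 2.0), Array(3.0, 0.0), Array(0.0, 4.0))

// Group rows by label and sum them element-wise, one vector per category.
val aggregated: Map[String, Array[Double]] =
  labels.zip(tfidf).groupBy(_._1).map { case (cat, rows) =>
    cat -> rows.map(_._2).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  }
```

A combiner via mapBlock (or reduceByKey on Spark) would do the same thing block-wise on the distributed matrix.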
> >
> > I'm honestly not sure how well this label
> > extraction and aggregation is going to turn out performance-wise, but
> > my thinking was that we can put an implementation in math-scala and then
> > extend and optimize it in Spark if we want, i.e. rather than writing a
> > combiner using mapBlock, use Spark's reduceByKey.
> >
>
> Note that there is no way (yet) to perform an aggregate or reduce-like
> operation through the DSL. Though the backends (both Spark and H2O) support
> reduce-like operations, there is no DSL operator for that yet. We could
> either introduce a reduce/aggregate operator in as engine-neutral and
> close-to-algebraic a way as possible, or keep any reduction/aggregate phase
> of an operation backend specific (which kind of sucks).
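One hypothetical shape for such an operator, in plain Scala (reduceBlocks is invented here for illustration and is not a Mahout operator): each engine hands over its in-core blocks, and the DSL folds them with a user-supplied associative function.

```scala
// Hypothetical engine-neutral reduce: fold per-partition in-core blocks
// with a user function; only the block hand-off is engine specific.
def reduceBlocks(blocks: Seq[Array[Double]])
                (op: (Array[Double], Array[Double]) => Array[Double]): Array[Double] =
  blocks.reduce(op)

// e.g. element-wise sums of per-partition partial results:
val partials = Seq(Array(1.0, 2.0), Array(3.0, 4.0))
val sums = reduceBlocks(partials)((a, b) => a.zip(b).map { case (x, y) => x + y })
```

Each backend would implement only the "collect the blocks and fold" part, so the operator stays close to algebraic from the user's point of view.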
>
> Thanks
>
>
>
> >> Subject: Re: drmFromHDFS rowLabelBindings question
> >> From: [email protected]
> >> Date: Fri, 12 Sep 2014 14:41:35 -0700
> >> To: [email protected]
> >>
> >> Not sure if this helps, but we (Sebastian and I) created an
> > IndexedDataset which maintains row and column HashBiMaps that use the Int
> > key to map to/from Strings. There are Reader and Writer traits for file I/O
> > (text files for now). The flow is to read an IndexedDataset using the
> > Reader trait. Inside the IndexedDataset you have a CheckpointedDrm and two
> > label BiMaps for rows and columns. This method is used in the row and item
> > similarity jobs where you do math things like B.t %*% A. After you do the
> > math using the drm contained in the IndexedDataset, you assign the correct
> > dictionaries to the resulting IndexedDataset to maintain your labels for
> > writing or further math. It might make sense to implement some of the math
> > ops that would work with this simple approach, but in any case you can do it
> > explicitly as those jobs do. The idea was to support other file formats
> > like sequence files as the need comes up.
> >>
> >> On Sep 12, 2014, at 1:14 PM, Andrew Palumbo <[email protected]> wrote:
> >>
> >> It doesn't look like it has anything to do with the conversion.
> >>
> >> after:
> >>
> >> val rowBindings = d.map(t => (t._1._1.toString, t._2:
> > java.lang.Integer)).toMap
> >>
> >> rowBindings.size is one
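For what it's worth, a size of one is exactly what .toMap produces when every pair ends up with the same key, since later entries silently overwrite earlier ones:

```scala
// If t._1._1.toString came out identical for every row, .toMap would
// collapse the whole collection down to a single entry (last value wins).
val pairs = Seq(("sameKey", 1), ("sameKey", 2), ("sameKey", 3))
val m = pairs.toMap
// m has a single entry, holding the last value
```

So checking whether the stringified keys are actually distinct would be the first thing I'd look at.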
> >>
> >> From: [email protected]
> >> To: [email protected]
> >> Subject: RE: drmFromHDFS rowLabelBindings question
> >> Date: Fri, 12 Sep 2014 15:53:48 -0400
> >>
> >>
> >>
> >>
> >> Thanks guys, I was wondering about the java.util.Map conversion too.
> > I'll try copying everything into a java.util.HashMap and passing that to
> > setRowLabelBindings. I'll play around with it and if I can't get it to
> > work, I'll file a JIRA.
> >>
> >> I'm just using it in the NB implementation, so it's not a pressing issue.
> >>
> >> Appreciate it.
> >>
> >>> Date: Fri, 12 Sep 2014 12:35:21 -0700
> >>> Subject: Re: drmFromHDFS rowLabelBindings question
> >>> From: [email protected]
> >>> To: [email protected]
> >>>
> >>> On Fri, Sep 12, 2014 at 12:17 PM, Anand Avati <[email protected]>
> > wrote:
> >>>
> >>>>
> >>>>
> >>>> On Fri, Sep 12, 2014 at 12:00 PM, Anand Avati <[email protected]>
> > wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>> On Fri, Sep 12, 2014 at 11:57 AM, Dmitriy Lyubimov <
> > [email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> But if you are really convinced that it is something that might be
> >>>>>> needed, the best way would probably indeed be to create an optional
> >>>>>> parameter to collect (something like
> >>>>>> drmLike.collect(extractLabels: Boolean = false)) which you can flip
> >>>>>> to true if needed, and the thing does toString on keys and assigns
> >>>>>> them to the in-core matrix's row labels. (Requires a patch of
> >>>>>> course.)
> >>>>>>
> >>>>>>
> >>>>> As I mentioned in the other mail, this is already the case. The code
> >>>>> seems to assume .toMap internally does collect. My (somewhat wild)
> >>>>> suspicion is that this line is somehow fooling the eye:
> >>>>>
> >>>>> val rowBindings = d.map(t => (t._1._1.toString, t._2:
> > java.lang.Integer)).toMap
> >>>>>
> >>>>>
> >>>>>
> >>>> Argh, for a moment I was thinking `d` is still an RDD. It is actually
> >>>> all in-core, as the entirety of the RDD is collected up front into
> >>>> `data`. In any case I suspect the non-Int key collecting code might be
> >>>> doing something funny.
> >>>>
> >>>
> >>> One problem I see is that toMap() returns a scala.collection.Map,
> >>> whereas the next line, m.setRowLabelBindings, accepts a java.util.Map.
> >>> Since the code compiles fine there is probably an implicit conversion
> >>> happening somewhere, and I don't know if the conversion is doing the
> >>> right thing. Other than this, the rest of the code seems to look fine.
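One way to sidestep the implicit conversion entirely is the explicit copy Andrew mentioned: build a java.util.HashMap by hand before passing it along (bindings below are made up for the sketch):

```scala
// Example Scala-side bindings (values as java.lang.Integer, as the Java API wants).
val rowBindings: Map[String, java.lang.Integer] =
  Map("row0" -> Integer.valueOf(0), "row1" -> Integer.valueOf(1))

// Copy into a plain java.util.HashMap: no implicit conversion in the picture,
// and the result can be handed straight to setRowLabelBindings.
val javaBindings = new java.util.HashMap[String, java.lang.Integer]()
rowBindings.foreach { case (k, v) => javaBindings.put(k, v) }
```

If the size is still wrong after this, the problem is upstream in how the keys are produced, not in the Scala-to-Java map conversion.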
> >>
> >>
> >
> >
> >
>
>