The serialization can live in engine-specific modules, as with cooccurrence and ItemSimilarity: cooccurrence is in math-scala, ItemSimilarity is the engine-specific driver. There is nothing engine-specific about IndexedDatasets, and an optimization that has not been made yet is to allow one or no dictionaries where the keys alone suffice.
Not sure what you want for initial input, but you could start with a driver in the engine-specific Spark module: read in the IndexedDataset, pass it to your math code, work with the CheckpointedDrm using the DSL and the dictionary, then when done return an IndexedDataset to the driver for serialization. There's also no reason the serialization couldn't be implemented in H2O; in fact I think it would be easier, since it has richer text file types than Spark. Anand's point about reducers is going to require either divergence or more engine-neutral abstractions. I think serialization is in the same boat.

On Sep 12, 2014, at 4:31 PM, Anand Avati <[email protected]> wrote:

On Fri, Sep 12, 2014 at 4:12 PM, Andrew Palumbo <[email protected]> wrote:

> Thanks, I've been looking at that a bit. It probably would make things a
> whole lot easier, but I'm working on Naive Bayes and trying to keep it in
> the math-scala package (I don't know how well this is going to work
> because I haven't made my way to model serialization yet).
>
> Thinking more of it, though, using an IndexedDataset might make online
> training/updating of the weights a whole lot easier if we end up
> implementing that.
>
> Also, I think that an IndexedDataset will probably be useful for
> classifying new documents, where we do need to keep the dictionary in
> memory.
>
> Right now I just need the labels up front in a vector so that I can
> extract the category and broadcast a categoryByRowindex Vector out to a
> combiner using something like:
>
>     IntKeyedTFIDFDrm.t.mapBlock(ncols = numcategories) {
>       // aggregate cols by category
>     }.t
>
> After that we only need a relatively small Vector or Map of rows
> (categories) and don't need column labels as long as we're using
> seq2sparse. It may make sense, though, to use something like an
> IndexedDataset here in the future if we want to move away from
> seq2sparse in its current implementation.
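[Editor's sketch] The "aggregate cols by category" idea in the mapBlock snippet above can be pictured with plain in-core Scala: sum each category's row vectors into one vector per category. Everything here is a made-up illustration (`aggregateByCategory`, `categoryOfRow`, and the toy data are not Mahout API); a real implementation would do this block-wise inside mapBlock on the DRM.

```scala
// Sketch: aggregate document (row) vectors by category label, the way the
// mapBlock combiner described above would. All names and data are hypothetical.
object CategoryAggregation {
  def aggregateByCategory(
      rows: Seq[Array[Double]], // one TF-IDF vector per document
      categoryOfRow: Seq[Int],  // category index for each row
      numCategories: Int): Array[Array[Double]] = {
    val ncols = rows.head.length
    val acc = Array.fill(numCategories, ncols)(0.0)
    // Add each row into the accumulator slot for its category.
    for ((row, i) <- rows.zipWithIndex; j <- 0 until ncols)
      acc(categoryOfRow(i))(j) += row(j)
    acc
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Array(1.0, 0.0), Array(2.0, 1.0), Array(0.0, 3.0))
    val cats = Seq(0, 0, 1) // rows 0 and 1 belong to category 0, row 2 to category 1
    val agg = aggregateByCategory(rows, cats, 2)
    println(agg.map(_.mkString(",")).mkString(";")) // prints: 3.0,1.0;0.0,3.0
  }
}
```

Distributed, this is exactly the per-block partial sum a combiner (or Spark's reduceByKey) would merge across partitions.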
> I'm honestly not sure how well this label extraction and aggregation is
> going to turn out performance-wise, but my thinking was that we can put
> an implementation in math-scala and then extend and optimize it in Spark
> if we want, i.e. rather than writing a combiner using mapBlock, use
> Spark's reduceByKey.

Note that there is no way (yet) to perform an aggregate- or reduce-like operation through the DSL. Though the backends (both Spark and H2O) support reduce-like operations, there is no DSL operator for that yet. We could either introduce a reduce/aggregate operator in as engine-neutral and close to algebraic a way as possible, or keep any kind of reduction/aggregate phase of the operation backend-specific (which kind of sucks).

> Thanks
>
>> Subject: Re: drmFromHDFS rowLabelBindings question
>> From: [email protected]
>> Date: Fri, 12 Sep 2014 14:41:35 -0700
>> To: [email protected]
>>
>> Not sure if this helps, but we (Sebastian and I) created an
>> IndexedDataset which maintains row and column HashBiMaps that use the
>> Int key to map to/from Strings. There are Reader and Writer traits for
>> file IO (text files for now). The flow is to read an IndexedDataset
>> using the Reader trait. Inside the IndexedDataset you have a
>> CheckpointedDrm and two label BiMaps, for rows and columns. This method
>> is used in the row and item similarity jobs, where you do math things
>> like B.t %*% A. After you do the math using the drm contained in the
>> IndexedDataset, you assign the correct dictionaries to the resulting
>> IndexedDataset to maintain your labels for writing or further math. It
>> might make sense to implement some of the math ops that would work with
>> this simple approach, but in any case you can do it explicitly as those
>> jobs do. The idea was to support other file formats, like sequence
>> files, as the need comes up.
>>
>> On Sep 12, 2014, at 1:14 PM, Andrew Palumbo <[email protected]> wrote:
>>
>> It doesn't look like it has anything to do with the conversion.
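[Editor's sketch] The row/column HashBiMaps described above map String labels to/from Int keys in both directions. A minimal pure-Scala stand-in shows the idea; `BiDict` is a made-up name for illustration, not Mahout's actual implementation (which holds HashBiMaps inside the IndexedDataset).

```scala
// Minimal bidirectional dictionary: String label <-> Int key, mirroring the
// role of the row/column label BiMaps inside an IndexedDataset. Sketch only.
class BiDict {
  private val forward = scala.collection.mutable.Map[String, Int]()
  private val backward = scala.collection.mutable.Map[Int, String]()

  // Return the Int key for a label, assigning the next free key if unseen.
  def keyOf(label: String): Int =
    forward.getOrElseUpdate(label, {
      val k = forward.size
      backward(k) = label
      k
    })

  // Recover the label for a key, e.g. when writing results back out as text.
  def labelOf(key: Int): Option[String] = backward.get(key)
}

object BiDictDemo {
  def main(args: Array[String]): Unit = {
    val dict = new BiDict
    val k = dict.keyOf("doc-42") // first label seen gets key 0
    println(s"$k ${dict.labelOf(k).get}") // prints: 0 doc-42
  }
}
```

After the math (e.g. B.t %*% A) the same dictionaries are simply re-attached to the result, since the Int row/column keys survive the algebra unchanged.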
>> After:
>>
>>     val rowBindings = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
>>
>> rowBindings.size is one.
>>
>> From: [email protected]
>> To: [email protected]
>> Subject: RE: drmFromHDFS rowLabelBindings question
>> Date: Fri, 12 Sep 2014 15:53:48 -0400
>>
>> Thanks guys, I was wondering about the java.util.Map conversion too.
>> I'll try copying everything into a java.util.HashMap and passing that
>> to setRowBindings. I'll play around with it, and if I can't get it to
>> work, I'll file a jira.
>>
>> I'm just using it in the NB implementation, so it's not a pressing issue.
>>
>> Appreciate it.
>>
>>> Date: Fri, 12 Sep 2014 12:35:21 -0700
>>> Subject: Re: drmFromHDFS rowLabelBindings question
>>> From: [email protected]
>>> To: [email protected]
>>>
>>> On Fri, Sep 12, 2014 at 12:17 PM, Anand Avati <[email protected]> wrote:
>>>
>>>> On Fri, Sep 12, 2014 at 12:00 PM, Anand Avati <[email protected]> wrote:
>>>>
>>>>> On Fri, Sep 12, 2014 at 11:57 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>
>>>>>> But if you are really convinced that it is something that might be
>>>>>> needed, the best way would probably be to create an optional
>>>>>> parameter to collect (something like
>>>>>> drmLike.collect(extractLabels: Boolean = false)) which you can flip
>>>>>> to true if needed, and the thing does toString on the keys and
>>>>>> assigns them to the in-core matrix's row labels (requires a patch,
>>>>>> of course).
>>>>>
>>>>> As I mentioned in the other mail, this is already the case. The code
>>>>> seems to assume .toMap internally does collect. My (somewhat wild)
>>>>> suspicion is that this line is somehow fooling the eye:
>>>>>
>>>>>     val rowBindings = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
>>>>>
>>>> Argh, for a moment I was thinking `d` is still an RDD. It is actually
>>>> all in-core, as the entirety of the RDD is collected up front into
>>>> `data`. In any case, I suspect the non-int-key collecting code might
>>>> be doing something funny.
>>>
>>> One problem I see is that toMap() returns a scala.collection.Map,
>>> whereas the next line, m.setRowLabelBindings, accepts a java.util.Map.
>>> Since the code compiles fine, there is probably an implicit conversion
>>> happening somewhere, and I don't know if the conversion is doing the
>>> right thing. Other than this, the rest of the code seems to look fine.
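[Editor's sketch] Two points in the exchange above can be checked in plain Scala. First, `toMap` keeps only the last value for each key, so if every extracted key stringifies to the same value, the resulting map has size one, which would explain `rowBindings.size` being one. Second, a scala Map can be converted to a `java.util.Map` explicitly, avoiding any reliance on an implicit conversion before calling a Java API like setRowLabelBindings (not called here; everything shown is standard-library Scala, Scala 2.13+ for `CollectionConverters`).

```scala
import scala.jdk.CollectionConverters._

object MapConversionDemo {
  def main(args: Array[String]): Unit = {
    // If every tuple produces the same String key, toMap collapses to size 1.
    val collided = Seq((1, "a"), (2, "b"), (3, "c")).map(t => ("same", t._2)).toMap
    println(collided.size) // prints: 1

    // Distinct keys survive as expected.
    val rowBindings = Seq(("row0", 0), ("row1", 1)).map {
      case (k, v) => (k, v: java.lang.Integer)
    }.toMap
    println(rowBindings.size) // prints: 2

    // Explicit conversion to java.util.Map, no implicit required.
    val javaMap: java.util.Map[String, java.lang.Integer] = rowBindings.asJava
    println(javaMap.size) // prints: 2
  }
}
```

So the first thing to verify is what `t._1._1.toString` actually produces for each key; if the keys are not distinct Strings, the collapse happens before the java conversion is ever involved.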
