oops, sent before getting up to date. So we’ll move this stuff into math-scala. Works for me.
On Sep 15, 2014, at 11:28 AM, Pat Ferrel <[email protected]> wrote:

I need to clean up some of this as far as the packaging goes. The base reader/writer mostly-abstract traits are engine neutral, along with IndexedDataset. These are clearly not math. I could even add the seqfile reader/writer pretty easily into the same class/trait packaging, since they are already implemented in math. In order to mix in with legacy code there will be a need for sequence file readers and writers, but moving forward, especially since intermediate results are generally not put into files, isn't text a better way to go? It's really only for import/export.

Should we create a core-scala? I'd be up for that. I'd move cf/cooccurrence, the reader/writer base, schemas and the defaults, IndexedDataset, MahoutDriver, and MahoutParser there, leaving only the Spark-implementing code in spark.

On Sep 13, 2014, at 12:18 PM, Andrew Palumbo <[email protected]> wrote:

> All the discussion about engine-neutral and engine-specific bits is only going to come up more and more. Dmitriy speaks for the neutrality of "math", by which I take it to mean "math-scala" and stuff in the DSL. Maybe engine-neutral bits that don't fit there can be put in another module to save fighting over it. I once proposed "core-scala". For that matter, cooccurrence isn't really math or DSL (maybe that's what D means by quasi) and so might be better put in core-scala too. Inclusion means the code uses but does not extend the DSL, and the pom doesn't include an engine.
>
> I think that this makes sense if we want to put the engine-neutral sections of our underlying classifier/clustering/recommender algorithms, and maybe some I/O traits, into a separate module and keep the DSL and the more purely algebraic algos separate in mahout-math. Then we can just mirror the packages and extend them (and their test suites, as Dmitriy did with the math-scala tests) into the h2o/spark/flink modules.
Then we can do the as-needed engine-specific optimization and engine-specific I/O there, the same as what is being done now with mahout-math. I would think that it would be important to have complete algorithms (with empty I/O traits?) in the engine-neutral packages, which is why I've been trying to implement the clunky extractLabelsAndAggregateObservations method in naive Bayes for math-scala in an engine-agnostic way.

> On Sep 12, 2014, at 6:44 PM, ap.dev <[email protected]> wrote:
>
> Oh thx- I thought IndexedDatasets were Spark-specific.
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
> -------- Original message --------
> From: Pat Ferrel <[email protected]>
> Date: 09/12/2014 7:52 PM (GMT-05:00)
> To: [email protected]
> Subject: Re: drmFromHDFS rowLabelBindings question
>
> The serialization can be in engine-specific modules, as with cooccurrence and ItemSimilarity: cooccurrence is in math-scala, ItemSimilarity is the engine-specific driver. There is nothing engine-specific about IndexedDatasets, and an optimization that has not been made yet is to allow one or no dictionaries where the keys suffice.
>
> Not sure what you want for initial input, but you could start with a driver in the engine-specific spark module, read in the IndexedDataset, then pass it to your math code, work with the CheckpointedDrm using the DSL and dictionary, and when done return an IndexedDataset to the driver for serialization.
>
> There's also no reason that the serialization couldn't also be implemented in H2O; in fact I think it would be easier, since they have richer text file types than Spark.
>
> Anand's point about reducers is going to require either divergence or more engine-neutral abstractions. I think serialization is in the same boat.
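[Editor's sketch.] To make the core-scala proposal above concrete, here is a minimal sketch of the engine-neutral surface being discussed. Only the names IndexedDataset, Reader, Writer, and Schema come from the thread; the trait signatures and the plain Maps standing in for Guava HashBiMaps are assumptions, not the actual Mahout code.

```scala
// Hypothetical sketch of the engine-neutral pieces proposed for core-scala.
// Plain Maps stand in for the HashBiMaps the real IndexedDataset uses, and
// the matrix type M is left generic so no engine is referenced.
object CoreScalaSketch {

  // A schema is a bag of options describing a text import/export format.
  case class Schema(delimiter: String = "\t", omitScore: Boolean = false)

  // Engine-neutral dataset: a matrix handle plus row/column dictionaries
  // mapping external String IDs to/from internal Int keys.
  case class IndexedDataset[M](
      matrix: M,
      rowIDs: Map[String, Int],
      columnIDs: Map[String, Int])

  // Engine-neutral I/O traits; a spark or h2o module supplies the concrete
  // reading/writing against its own distributed matrix type M.
  trait Reader[M] {
    def readFrom(source: String, schema: Schema): IndexedDataset[M]
  }

  trait Writer[M] {
    def writeTo(dataset: IndexedDataset[M], dest: String, schema: Schema): Unit
  }
}
```

A Spark module would then extend Reader/Writer with a CheckpointedDrm as M, without the traits themselves ever naming an engine.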
> On Sep 12, 2014, at 4:31 PM, Anand Avati <[email protected]> wrote:
>
> On Fri, Sep 12, 2014 at 4:12 PM, Andrew Palumbo <[email protected]> wrote:
>
>> Thanks- I've been looking at that a bit. It probably would make things a whole lot easier, but I'm working on naive Bayes and trying to keep it in the math-scala package (I don't know how well this is going to work because I haven't made my way to model serialization yet).
>>
>> Thinking more of it, though, using an IndexedDataset might make online training/updating of the weights a whole lot easier if we end up implementing that.
>>
>> Also I think that an IndexedDataset will probably be useful for classifying new documents, where we do need to keep the dictionary in memory.
>>
>> Right now, I just need the labels up front in a vector so that I can extract the category and broadcast a categoryByRowindex Vector out to a combiner using something like:
>>
>> IntKeyedTFIDFDrm.t.mapBlock(ncols = numcategories) {
>>   // aggregate cols by category
>> }.t
>>
>> After that we only need a relatively small Vector or Map of rows (categories) and don't need column labels as long as we're using seq2sparse. It may make sense, though, to use something like an IndexedDataset here in the future if we want to move away from seq2sparse in its current implementation.
>>
>> I'm honestly not sure how well this label extraction and aggregation is going to turn out performance-wise, but my thinking was that we can put an implementation in math-scala and then extend and optimize it in spark if we want, i.e. rather than writing a combiner using mapBlock, use Spark's reduceByKey.
>
> Note that there is no way (yet) to perform an aggregate- or reduce-like operation through the DSL. Though the backends (both Spark and H2O) support reduce-like operations, there is no DSL operator for that yet.
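[Editor's sketch.] Outside any engine, the combiner Andrew describes above (aggregate the rows of a TF-IDF matrix by category) reduces to summing row vectors per category index. A minimal in-core sketch, with plain arrays standing in for DRM blocks and the object/method names invented for illustration:

```scala
// In-core sketch of per-category row aggregation. In the DSL this would be
// a mapBlock pass over the transpose (or reduceByKey in a Spark-specific
// version); a simple accumulation loop stands in for both here.
object CategoryAggregateSketch {
  // rows: (categoryIndex, rowVector) pairs; returns a numCategories x ncols
  // matrix whose row c is the element-wise sum of all input rows labeled c.
  def aggregateByCategory(rows: Seq[(Int, Array[Double])],
                          numCategories: Int): Array[Array[Double]] = {
    val ncols = rows.head._2.length
    val acc = Array.fill(numCategories, ncols)(0.0)
    for ((cat, row) <- rows; j <- row.indices) acc(cat)(j) += row(j)
    acc
  }
}
```

The engine-neutral version in math-scala and a reduceByKey-based Spark override would both have to produce exactly this result, which is what makes the shared test-suite approach from the thread workable.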
> We could either introduce a reduce/aggregate operator in as engine-neutral and close-to-algebraic a way as possible, or keep any kind of reduction/aggregation phase of an operation backend-specific (which kind of sucks).
>
> Thanks
>
>>> Subject: Re: drmFromHDFS rowLabelBindings question
>>> From: [email protected]
>>> Date: Fri, 12 Sep 2014 14:41:35 -0700
>>> To: [email protected]
>>>
>>> Not sure if this helps, but we (Sebastian and I) created an IndexedDataset which maintains row and column HashBiMaps that use the Int key to map to/from Strings. There are Reader and Writer traits for file I/O (text files for now). The flow is to read an IndexedDataset using the Reader trait. Inside the IndexedDataset you have a CheckpointedDrm and two label BiMaps for rows and columns. This method is used in the row and item similarity jobs, where you do math things like B.t %*% A. After you do the math using the drm contained in the IndexedDataset, you assign the correct dictionaries to the resulting IndexedDataset to maintain your labels for writing or further math. It might make sense to implement some of the math ops that would work with this simple approach, but in any case you can do it explicitly, as those jobs do. The idea was to support other file formats, like sequence files, as the need comes up.
>>>
>>> On Sep 12, 2014, at 1:14 PM, Andrew Palumbo <[email protected]> wrote:
>>>
>>> It doesn't look like it has anything to do with the conversion.
>>>
>>> After:
>>>
>>> val rowBindings = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
>>>
>>> rowBindings.size is one.
>>>
>>> From: [email protected]
>>> To: [email protected]
>>> Subject: RE: drmFromHDFS rowLabelBindings question
>>> Date: Fri, 12 Sep 2014 15:53:48 -0400
>>>
>>> Thanks guys, I was wondering about the java.util.Map conversion too. I'll try copying everything into a java.util.HashMap and passing that to setRowBindings.
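[Editor's note.] One guess at why rowBindings.size comes out as one, not confirmed anywhere in this thread: Scala's toMap silently keeps only the last value for a duplicated key, so if t._1._1.toString produces the same string for every row, the bindings collapse to a single entry. A self-contained repro of that behavior, with an invented RowKey class whose toString is deliberately degenerate:

```scala
// Demonstrates toMap collapsing duplicate keys, one possible cause of the
// single-entry rowBindings. RowKey is invented for the repro; its toString
// ignores the id, as a broken key toString might.
object ToMapCollapseRepro {
  case class RowKey(id: Int) { override def toString: String = "row" }

  // Simulated collected rows whose key component stringifies identically.
  val d = Seq((RowKey(0), 10), (RowKey(1), 20), (RowKey(2), 30))

  // Mirrors the shape of the line from the thread: stringify the key,
  // box the value, build a Map.
  val rowBindings: Map[String, java.lang.Integer] =
    d.map(t => (t._1.toString, Int.box(t._2))).toMap

  // rowBindings.size is 1: all three pairs share the key "row",
  // and toMap keeps only the last value (30).
}
```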
>>> I'll play around with it, and if I can't get it to work I'll file a JIRA.
>>>
>>> I'm just using it in the NB implementation, so it's not a pressing issue.
>>>
>>> Appreciate it.
>>>
>>>> Date: Fri, 12 Sep 2014 12:35:21 -0700
>>>> Subject: Re: drmFromHDFS rowLabelBindings question
>>>> From: [email protected]
>>>> To: [email protected]
>>>>
>>>> On Fri, Sep 12, 2014 at 12:17 PM, Anand Avati <[email protected]> wrote:
>>>>
>>>>> On Fri, Sep 12, 2014 at 12:00 PM, Anand Avati <[email protected]> wrote:
>>>>>
>>>>>> On Fri, Sep 12, 2014 at 11:57 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>
>>>>>>> But if you are really compelled that it is something that might be needed, the best way would probably indeed be to create an optional parameter to collect (something like drmLike.collect(extractLabels: Boolean = false)) which you can flip to true if needed; the thing then does toString on the keys and assigns them to the in-core matrix's row labels. (Requires a patch, of course.)
>>>>>>
>>>>>> As I mentioned in the other mail, this is already the case. The code seems to assume .toMap internally does a collect. My (somewhat wild) suspicion is that this line is somehow fooling the eye:
>>>>>>
>>>>>> val rowBindings = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
>>>>>>
>>>>> Argh, for a moment I was thinking `d` is still an RDD. It is actually all in-core, as the entirety of the RDD is collected up front into `data`. In any case I suspect the non-Int-key collecting code might be doing something funny.
>>>>
>>>> One problem I see is that toMap() returns a scala.collection.Map, whereas the next line, m.setRowLabelBindings, accepts a java.util.Map.
>>>> Since the code compiles fine there is probably an implicit conversion happening somewhere, and I don't know if the conversion is doing the right thing. Other than this, the rest of the code seems to look fine.
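[Editor's note.] The Scala-to-Java Map question at the end is easy to take the guesswork out of: rather than relying on an in-scope implicit conversion (scala.collection.JavaConversions, in the Scala of that era), converting explicitly with JavaConverters' asJava makes the handoff visible. A sketch with an invented stand-in for setRowLabelBindings:

```scala
import scala.collection.JavaConverters._

object MapConversionSketch {
  // Stand-in for Matrix.setRowLabelBindings, which takes a java.util.Map;
  // here it just reports the size it received.
  def setRowLabelBindings(bindings: java.util.Map[String, java.lang.Integer]): Int =
    bindings.size()

  val scalaBindings: Map[String, java.lang.Integer] =
    Map("row0" -> Int.box(0), "row1" -> Int.box(1))

  // Explicit asJava: the Java side receives a java.util.Map view of the
  // Scala map, with no implicit wrapping to wonder about.
  val n: Int = setRowLabelBindings(scalaBindings.asJava)
}
```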
