When I said Combiner, I meant a makeshift combiner using mapBlock. It looks
something like this:

    // encodedCategoryByRowIndexVector is a Vector of Int-encoded
    // categories where each index corresponds to a row in the DRM.

    val ncategories = categoryIndex.toInt
    val bcastEncodedCategoryByRowVector =
      drmBroadcast(encodedCategoryByRowIndexVector)

    val aggregatedObservationByLabelDrm =
      intKeyedObservations.t.mapBlock(ncol = ncategories) {
        case (keys, blockA) =>
          val blockB = blockA.like(keys.size, ncategories)
          // val blockB = new SparseRowMatrix(keys.size, ncategories)
          // dereference the broadcast inside the closure
          val encodedCategoryByRow = bcastEncodedCategoryByRowVector.value
          for (i <- 0 until keys.size) {
            // todo: should probably iterate nonZeroes here as well
            for (j <- 0 until blockA.ncol) {
              val category = encodedCategoryByRow.get(j).toInt
              blockB.setQuick(i, category,
                blockB.get(i, category) + blockA.get(i, j))
            }
          }
          keys -> blockB
      }

    aggregatedObservationByLabelDrm.t
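For a quick sanity check of the aggregation logic itself, here is a hypothetical standalone sketch of the same idea on plain Scala arrays (no Mahout types; the names `aggregate`, `categoryByRow`, etc. are made up for illustration): each observation row gets summed into the output row for its Int-encoded category.

```scala
object AggregateByCategory {
  // rows: observations as dense vectors; categoryByRow(i) is the
  // Int-encoded label of row i; ncategories is the number of labels.
  // Returns a (ncategories x ncol) matrix of per-category sums.
  def aggregate(rows: Array[Array[Double]],
                categoryByRow: Array[Int],
                ncategories: Int): Array[Array[Double]] = {
    val ncol = rows(0).length
    val out = Array.fill(ncategories, ncol)(0.0)
    for (i <- rows.indices; j <- 0 until ncol) {
      out(categoryByRow(i))(j) += rows(i)(j)
    }
    out
  }

  def main(args: Array[String]): Unit = {
    val rows = Array(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0))
    val cats = Array(0, 1, 0) // rows 0 and 2 share label 0
    val agg = aggregate(rows, cats, 2)
    assert(agg(0).sameElements(Array(6.0, 8.0)))
    assert(agg(1).sameElements(Array(3.0, 4.0)))
  }
}
```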

I'm heading out now and I'll try to put a PR up tomorrow. I'm not even sure if
this works yet; I'm just sketching it out to get something working. Then we can
easily override this and do the I/O-specific work in Spark/H2O/Flink, doing the
work in the best way for each of those engines.

I don't know - maybe this should be included in the I/O. It really is where
much of the parallelized heavy lifting is done in NB training.





> Date: Fri, 12 Sep 2014 16:53:11 -0700
> Subject: Re: drmFromHDFS rowLabelBindings question
> From: [email protected]
> To: [email protected]
> 
> >
> > Note that there is no way (yet) to perform aggregate or reduce like
> > operation through the DSL. Though the backends (both spark and h2o) support
> > reduce-like operations, there is no DSL operator for that yet. We could
> > either introduce a reduce/aggregate operator in as engine neutral/close to
> >
> 
> we already discussed that. Big NO
> 
> (1) Engines differ in shuffle task capabilities and specifics. A LOT. It is
> my belief that finding a common denominator here is the way down a rat hole
> with no real bottom. mapBlock(), which translates to a map task, is the only
> clean exception, and actually pretty useful as well.
> 
> (2) We are for R-Like algebra, not functional programming.
> 
> (3) Mixing in non-algebraic primitives will break laws of algebraic
> optimization. (Well, mapBlock(), binds and splits kinda do today and are
> de-facto checkpoints; it would take really a lot to optimize over them,
> although it is definitely possible in a lot of situations.)
> 
> (4) No need. This is probably the most compelling reason.
> we do expect quasi-algebraic methods to be inevitable anyway, so one is to
> use the `rdd` property and do whatever his heart desires, with full engine
> caps. Most methods do just that, happily enough. Actually, all my methods
> are quasi-algebraic. Instead of trying to standardize everything, we are
> saying things are going to be quasi, in which case clean component
> separation (in the OOA sense, think Strategy and perhaps Visitor patterns)
> of quasi things and algebraic expressions would go a long way to alleviate
> porting non-algebraic parts to specific engines. In that sense, Pat's
> stuff does not adhere to these patterns, so I imagine it would be pretty
> difficult to port it to e.g. Flink.
> 
> 
> algebraic way as possible, or keep any kind of reduction/aggregate phase of
> > operation backend specific (which kind of sucks)
> >
> 
> 
> >
> > Thanks
> >
> >
> >
> > > > Subject: Re: drmFromHDFS rowLabelBindings question
> > > > From: [email protected]
> > > > Date: Fri, 12 Sep 2014 14:41:35 -0700
> > > > To: [email protected]
> > > >
> > > > Not sure if this helps, but we (Sebastian and I) created an
> > > > IndexedDataset which maintains row and column HashBiMaps that use the
> > > > Int key to map to/from Strings. There are Reader and Writer traits for
> > > > file IO (text files for now). The flow is to read an IndexedDataset
> > > > using the Reader trait. Inside the IndexedDataset you have a
> > > > CheckpointedDrm and two label BiMaps for rows and columns. This method
> > > > is used in the row and item similarity jobs where you do math things
> > > > like B.t %*% A. After you do the math using the drm contained in the
> > > > IndexedDataset, you assign the correct dictionaries to the resulting
> > > > IndexedDataset to maintain your labels for writing or further math. It
> > > > might make sense to implement some of the math ops that would work
> > > > with this simple approach, but in any case you can do it explicitly as
> > > > those jobs do. The idea was to support other file formats like
> > > > sequence files as the need comes up.
> > > >
> > > > On Sep 12, 2014, at 1:14 PM, Andrew Palumbo <[email protected]>
> > wrote:
> > > >
> > > > It doesn't look like it has anything to do with the conversion.
> > > >
> > > > after:
> > > >
> > > >    val rowBindings = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
> > > >
> > > > rowBindings.size is one
> > > >
> > > > From: [email protected]
> > > > To: [email protected]
> > > > Subject: RE: drmFromHDFS rowLabelBindings question
> > > > Date: Fri, 12 Sep 2014 15:53:48 -0400
> > > >
> > > >
> > > >
> > > >
> > > > Thanks guys, I was wondering about the java.util.Map conversion too.
> > > > I'll try copying everything into a java.util.HashMap and passing that
> > > > to setRowBindings. I'll play around with it and if I can't get it to
> > > > work, I'll file a jira.
> > > >
> > > > I'm just using it in the NB implementation so it's not a pressing
> > > > issue.
> > > >
> > > > Appreciate it.
> > > >
> > > > > Date: Fri, 12 Sep 2014 12:35:21 -0700
> > > > > Subject: Re: drmFromHDFS rowLabelBindings question
> > > > > From: [email protected]
> > > > > To: [email protected]
> > > > >
> > > > > On Fri, Sep 12, 2014 at 12:17 PM, Anand Avati <[email protected]>
> > > wrote:
> > > > >
> > > > >>
> > > > >>
> > > > >> On Fri, Sep 12, 2014 at 12:00 PM, Anand Avati <[email protected]>
> > > wrote:
> > > > >>
> > > > >>>
> > > > >>>
> > > > >>> On Fri, Sep 12, 2014 at 11:57 AM, Dmitriy Lyubimov <
> > > [email protected]>
> > > > >>> wrote:
> > > > >>>
> > > > >>>> but if you are really compelled that it is something that might
> > > > >>>> be needed, the best way would probably be indeed to create an
> > > > >>>> optional parameter to collect (something like
> > > > >>>> drmLike.collect(extractLabels: Boolean = false)) which you can
> > > > >>>> flip to true if needed, and the thing does toString on keys and
> > > > >>>> assigns them to the in-core matrix's row labels. (requires a
> > > > >>>> patch of course)
> > > > >>>>
> > > > >>>>
> > > > >>> As I mentioned in the other mail, this is already the case. The
> > > > >>> code seems to assume .toMap internally does collect. My (somewhat
> > > > >>> wild) suspicion is that this line is somehow fooling the eye:
> > > > >>>
> > > > >>> val rowBindings = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >> Argh, for a moment I was thinking `d` is still an rdd. It is
> > > > >> actually all in-core, as the entirety of the rdd is collected up
> > > > >> front into `data`. In any case I suspect the non-int key collecting
> > > > >> code might be doing something funny.
> > > > >>
> > > > >
> > > > > One problem I see is that toMap() returns a scala.collection.Map,
> > > > > whereas the next line, m.setRowLabelBindings, accepts a
> > > > > java.util.Map. Since the code compiles fine there is probably an
> > > > > implicit conversion happening somewhere, and I don't know if the
> > > > > conversion is doing the right thing. Other than this, the rest of
> > > > > the code seems fine.
> > > >
> > > >
> > >
> > >
> > >
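On the scala/java Map conversion question raised above: rather than relying on whatever implicit conversion is (or isn't) in scope, the bindings can be copied explicitly into a java.util.HashMap, as suggested in the thread. A minimal sketch, assuming hypothetical stand-in bindings (the real ones would come from the keys of `d`):

```scala
import scala.collection.JavaConverters._

object RowBindingsConversion {
  // Copy a Scala Map into a concrete java.util.HashMap, the type
  // that setRowLabelBindings-style APIs expect.
  def toJavaBindings(
      bindings: Map[String, java.lang.Integer]): java.util.Map[String, java.lang.Integer] =
    new java.util.HashMap[String, java.lang.Integer](bindings.asJava)

  def main(args: Array[String]): Unit = {
    // Hypothetical row label bindings standing in for the real ones.
    val rowBindings = Map("row0" -> Int.box(0), "row1" -> Int.box(1))
    val javaBindings = toJavaBindings(rowBindings)
    assert(javaBindings.size() == 2)
    assert(javaBindings.get("row0") == 0)
  }
}
```

Copying into a HashMap (instead of passing the `asJava` view directly) also rules out any surprises from the wrapper's behavior.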
