oops, sent before getting up to date. So we’ll move this stuff into math-scala. Works for me.
On Sep 15, 2014, at 11:28 AM, Pat Ferrel <[email protected]> wrote:

I need to clean up some of this as far as the packaging goes. The base reader/writer mostly-abstract traits are engine neutral, along with IndexedDataset. These are clearly not math. I could even add the seqfile reader/writer pretty easily into the same class/trait packaging, since they are already implemented in math. In order to mix in with legacy code there will be a need for sequence file readers and writers, but moving forward, especially since intermediate results are generally not put into files, isn't text a better way to go? It's really only for import/export.

Should we create a core-scala? I'd be up for that. I'd move cf/cooccurrence, the reader/writer base, schemas and the defaults, IndexedDataset, MahoutDriver, and MahoutParser there, leaving only the Spark-implementing code in spark.

On Sep 13, 2014, at 12:18 PM, Andrew Palumbo <[email protected]> wrote:

> All the discussion about engine-neutral and engine-specific bits is only going to come up more and more. Dmitriy speaks for the neutrality of "math", by which I take it to mean "math-scala" and stuff in the DSL. Maybe engine-neutral bits that don't fit there can be put in another module to save fighting over it. I once proposed "core-scala". For that matter, cooccurrence isn't really math or DSL (maybe that's what D means by quasi) and so might be better put in core-scala too. Inclusion means the code uses but does not extend the DSL, and the pom doesn't include an engine.
>
> I think that this makes sense if we want to put the engine-neutral sections of our underlying classifier/clustering/recommender algorithms, and maybe some I/O traits, into a separate module and keep the DSL and the more purely algebraic algos separate in mahout-math. Then we can just mirror the packages and extend them (and their test suites, as Dmitriy did with the math-scala tests) into the h2o/spark/flink modules.
Then we can do the as-needed engine-specific optimization and engine-specific I/O there, the same as what is being done now with mahout-math. I would think that it would be important to have complete algorithms (with empty I/O traits?) in the engine-neutral packages, which is why I've been trying to implement the clunky extractLabelsAndAggregateObservations method in naive Bayes for math-scala in an engine-agnostic way.

> On Sep 12, 2014, at 6:44 PM, ap.dev <[email protected]> wrote:
>
> Oh thx- I thought IndexedDatasets were Spark-specific.
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
> -------- Original message --------
> From: Pat Ferrel <[email protected]>
> Date: 09/12/2014 7:52 PM (GMT-05:00)
> To: [email protected]
> Subject: Re: drmFromHDFS rowLabelBindings question
>
> The serialization can be in engine-specific modules, as with cooccurrence and ItemSimilarity: cooccurrence is in math-scala, ItemSimilarity is the engine-specific driver. There is nothing engine-specific about IndexedDatasets, and an optimization that has not been made yet is to allow one or no dictionaries where the keys suffice.
>
> Not sure what you want for initial input, but you could start with a driver in the engine-specific spark module, read in the IndexedDataset, then pass it to your math code, work with the CheckpointedDrm using the DSL and dictionary, and when done return an IndexedDataset to the driver for serialization.
>
> There's also no reason that the serialization couldn't also be implemented in H2O; in fact I think it would be easier, since they have richer text file types than Spark.
>
> Anand's point about reducers is going to require either divergence or more engine-neutral abstractions. I think serialization is in the same boat.
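[Editor's sketch.] To make the core-scala proposal above concrete, here is a minimal sketch of the engine-neutral surface being discussed. Only the names IndexedDataset, Reader, Writer, and Schema come from the thread; the trait signatures and the plain Maps standing in for Guava HashBiMaps are assumptions, not the actual Mahout code.

```scala
// Hypothetical sketch of the engine-neutral pieces proposed for core-scala.
// Plain Maps stand in for the HashBiMaps the real IndexedDataset uses, and
// the matrix type M is left generic so no engine is referenced.
object CoreScalaSketch {

  // A schema is a bag of options describing a text import/export format.
  case class Schema(delimiter: String = "\t", omitScore: Boolean = false)

  // Engine-neutral dataset: a matrix handle plus row/column dictionaries
  // mapping external String IDs to/from internal Int keys.
  case class IndexedDataset[M](
      matrix: M,
      rowIDs: Map[String, Int],
      columnIDs: Map[String, Int])

  // Engine-neutral I/O traits; a spark or h2o module supplies the concrete
  // reading/writing against its own distributed matrix type M.
  trait Reader[M] {
    def readFrom(source: String, schema: Schema): IndexedDataset[M]
  }

  trait Writer[M] {
    def writeTo(dataset: IndexedDataset[M], dest: String, schema: Schema): Unit
  }
}
```

A Spark module would then extend Reader/Writer with a CheckpointedDrm as M, without the traits themselves ever naming an engine.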
> On Sep 12, 2014, at 4:31 PM, Anand Avati <[email protected]> wrote:
>
> On Fri, Sep 12, 2014 at 4:12 PM, Andrew Palumbo <[email protected]> wrote:
>
>> Thanks- I've been looking at that a bit. It probably would make things a whole lot easier, but I'm working on naive Bayes and trying to keep it in the math-scala package (I don't know how well this is going to work because I haven't made my way to model serialization yet).
>>
>> Thinking more of it, though, using an IndexedDataset might make online training/updating of the weights a whole lot easier if we end up implementing that.
>>
>> Also I think that an IndexedDataset will probably be useful for classifying new documents, where we do need to keep the dictionary in memory.
>>
>> Right now, I just need the labels up front in a vector so that I can extract the category and broadcast a categoryByRowindex Vector out to a combiner using something like:
>>
>> IntKeyedTFIDFDrm.t.mapBlock(ncols = numcategories) {
>>   // aggregate cols by category
>> }.t
>>
>> After that we only need a relatively small Vector or Map of rows (categories) and don't need column labels as long as we're using seq2sparse. It may make sense, though, to use something like an IndexedDataset here in the future if we want to move away from seq2sparse in its current implementation.
>>
>> I'm honestly not sure how well this label extraction and aggregation is going to turn out performance-wise, but my thinking was that we can put an implementation in math-scala and then extend and optimize it in spark if we want, i.e. rather than writing a combiner using mapBlock, use Spark's reduceByKey.
>
> Note that there is no way (yet) to perform an aggregate- or reduce-like operation through the DSL. Though the backends (both Spark and H2O) support reduce-like operations, there is no DSL operator for that yet.
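[Editor's sketch.] Outside any engine, the combiner Andrew describes above (aggregate the rows of a TF-IDF matrix by category) reduces to summing row vectors per category index. A minimal in-core sketch, with plain arrays standing in for DRM blocks and the object/method names invented for illustration:

```scala
// In-core sketch of per-category row aggregation. In the DSL this would be
// a mapBlock pass over the transpose (or reduceByKey in a Spark-specific
// version); a simple accumulation loop stands in for both here.
object CategoryAggregateSketch {
  // rows: (categoryIndex, rowVector) pairs; returns a numCategories x ncols
  // matrix whose row c is the element-wise sum of all input rows labeled c.
  def aggregateByCategory(rows: Seq[(Int, Array[Double])],
                          numCategories: Int): Array[Array[Double]] = {
    val ncols = rows.head._2.length
    val acc = Array.fill(numCategories, ncols)(0.0)
    for ((cat, row) <- rows; j <- row.indices) acc(cat)(j) += row(j)
    acc
  }
}
```

The engine-neutral version in math-scala and a reduceByKey-based Spark override would both have to produce exactly this result, which is what makes the shared test-suite approach from the thread workable.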
> We could either introduce a reduce/aggregate operator in as engine-neutral and close-to-algebraic a way as possible, or keep any kind of reduction/aggregation phase of an operation backend-specific (which kind of sucks).
>
> Thanks
>
>>> Subject: Re: drmFromHDFS rowLabelBindings question
>>> From: [email protected]
>>> Date: Fri, 12 Sep 2014 14:41:35 -0700
>>> To: [email protected]
>>>
>>> Not sure if this helps, but we (Sebastian and I) created an IndexedDataset which maintains row and column HashBiMaps that use the Int key to map to/from Strings. There are Reader and Writer traits for file I/O (text files for now). The flow is to read an IndexedDataset using the Reader trait. Inside the IndexedDataset you have a CheckpointedDrm and two label BiMaps for rows and columns. This method is used in the row and item similarity jobs, where you do math things like B.t %*% A. After you do the math using the drm contained in the IndexedDataset, you assign the correct dictionaries to the resulting IndexedDataset to maintain your labels for writing or further math. It might make sense to implement some of the math ops that would work with this simple approach, but in any case you can do it explicitly, as those jobs do. The idea was to support other file formats, like sequence files, as the need comes up.
>>>
>>> On Sep 12, 2014, at 1:14 PM, Andrew Palumbo <[email protected]> wrote:
>>>
>>> It doesn't look like it has anything to do with the conversion.
>>>
>>> After:
>>>
>>> val rowBindings = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
>>>
>>> rowBindings.size is one.
>>>
>>> From: [email protected]
>>> To: [email protected]
>>> Subject: RE: drmFromHDFS rowLabelBindings question
>>> Date: Fri, 12 Sep 2014 15:53:48 -0400
>>>
>>> Thanks guys, I was wondering about the java.util.Map conversion too. I'll try copying everything into a java.util.HashMap and passing that to setRowBindings.
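[Editor's note.] One guess at why rowBindings.size comes out as one, not confirmed anywhere in this thread: Scala's toMap silently keeps only the last value for a duplicated key, so if t._1._1.toString produces the same string for every row, the bindings collapse to a single entry. A self-contained repro of that behavior, with an invented RowKey class whose toString is deliberately degenerate:

```scala
// Demonstrates toMap collapsing duplicate keys, one possible cause of the
// single-entry rowBindings. RowKey is invented for the repro; its toString
// ignores the id, as a broken key toString might.
object ToMapCollapseRepro {
  case class RowKey(id: Int) { override def toString: String = "row" }

  // Simulated collected rows whose key component stringifies identically.
  val d = Seq((RowKey(0), 10), (RowKey(1), 20), (RowKey(2), 30))

  // Mirrors the shape of the line from the thread: stringify the key,
  // box the value, build a Map.
  val rowBindings: Map[String, java.lang.Integer] =
    d.map(t => (t._1.toString, Int.box(t._2))).toMap

  // rowBindings.size is 1: all three pairs share the key "row",
  // and toMap keeps only the last value (30).
}
```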
>>> I'll play around with it, and if I can't get it to work I'll file a JIRA.
>>>
>>> I'm just using it in the NB implementation, so it's not a pressing issue.
>>>
>>> Appreciate it.
>>>
>>>> Date: Fri, 12 Sep 2014 12:35:21 -0700
>>>> Subject: Re: drmFromHDFS rowLabelBindings question
>>>> From: [email protected]
>>>> To: [email protected]
>>>>
>>>> On Fri, Sep 12, 2014 at 12:17 PM, Anand Avati <[email protected]> wrote:
>>>>
>>>>> On Fri, Sep 12, 2014 at 12:00 PM, Anand Avati <[email protected]> wrote:
>>>>>
>>>>>> On Fri, Sep 12, 2014 at 11:57 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>
>>>>>>> But if you are really compelled that it is something that might be needed, the best way would probably indeed be to create an optional parameter to collect (something like drmLike.collect(extractLabels: Boolean = false)) which you can flip to true if needed; the thing then does toString on the keys and assigns them to the in-core matrix's row labels. (Requires a patch, of course.)
>>>>>>
>>>>>> As I mentioned in the other mail, this is already the case. The code seems to assume .toMap internally does a collect. My (somewhat wild) suspicion is that this line is somehow fooling the eye:
>>>>>>
>>>>>> val rowBindings = d.map(t => (t._1._1.toString, t._2: java.lang.Integer)).toMap
>>>>>>
>>>>> Argh, for a moment I was thinking `d` is still an RDD. It is actually all in-core, as the entirety of the RDD is collected up front into `data`. In any case I suspect the non-Int-key collecting code might be doing something funny.
>>>>
>>>> One problem I see is that toMap() returns a scala.collection.Map, whereas the next line, m.setRowLabelBindings, accepts a java.util.Map.
>>>> Since the code compiles fine there is probably an implicit conversion happening somewhere, and I don't know if the conversion is doing the right thing. Other than this, the rest of the code seems to look fine.
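[Editor's note.] The Scala-to-Java Map question at the end is easy to take the guesswork out of: rather than relying on an in-scope implicit conversion (scala.collection.JavaConversions, in the Scala of that era), converting explicitly with JavaConverters' asJava makes the handoff visible. A sketch with an invented stand-in for setRowLabelBindings:

```scala
import scala.collection.JavaConverters._

object MapConversionSketch {
  // Stand-in for Matrix.setRowLabelBindings, which takes a java.util.Map;
  // here it just reports the size it received.
  def setRowLabelBindings(bindings: java.util.Map[String, java.lang.Integer]): Int =
    bindings.size()

  val scalaBindings: Map[String, java.lang.Integer] =
    Map("row0" -> Int.box(0), "row1" -> Int.box(1))

  // Explicit asJava: the Java side receives a java.util.Map view of the
  // Scala map, with no implicit wrapping to wonder about.
  val n: Int = setRowLabelBindings(scalaBindings.asJava)
}
```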
