[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820475#comment-13820475 ]

Dmitriy Lyubimov edited comment on MAHOUT-1346 at 11/12/13 9:50 PM:
--------------------------------------------------------------------

https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala

I started moving some things there. In particular, ALS is still not there 
(still haven't hashed it out with my boss), but there are some initial matrix 
algorithms to be picked up (even transposition can be blockified and improved); 
see the sketch below.
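To show what I mean by the transposition remark (plain-Scala stand-in, not the
actual DRM or RDD types): a naive transpose shuffles one record per non-zero
element, whereas a blockified variant would shuffle one record per
(row-block, column-block) pair:

    // Naive element-wise transpose over sparse rows keyed by row index.
    // A blockified variant would regroup whole sub-blocks instead of
    // single elements, cutting the shuffle volume considerably.
    type SparseRow = (Int, Map[Int, Double])
    def transpose(rows: Seq[SparseRow]): Seq[SparseRow] =
      rows.flatMap { case (i, v) => v.map { case (j, x) => (j, (i, x)) } }
          .groupBy(_._1)
          .map { case (j, es) => (j, es.map(_._2).toMap) }
          .toSeq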

Anyone want to give me a hand on this?

Please don't pick up weighted ALS-WR for now; I still hope to finish porting it. 

There are more interesting questions there, like parameter validation and 
fitting. A common problem I have: suppose you take the implicit feedback 
approach. Then you reformulate it in terms of preference (P) and confidence (C) 
inputs. The original paper describes a specific scheme for forming C that 
includes one parameter they want to fit. 
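For concreteness, a minimal sketch of that one-parameter scheme (assuming the
usual c = 1 + alpha*r confidence from the implicit-feedback paper; the function
name is made up for illustration, not Mahout API):

    // Derive (P, C) from a raw observation count r; alpha is the single
    // parameter the paper wants to fit.
    def toPrefConf(r: Double, alpha: Double): (Double, Double) = {
      val p = if (r > 0) 1.0 else 0.0 // binarized preference
      val c = 1.0 + alpha * r         // confidence grows with evidence
      (p, c)
    }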

A more interesting question is: what if we have more than one parameter? I.e., 
what if we have a whole bunch of user behaviors, say, item search, browse, 
click, add2cart, and finally, acquisition. That's a whole bunch of parameters 
for forming confidence in a user's preference. E.g., since every transaction 
is preceded by add2cart, it is reasonable to assume that add2cart signifies a 
positive preference in general (we are just far less confident about it). Then 
again, an abandoned cart may also signify a negative preference, or nothing at 
all.
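One way to parameterize that (event names and weights here are purely
hypothetical, just to show the shape of the problem): give every behavior type
its own confidence weight, and those weights become exactly the parameters
we'd have to fit:

    // Per-event confidence weights w; the weight vector is what
    // cross-validation would have to fit. A strongly negative signal such
    // as an abandoned cart could instead flip the preference P itself.
    def confidence(counts: Map[String, Int], w: Map[String, Double]): Double =
      1.0 + counts.map { case (e, n) => w.getOrElse(e, 0.0) * n }.sum

    val counts = Map("search" -> 1, "browse" -> 3, "click" -> 2, "add2cart" -> 1)
    val w = Map("search" -> 0.1, "browse" -> 0.2, "click" -> 0.5, "add2cart" -> 1.0)
    val c = confidence(counts, w) // = 1 + 0.1 + 0.6 + 1.0 + 1.0 = 3.7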

Anyway, suppose we want to explore what's worth what. The natural way to do it 
is, again, through cross-validation. Posing such a problem presents a whole new 
look at "Big Data ML" problems: now we are using distributed processing not 
just because the input might be so big, but also because we have a lot of 
parameter space exploration to do (even if the single-iteration problem is not 
so big). And thus we produce more interesting analytical results.
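A rough sketch of what that could look like on the Spark side (crossValidate
here is a hypothetical stand-in for one full cross-validation run at a given
parameter vector, not an existing API):

    import org.apache.spark.SparkContext

    // Each grid point is an independent CV run, so the exploration itself
    // parallelizes trivially, whatever the size of a single run.
    def explore(sc: SparkContext,
                grid: Seq[Array[Double]],
                crossValidate: Array[Double] => Double): (Array[Double], Double) =
      sc.parallelize(grid)
        .map(theta => (theta, crossValidate(theta)))
        .collect()
        .maxBy(_._2)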

However, since there are many parameters, the task becomes quite a bit more 
interesting. Since there is not so much test data (we should still assume we 
will have just a handful of cross-validation runs), various "online" convex 
search techniques like SGD or BFGS are not going to be very viable. What I was 
thinking of: maybe we can start running parallel tries and fit the data to 
paraboloids (i.e. second-degree polynomial regression without interaction 
terms). That might be a big assumption, but it would be enough to get a 
general sense of where the global maximum may be, even on inputs of a fairly 
small size. Of course, we may discover hyperbolic-paraboloid properties along 
some parameter axes, in which case it would mean we got the preference wrong, 
so we flip the preference mapping (i.e. click = (P=1, C=0.5) would flip into 
click = (P=0, C=0...)) and re-validate again. This is kind of a 
multidimensional variation of the one-parameter second-degree polynomial 
fitting that Raphael referred to once. 
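A minimal sketch of that per-axis fit (pure Scala, no Mahout API assumed):
least-squares y ~ a + b*x + c*x^2 via the 3x3 normal equations. c < 0 means a
maximum at x* = -b/(2c); c > 0 along some axis is the hyperbolic-paraboloid
signal to flip that parameter's preference mapping:

    // Fit y ~ a + b*x + c*x^2 by least squares (normal equations solved
    // with Cramer's rule); returns the coefficients (a, b, c).
    def fitParabola(xs: Array[Double], ys: Array[Double]): (Double, Double, Double) = {
      val n   = xs.length.toDouble
      val sx  = xs.sum
      val sx2 = xs.map(x => x * x).sum
      val sx3 = xs.map(x => x * x * x).sum
      val sx4 = xs.map(x => x * x * x * x).sum
      val sy   = ys.sum
      val sxy  = (xs, ys).zipped.map(_ * _).sum
      val sx2y = (xs, ys).zipped.map((x, y) => x * x * y).sum
      def det3(m: Array[Array[Double]]): Double =
        m(0)(0) * (m(1)(1) * m(2)(2) - m(1)(2) * m(2)(1)) -
        m(0)(1) * (m(1)(0) * m(2)(2) - m(1)(2) * m(2)(0)) +
        m(0)(2) * (m(1)(0) * m(2)(1) - m(1)(1) * m(2)(0))
      val a   = Array(Array(n, sx, sx2), Array(sx, sx2, sx3), Array(sx2, sx3, sx4))
      val rhs = Array(sy, sxy, sx2y)
      def sub(col: Int) = a.zipWithIndex.map { case (row, i) =>
        row.zipWithIndex.map { case (v, j) => if (j == col) rhs(i) else v } }
      val d = det3(a)
      (det3(sub(0)) / d, det3(sub(1)) / d, det3(sub(2)) / d)
    }

If the fitted c comes back positive on some axis, flip that parameter's (P, C) 
mapping as described above and re-run the validation.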

We are taking on a lot of assumptions here (parameter independence, existence 
of a good global maximum, etc., etc.). Perhaps there's a better way to 
automate that search? 

Thanks.
-Dmitriy


> Spark Bindings (DRM)
> --------------------
>
>                 Key: MAHOUT-1346
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.8
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> Spark bindings for Mahout DRM. 
> DRM DSL. 
> Disclaimer. This will all be experimental at this point.
> The idea is to wrap DRM by Spark RDD with support of some basic 
> functionality, perhaps some humble beginning of Cost-based optimizer 
> (0) Spark serialization support for Vector, Matrix 
> (1) Bagel transposition 
> (2) slim X'X
> (2a) not-so-slim X'X
> (3) blockify() (compose RDD containing vertical blocks of original input)
> (4) read/write Mahout DRM off HDFS
> (5) A'B
> ...



--
This message was sent by Atlassian JIRA
(v6.1#6144)
