[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820475#comment-13820475 ]
Dmitriy Lyubimov edited comment on MAHOUT-1346 at 11/12/13 9:31 PM:
--------------------------------------------------------------------
https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala

I started moving some things there. In particular, ALS is not there yet (I still haven't hashed it out with my boss), but there are some initial matrix algorithms to be picked up (even transposition can be blockified and improved). Anyone want to give me a hand with this? Please don't pick up weighted ALS-WR for now; I still hope to finish porting it.

There are more interesting questions here, like parameter validation and fitting. A common problem: suppose you take the implicit feedback approach and reformulate the input in terms of preference (P) and confidence (C). The original paper describes a specific scheme for forming C that includes one parameter they want to fit. A more interesting question is: what if we have more than one parameter? That is, what if we have a whole range of user behaviors, say item search, browse, click, add2cart, and finally acquisition? That's a whole set of parameters for forming confidence in a user's preference. For example, since every transaction is preceded by add2cart, it is reasonable to assume that add2cart signifies a positive preference in general (we are just far less confident about it). Then again, an abandoned cart may signify a negative preference, or nothing at all.

Anyway, suppose we want to explore what's worth what. The natural way to do it is, again, through cross-validation. Posing such a problem presents a whole new look at "Big Data ML" problems: now we are using distributed processing not just because the input might be so big, but also because we have a lot of parameter space to explore (even if the single-iteration problem is not so big), and we can thus produce more interesting analytical results. However, since there are many parameters, the task becomes fairly less trivial.
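The multi-parameter confidence idea above could be sketched as a simple event-to-(P, C) mapping. This is a hypothetical standalone sketch, not Mahout API: the `ConfidenceMapping` class, the event names, and all weight values are illustrative placeholders. It generalizes the single-parameter implicit-feedback scheme (c = 1 + alpha * r) to one tunable weight per behavior type, and those weights are exactly the parameters that would be fitted by cross-validation:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConfidenceMapping {

    // One tunable (preference, confidence-weight) pair per behavior type.
    // The numbers below are illustrative placeholders -- these are the
    // parameters the comment proposes to fit by cross-validation.
    static final Map<String, double[]> WEIGHTS = new HashMap<>();
    static {
        WEIGHTS.put("search",      new double[] {1.0, 0.1});
        WEIGHTS.put("browse",      new double[] {1.0, 0.2});
        WEIGHTS.put("click",       new double[] {1.0, 0.5});
        WEIGHTS.put("add2cart",    new double[] {1.0, 1.0});
        WEIGHTS.put("acquisition", new double[] {1.0, 5.0});
    }

    /**
     * Aggregate one user's events on one item into a (P, C) cell:
     * preference is the strongest observed preference, and confidence
     * accumulates as c = 1 + sum of event weights, generalizing the
     * one-parameter c = 1 + alpha * r scheme.
     */
    static double[] cell(List<String> events) {
        double p = 0.0, c = 0.0;
        for (String e : events) {
            double[] w = WEIGHTS.get(e);
            if (w != null) {
                p = Math.max(p, w[0]);
                c += w[1];
            }
        }
        return new double[] {p, 1.0 + c};
    }

    public static void main(String[] args) {
        // click (0.5) + add2cart (1.0) -> P = 1.0, C = 1 + 1.5 = 2.5
        System.out.println(Arrays.toString(cell(Arrays.asList("click", "add2cart"))));
    }
}
```

A flip of the preference mapping for one behavior type (as discussed below for mis-signed axes) would just change its entry from {1.0, w} to {0.0, w'} and re-run validation.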
Since there is not much test data (we should still assume we will have just a handful of cross-validation runs), the various "online" convex search techniques like SGD or BFGS are not going to be very viable. What I was thinking is that maybe we can start running parallel tries and fit the data to paraboloids (i.e., second-degree polynomial regression without interaction terms). That might be a big assumption, but it would be enough. Of course, we may discover hyperbolic-paraboloid behavior along some parameter axes, in which case it would mean we got the preference wrong, so we flip the preference mapping (i.e., click = (P=1, C=0.5) would flip into click = (P=0, C=0...)) and re-validate. This is a multidimensional variation of the one-parameter second-degree polynomial fitting that Raphael referred to once. We are taking on a lot of assumptions here (parameter independence, existence of a good global optimum, etc.). Perhaps there's something better to automate that search?

Thanks.
-Dmitriy
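The parallel-tries idea above (fit cross-validation error along each parameter axis to a second-degree polynomial, and flip the preference mapping when the fit opens downward) could be sketched per axis as below. This is a hypothetical standalone sketch under the stated no-interaction-terms assumption; `AxisFit` and its method names are not Mahout code:

```java
public class AxisFit {

    /**
     * Least-squares fit of err = a*x^2 + b*x + c to (x, err) points
     * along a single parameter axis, via the 3x3 normal equations.
     * Returns {a, b, c}.
     */
    static double[] fitQuadratic(double[] xs, double[] ys) {
        double s1 = 0, s2 = 0, s3 = 0, s4 = 0, t0 = 0, t1 = 0, t2 = 0;
        for (int i = 0; i < xs.length; i++) {
            double x = xs[i], y = ys[i];
            s1 += x; s2 += x * x; s3 += x * x * x; s4 += x * x * x * x;
            t0 += y; t1 += x * y; t2 += x * x * y;
        }
        double[][] a = {{s4, s3, s2}, {s3, s2, s1}, {s2, s1, xs.length}};
        double[] rhs = {t2, t1, t0};
        return solve3(a, rhs);
    }

    // Gaussian elimination with partial pivoting for a 3x3 system.
    static double[] solve3(double[][] a, double[] b) {
        for (int i = 0; i < 3; i++) {
            int p = i;
            for (int r = i + 1; r < 3; r++)
                if (Math.abs(a[r][i]) > Math.abs(a[p][i])) p = r;
            double[] tmp = a[i]; a[i] = a[p]; a[p] = tmp;
            double tb = b[i]; b[i] = b[p]; b[p] = tb;
            for (int r = i + 1; r < 3; r++) {
                double f = a[r][i] / a[i][i];
                for (int c = i; c < 3; c++) a[r][c] -= f * a[i][c];
                b[r] -= f * b[i];
            }
        }
        double[] x = new double[3];
        for (int i = 2; i >= 0; i--) {
            double s = b[i];
            for (int j = i + 1; j < 3; j++) s -= a[i][j] * x[j];
            x[i] = s / a[i][i];
        }
        return x;
    }

    public static void main(String[] args) {
        // Synthetic cross-validation errors err(x) = (x - 2)^2 + 1:
        // a proper bowl whose minimum sits at x = 2.
        double[] xs = {0, 1, 2, 3, 4};
        double[] ys = new double[xs.length];
        for (int i = 0; i < xs.length; i++) ys[i] = (xs[i] - 2) * (xs[i] - 2) + 1;
        double[] abc = fitQuadratic(xs, ys);
        double a = abc[0], b = abc[1];
        if (a > 0)
            System.out.println("bowl: minimum near x = " + (-b / (2 * a)));
        else
            System.out.println("opens downward along this axis: flip the preference mapping and re-validate");
    }
}
```

Under the parameter-independence assumption, each axis gets its own fit from a handful of parallel cross-validation runs; a negative leading coefficient along an axis is the hyperbolic-paraboloid symptom that triggers the preference flip for that parameter.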
> Spark Bindings (DRM)
> --------------------
>
>                 Key: MAHOUT-1346
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.8
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
> Spark bindings for Mahout DRM.
> DRM DSL.
> Disclaimer. This will all be experimental at this point.
> The idea is to wrap DRM by Spark RDD with support of some basic
> functionality, perhaps some humble beginning of Cost-based optimizer:
> (0) Spark serialization support for Vector, Matrix
> (1) Bagel transposition
> (2) slim X'X
> (2a) not-so-slim X'X
> (3) blockify() (compose RDD containing vertical blocks of original input)
> (4) read/write Mahout DRM off HDFS
> (5) A'B
> ...

--
This message was sent by Atlassian JIRA
(v6.1#6144)