[
https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820475#comment-13820475
]
Dmitriy Lyubimov edited comment on MAHOUT-1346 at 11/12/13 9:50 PM:
--------------------------------------------------------------------
https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala
I started moving some things there. In particular, ALS is still not there
(I still haven't hashed it out with my boss), but there are some initial
matrix algorithms to be picked up (even transposition can be blockified and
improved). Anyone want to give me a hand on this?
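To give a flavor of what picking one of these up looks like, here is a hedged
sketch of the "slim X'X" item from the plan quoted below, written against a
plain RDD[Array[Double]] instead of the real DRM types (all names here are
illustrative, not the committed DSL):

{code}
// Sketch only: "slim" X'X for a tall, skinny X. X'X is ncol x ncol and small,
// so each partition accumulates row outer products in core, and the partial
// sums are then merged.
import org.apache.spark.rdd.RDD

def slimXtX(x: RDD[Array[Double]], ncol: Int): Array[Array[Double]] =
  x.aggregate(Array.ofDim[Double](ncol, ncol))(
    seqOp = (acc, row) => {
      var i = 0
      while (i < ncol) {
        var j = 0
        while (j < ncol) { acc(i)(j) += row(i) * row(j); j += 1 }
        i += 1
      }
      acc
    },
    combOp = (a, b) => {
      for (i <- 0 until ncol; j <- 0 until ncol) a(i)(j) += b(i)(j)
      a
    })
{code}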
Please don't pick up weighted ALS-WR yet; I still hope to finish porting it
myself. There are more interesting questions there, like parameter validation
and fitting.
A common problem I have: suppose you take the implicit feedback approach.
Then you reformulate it in terms of preference (P) and confidence (C) inputs.
The original paper describes a specific scheme for forming C that includes
one parameter they want to fit.
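For reference, the single-parameter scheme in question (assuming this is the
Hu/Koren/Volinsky implicit-feedback paper) derives both inputs from the raw
observation counts r_ui, roughly:

  p_ui = 1 if r_ui > 0, else 0
  c_ui = 1 + alpha * r_ui

where alpha is the one free parameter to fit by cross-validation.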
The more interesting question is: what if we have more than one parameter?
That is, what if we have a bunch of user behaviors, say an item search, a
browse, a click, an add-to-cart, and finally an acquisition. That's a whole
bunch of parameters for forming the confidence of a user's preference. For
example, since every transaction is preceded by an add-to-cart, it is
reasonable to assume that an add-to-cart generally signifies a positive
preference (we are just far less confident about it). Then again, an
abandoned cart may also signify a negative preference, or nothing at all.
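As a concrete, entirely hypothetical illustration of such a multi-parameter
mapping (the event names and the idea of per-event confidence weights are
made up here, not anything committed):

{code}
// Hypothetical sketch: each observable behavior maps to a (preference,
// confidence) pair; the confidences in theta are the free parameters we
// would rather fit by cross-validation than guess.
sealed trait UserEvent
case object Search        extends UserEvent
case object Browse        extends UserEvent
case object Click         extends UserEvent
case object Add2Cart      extends UserEvent
case object Purchase      extends UserEvent
case object AbandonedCart extends UserEvent

def toPrefConf(e: UserEvent, theta: Map[UserEvent, Double]): (Double, Double) =
  e match {
    case Purchase      => (1.0, theta(Purchase))      // strong positive signal
    case Add2Cart      => (1.0, theta(Add2Cart))      // positive, less certain
    case Click         => (1.0, theta(Click))
    case Browse        => (1.0, theta(Browse))
    case Search        => (1.0, theta(Search))
    case AbandonedCart => (0.0, theta(AbandonedCart)) // negative, or just noise
  }
{code}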
Anyway, suppose we want to explore what's worth what. The natural way to do
it is, again, through cross-validation. Posing such a problem presents a
whole new look at "Big Data ML" problems: now we are using distributed
processing not just because the input might be big, but also because we have
a lot of parameter space exploration to do (even if the single-run problem
is not so big), and thus we produce more interesting analytical results.
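A minimal sketch of that inversion, distributing the tries rather than the
data; crossValidate is a hypothetical stand-in for one full fit-plus-holdout
evaluation that runs locally on a worker:

{code}
import org.apache.spark.SparkContext

// grid: candidate parameter vectors theta; the cluster runs many small,
// independent fits in parallel and we keep the best holdout score.
def explore(sc: SparkContext,
            grid: Seq[Map[String, Double]],
            crossValidate: Map[String, Double] => Double)
  : (Map[String, Double], Double) =
  sc.parallelize(grid)
    .map(theta => (theta, crossValidate(theta))) // one holdout score per try
    .collect()
    .maxBy(_._2)                                 // best-scoring theta
{code}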
However, since there are many parameters, the task becomes quite a bit more
interesting. Since there is not much test data (we should still assume we
will get just a handful of cross-validation runs), the various "online"
convex search techniques like SGD or BFGS are not going to be very viable.
What I was thinking is that maybe we can run parallel tries and fit the
resulting data to paraboloids (i.e., second-degree polynomial regression
without interaction terms). That might be a big assumption, but it would be
enough to get a general sense of where the global maximum may lie, even on
inputs of a fairly small size. Of course, we may discover hyperbolic
paraboloid properties along some parameter axes, in which case it would mean
we got the preference wrong, so we flip the preference mapping (e.g., click =
(P=1, C=0.5) would flip into click = (P=0, C=0...)) and re-validate. This is
a kind of multidimensional variation of the one-parameter second-degree
polynomial fitting that Raphael referred to once.
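A hedged sketch of the per-axis part of that fit using Mahout's in-core math
(fitQuadratic and axisOptimum are made-up names): regress the holdout score
on one parameter and let the sign of the quadratic coefficient say whether
we see a cap or the hyperbolic-paraboloid symptom:

{code}
import org.apache.mahout.math.{DenseMatrix, QRDecomposition}

// Least-squares fit of score ~ b0 + b1*x + b2*x^2 along one parameter axis.
def fitQuadratic(xs: Array[Double], ys: Array[Double]): (Double, Double, Double) = {
  val a = new DenseMatrix(xs.length, 3)
  val b = new DenseMatrix(xs.length, 1)
  for (i <- xs.indices) {
    a.set(i, 0, 1.0); a.set(i, 1, xs(i)); a.set(i, 2, xs(i) * xs(i))
    b.set(i, 0, ys(i))
  }
  val beta = new QRDecomposition(a).solve(b)
  (beta.get(0, 0), beta.get(1, 0), beta.get(2, 0))
}

// b2 < 0: concave along this axis; the vertex -b1/(2*b2) estimates the
// optimum. b2 >= 0: convex along this axis -- the "got the preference wrong"
// signal, so flip the preference mapping for that behavior and re-validate.
def axisOptimum(xs: Array[Double], ys: Array[Double]): Option[Double] = {
  val (_, b1, b2) = fitQuadratic(xs, ys)
  if (b2 < 0) Some(-b1 / (2 * b2)) else None
}
{code}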
We are taking on a lot of assumptions here (parameter independence, the
existence of a good global maximum, etc.). Perhaps there is something better
for automating that search?
Thanks,
-Dmitriy
> Spark Bindings (DRM)
> --------------------
>
> Key: MAHOUT-1346
> URL: https://issues.apache.org/jira/browse/MAHOUT-1346
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.8
> Reporter: Dmitriy Lyubimov
> Assignee: Dmitriy Lyubimov
> Fix For: Backlog
>
>
> Spark bindings for Mahout DRM.
> DRM DSL.
> Disclaimer. This will all be experimental at this point.
> The idea is to wrap DRM by Spark RDD with support of some basic
> functionality, perhaps some humble beginning of Cost-based optimizer
> (0) Spark serialization support for Vector, Matrix
> (1) Bagel transposition
> (2) slim X'X
> (2a) not-so-slim X'X
> (3) blockify() (compose RDD containing vertical blocks of original input)
> (4) read/write Mahout DRM off HDFS
> (5) A'B
> ...