[
https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977877#comment-13977877
]
Sebastian Schelter commented on MAHOUT-1518:
--------------------------------------------
[~andrew.musselman] I'll try to give a short summary. The basic problem is that
a lot of our algorithms expect the input to be in a nice vectorized format. The
cooccurrence analysis, for example, expects the input to be DRMs (distributed
row matrices). Many users, however, have data with string keys, want to run an
algorithm on it, and also want the results keyed by their string ids again.
The question now is how to achieve that. One way would be to put the burden on
the users and have them do all the conversion themselves. Obviously this is very
bad and would prevent a lot of people from using Mahout. So ideally, we give the
users some easy-to-use machinery to help them convert their data. This is a
super important point: I've seen colleagues from my office pick up Mahout and
spend much more time converting data than actually analyzing it. IIRC, one of
the features that Sean announced when he started his Myrrix system was also the
ability to seamlessly use string identifiers.
Pat wrote some "wishes" on how he would like to be able to use the cooccurrence
recommenders, and in this jira I wrote some custom conversion code in Spark
that parses his proposed input format and creates DRMs and dictionaries from
it. Ted then argued, correctly, that it is not a good design to have custom
input formats and preprocessing for every algorithm. I completely agree, as
this is one of the reasons why the MR codebase is so hard to use. The
discussion is now converging on the idea that we need something like a
dataframe: a data structure that is able to hold arbitrarily typed data and
offers a limited set of manipulation primitives. The open question is still
how to "marry" this with the DRMs.
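
To make the conversion step concrete, here is a minimal sketch of the idea of
building dictionaries from string ids to matrix indices and mapping results
back again. It is plain Python, not the Spark code from the attached patch and
not the Mahout Scala DSL; the function names (build_index, parse_interactions,
to_sparse_rows, invert) and the third sample row are made up for illustration,
and a dict of sets stands in for a real DRM.

```python
import csv
import io

# Hypothetical sample in the schema "timestamp, userId, itemId, action".
# The first two rows follow the issue description; the third is invented.
SAMPLE = '''timestamp1, userIdString1, itemIdString1, "view"
timestamp2, userIdString2, itemIdString1, "like"
timestamp3, userIdString2, itemIdString2, "view"
'''


def build_index(keys):
    """Assign consecutive integer indices to string keys in first-seen order."""
    index = {}
    for key in keys:
        if key not in index:
            index[key] = len(index)
    return index


def parse_interactions(text, action):
    """Return (userId, itemId) string pairs for rows with the given action."""
    pairs = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for record in reader:
        if len(record) != 4:
            continue  # skip blank or malformed rows
        _timestamp, user, item, act = (field.strip() for field in record)
        if act == action:
            pairs.append((user, item))
    return pairs


def to_sparse_rows(pairs):
    """Build a sparse user-by-item structure plus both dictionaries.

    The result maps row index -> set of column indices, a stand-in for
    the vectorized input an algorithm would expect.
    """
    users = build_index(user for user, _ in pairs)
    items = build_index(item for _, item in pairs)
    rows = {}
    for user, item in pairs:
        rows.setdefault(users[user], set()).add(items[item])
    return rows, users, items


def invert(index):
    """Reverse a dictionary so results can be keyed by string ids again."""
    return {position: key for key, position in index.items()}


if __name__ == "__main__":
    view_pairs = parse_interactions(SAMPLE, "view")
    rows, users, items = to_sparse_rows(view_pairs)
    item_names = invert(items)
    for row, cols in sorted(rows.items()):
        print(invert(users)[row], [item_names[c] for c in sorted(cols)])
```

The point of keeping the dictionaries next to the matrix is that any
algorithm output expressed in row or column indices can be translated back to
the users' own string ids with a single lookup.
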
> Preprocessing for collaborative filtering with the Scala DSL
> ------------------------------------------------------------
>
> Key: MAHOUT-1518
> URL: https://issues.apache.org/jira/browse/MAHOUT-1518
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Reporter: Sebastian Schelter
> Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1518.patch
>
>
> The aim here is to provide some easy-to-use machinery to enable the usage of
> the new Cooccurrence Analysis code from MAHOUT-1464 with datasets represented
> as follows in a CSV file with the schema _timestamp, userId, itemId, action_,
> e.g.
> {code}
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> {code}