[ 
https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977877#comment-13977877
 ] 

Sebastian Schelter commented on MAHOUT-1518:
--------------------------------------------

[~andrew.musselman] I'll try to give a short summary. The basic problem is that 
a lot of our algorithms expect the input to be in a nice vectorized format. The 
cooccurrence analysis, for example, expects the input to be DRMs (distributed 
row matrices). Many users, however, will have data with string keys, want to 
run an algorithm on it, and also want the results keyed by their string ids 
again.

The question is now how to achieve that. One way would be to put this burden on 
the users and have them do all the conversion themselves. Obviously this is 
very bad and prevents a lot of people from using Mahout. So ideally, we give 
the users some easy-to-use machinery to help them with converting their data. 
This is a super important point: I've seen colleagues from my office pick up 
Mahout and spend much more time converting data than actually analyzing it. 
IIRC, one of the features that Sean announced when he started his Myrrix system 
was also the ability to seamlessly use string identifiers.

Pat wrote some "wishes" on how he would like to be able to use the cooccurrence 
recommenders, and in this jira I wrote some custom conversion code in Spark 
that parses his proposed input format and creates DRMs and dictionaries from 
it. Ted then argued, correctly, that it is not a good design to have custom 
input formats and preprocessing for every algorithm. I completely agree with 
that, as this is one of the reasons why the MR codebase is so hard to use. The 
discussion is now moving toward the idea that we need something like a 
dataframe, a data structure that is able to hold arbitrarily typed data and 
offers a limited set of manipulation primitives. The open question is still how 
to "marry" this with the DRMs.
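
To make the "dictionaries" part concrete, here is a minimal sketch in plain 
Scala of the kind of conversion discussed above. It is not the code from the 
attached patch and uses neither the Spark nor the Mahout API. All names 
(StringIdEncodingSketch, Interaction, Encoded, parse, buildDictionary, encode, 
invert) are hypothetical, and it assumes the comma-separated schema from the 
issue description (timestamp, userId, itemId, action) with no commas inside 
fields.

```scala
object StringIdEncodingSketch {

  // One parsed line of the assumed schema: timestamp, userId, itemId, action.
  final case class Interaction(timestamp: String, user: String, item: String, action: String)

  // Integer-keyed cells plus the dictionaries needed to translate
  // row and column indices back to the original string ids.
  final case class Encoded(
      cells: Seq[(Int, Int)],            // (rowIndex, columnIndex) per observed interaction
      userDictionary: Map[String, Int],
      itemDictionary: Map[String, Int])

  // Split on commas, trim whitespace, and strip surrounding double quotes.
  // Lines that do not have exactly four fields are skipped.
  def parse(line: String): Option[Interaction] = {
    val fields = line.split(",").map(_.trim.stripPrefix("\"").stripSuffix("\""))
    if (fields.length == 4) Some(Interaction(fields(0), fields(1), fields(2), fields(3)))
    else None
  }

  // Assign consecutive indices to distinct keys in order of first appearance.
  def buildDictionary(keys: Seq[String]): Map[String, Int] =
    keys.distinct.zipWithIndex.toMap

  // Keep only the interactions with the given action and encode them.
  def encode(lines: Seq[String], action: String): Encoded = {
    val interactions = lines.flatMap(parse).filter(_.action == action)
    val users = buildDictionary(interactions.map(_.user))
    val items = buildDictionary(interactions.map(_.item))
    val cells = interactions.map(i => (users(i.user), items(i.item)))
    Encoded(cells, users, items)
  }

  // Invert a dictionary so results can be keyed by string ids again.
  def invert(dictionary: Map[String, Int]): Map[Int, String] =
    dictionary.map(_.swap)
}
```

The (rowIndex, columnIndex) cells are what would be loaded into an integer-keyed 
matrix, and invert gives the mapping for reporting results under the original 
string ids. A distributed version would have to build the dictionaries across 
partitions, which this sketch does not attempt.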


> Preprocessing for collaborative filtering with the Scala DSL
> ------------------------------------------------------------
>
>                 Key: MAHOUT-1518
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1518
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1518.patch
>
>
> The aim here is to provide some easy-to-use machinery to enable the usage of 
> the new Cooccurrence Analysis code from MAHOUT-1464 with datasets represented 
> as follows in a CSV file with the schema _timestamp, userId, itemId, action_, 
> e.g.
> {code}
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
