[ 
https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983463#comment-13983463
 ] 

Sebastian Schelter commented on MAHOUT-1518:
--------------------------------------------

I thought about this issue and I think a generic solution could work as 
follows: 

# We have a generic dataframe that allows you to load your CSV file and specify 
a schema for it: the first column has name "timestamp" and type long, the second 
column has name "userid" and type string, the third has name "itemid" and type 
string, and the fourth column has name "interaction" and type string or some 
enumeration type.
# the dataframe can be filtered by column values, so we could for example 
create a new dataframe with all rows where interaction equals "view"
# we can extract a DRM (distributed row matrix) from the dataframe, e.g. by 
specifying a dataframe column to use as the matrix row index and a dataframe 
column to use as the matrix column index; this would give us something similar 
to the IndexedDataset, a DRM plus two bidirectional dictionaries 
# we feed the DRM into the cooccurrence code and retrieve the result as DRM
# we have another method that converts the result DRM back to a generic 
dataframe using the bidirectional dictionary
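
The five steps above can be sketched in plain Scala. This is only an in-memory illustration, not Mahout code: {{DataFrame}}, {{Dictionary}}, {{IndexedMatrix}} and every method name are hypothetical, the "DRM" is a plain set of (row, column) index pairs, and {{cooccurrences}} just counts item pairs that share a user as a placeholder for the MAHOUT-1464 cooccurrence code. Timestamps are assumed to be numeric.

```scala
object PreprocessingSketch {

  // One parsed CSV line with the schema timestamp, userid, itemid, interaction.
  final case class Interaction(timestamp: Long, userId: String, itemId: String, action: String)

  // Hypothetical stand-in for the generic dataframe (steps 1 and 2).
  final case class DataFrame(rows: Vector[Interaction]) {
    def filter(p: Interaction => Boolean): DataFrame = DataFrame(rows.filter(p))
  }

  // Bidirectional dictionary between external string IDs and matrix indices.
  final case class Dictionary(toIndex: Map[String, Int]) {
    val toId: Map[Int, String] = toIndex.map(_.swap)
  }

  object Dictionary {
    def fromIds(ids: Seq[String]): Dictionary = Dictionary(ids.distinct.zipWithIndex.toMap)
  }

  // In-memory stand-in for a DRM plus its two dictionaries (step 3).
  final case class IndexedMatrix(cells: Set[(Int, Int)], rowDict: Dictionary, colDict: Dictionary)

  // Step 1: load CSV lines, stripping whitespace and double quotes from each field.
  def parseCsv(lines: Seq[String]): DataFrame = {
    val rows = lines.map(line => line.split(",").map(field => field.trim.replace("\"", ""))).collect {
      case Array(ts, user, item, action) => Interaction(ts.toLong, user, item, action)
    }
    DataFrame(rows.toVector)
  }

  // Step 3: choose which columns become the matrix row and column index.
  def toMatrix(df: DataFrame,
               rowKey: Interaction => String,
               colKey: Interaction => String): IndexedMatrix = {
    val rowDict = Dictionary.fromIds(df.rows.map(rowKey))
    val colDict = Dictionary.fromIds(df.rows.map(colKey))
    val cells = df.rows.map(r => (rowDict.toIndex(rowKey(r)), colDict.toIndex(colKey(r)))).toSet
    IndexedMatrix(cells, rowDict, colDict)
  }

  // Step 4 (placeholder): for each pair of distinct columns, count the rows that contain both.
  def cooccurrences(m: IndexedMatrix): Map[(Int, Int), Int] = {
    val colsByRow: Map[Int, Set[Int]] =
      m.cells.groupBy(_._1).map { case (row, cs) => row -> cs.map(_._2) }
    val pairs = for {
      cols <- colsByRow.values.toSeq
      i <- cols.toSeq
      j <- cols.toSeq
      if i != j
    } yield (i, j)
    pairs.groupBy(identity).map { case (pair, hits) => pair -> hits.size }
  }

  // Step 5: translate matrix indices back to the original string IDs.
  def toIdTriples(result: Map[(Int, Int), Int], dict: Dictionary): Set[(String, String, Int)] =
    result.map { case ((i, j), n) => (dict.toId(i), dict.toId(j), n) }.toSet
}
```

A call chain would then be {{parseCsv}}, {{filter(_.action == "view")}}, {{toMatrix(views, _.userId, _.itemId)}}, {{cooccurrences}}, and finally {{toIdTriples}} with the column dictionary, since the result is an item-by-item matrix.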

Does that make sense?

> Preprocessing for collaborative filtering with the Scala DSL
> ------------------------------------------------------------
>
>                 Key: MAHOUT-1518
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1518
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1518.patch
>
>
> The aim here is to provide some easy-to-use machinery to enable the usage of 
> the new Cooccurrence Analysis code from MAHOUT-1464 with datasets represented 
> as follows in a CSV file with the schema _timestamp, userId, itemId, action_, 
> e.g.
> {code}
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
