[
https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983463#comment-13983463
]
Sebastian Schelter commented on MAHOUT-1518:
--------------------------------------------
I thought about this issue and I think a generic solution could work as
follows:
# We have a generic dataframe that lets you load your CSV file and specify a
schema for it: the first column has name "timestamp" and type long, the second
has name "userid" and type string, the third has name "itemid" and type
string, and the fourth has name "interaction" and type string or some
enumeration type.
# The dataframe can be filtered by column values, so we could, for example,
create a new dataframe with all rows where interaction equals "view".
# We can extract a DRM from the dataframe, e.g. by specifying one
dataframe column to use as the matrix row index and another to use as the
matrix column index. This would give us something similar to the
IndexedDataset: a DRM plus two bidirectional dictionaries.
# We feed the DRM into the cooccurrence code and retrieve the result as a DRM.
# We have another method that converts the result DRM back to a generic
dataframe using the bidirectional dictionaries.
Does that make sense?
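
To make the five steps concrete, here is a rough single-machine sketch in plain
Python. It is not Mahout code and not the Scala DSL: the sample rows, the
build_dictionary helper, and the in-memory "matrix" are all assumptions made
for illustration, and the cooccurrence step is a stand-in that only counts raw
co-occurrences rather than calling the MAHOUT-1464 code.

```python
from collections import defaultdict

# Step 1: rows following the proposed schema
# (timestamp: long, userid: string, itemid: string, interaction: string).
# The values are made up for this example.
rows = [
    (1, "u1", "i1", "view"),
    (2, "u1", "i2", "view"),
    (3, "u2", "i1", "view"),
    (4, "u2", "i2", "like"),
    (5, "u3", "i1", "view"),
    (6, "u3", "i2", "view"),
]

# Step 2: filter by a column value, keeping only "view" interactions.
views = [r for r in rows if r[3] == "view"]


def build_dictionary(values):
    """Hypothetical helper: bidirectional dictionary (id -> index, index -> id),
    assigning indices in first-seen order."""
    to_index = {}
    for v in values:
        to_index.setdefault(v, len(to_index))
    to_id = {i: v for v, i in to_index.items()}
    return to_index, to_id


# Step 3: extract a sparse user-by-item matrix plus the two dictionaries.
user_index, user_of = build_dictionary(r[1] for r in views)
item_index, item_of = build_dictionary(r[2] for r in views)
matrix = defaultdict(set)  # row index -> set of column indices
for _, user, item, _ in views:
    matrix[user_index[user]].add(item_index[item])

# Step 4: stand-in for the cooccurrence code. For every pair of distinct
# items, count the users who interacted with both.
cooccurrence = defaultdict(int)
for items in matrix.values():
    for a in items:
        for b in items:
            if a != b:
                cooccurrence[(a, b)] += 1

# Step 5: translate matrix indices back to the original item ids.
result = sorted((item_of[a], item_of[b], n) for (a, b), n in cooccurrence.items())
print(result)
```

The "like" row is dropped by the filter in step 2, so u2 contributes only i1
and the pair (i1, i2) is counted for u1 and u3. Keeping both directions of each
dictionary is what allows step 5 to report string ids instead of matrix
indices.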
> Preprocessing for collaborative filtering with the Scala DSL
> ------------------------------------------------------------
>
> Key: MAHOUT-1518
> URL: https://issues.apache.org/jira/browse/MAHOUT-1518
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Reporter: Sebastian Schelter
> Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1518.patch
>
>
> The aim here is to provide some easy-to-use machinery to enable the usage of
> the new Cooccurrence Analysis code from MAHOUT-1464 with datasets represented
> as follows in a CSV file with the schema _timestamp, userId, itemId, action_,
> e.g.
> {code}
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> {code}
--
This message was sent by Atlassian JIRA
(v6.2#6252)