[~ssc] makes sense. Is this still thought to be a stop-gap?
On Mon, Apr 28, 2014 at 12:50 PM, Sebastian Schelter (JIRA) <[email protected]> wrote:

>
> [ https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983463#comment-13983463 ]
>
> Sebastian Schelter commented on MAHOUT-1518:
> --------------------------------------------
>
> I thought about this issue and I think a generic solution could work as
> follows:
>
> # We have a generic dataframe that allows you to load your CSV file and
>   specify a schema for that: first column has name "timestamp" and type
>   long, second column has name "userid" and type string, third has name
>   "itemid" and type string, fourth column has name "interaction" and type
>   string or some enumeration type.
> # The dataframe can be filtered by column values, so we could for example
>   create a new dataframe with all rows where interaction equals "view".
> # We can extract a DRM from the dataframe, e.g. by specifying a
>   dataframe column to use as matrix row index and a dataframe column to
>   use as matrix column index. This would give us something similar to the
>   IndexedDataset: a DRM plus two bidirectional dictionaries.
> # We feed the DRM into the cooccurrence code and retrieve the result as a
>   DRM.
> # We have another method that converts the result DRM back to a generic
>   dataframe using the bidirectional dictionary.
>
> Does that make sense?
>
> > Preprocessing for collaborative filtering with the Scala DSL
> > ------------------------------------------------------------
> >
> >                 Key: MAHOUT-1518
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-1518
> >             Project: Mahout
> >          Issue Type: New Feature
> >          Components: Collaborative Filtering
> >            Reporter: Sebastian Schelter
> >            Assignee: Sebastian Schelter
> >             Fix For: 1.0
> >
> >         Attachments: MAHOUT-1518.patch
> >
> >
> > The aim here is to provide some easy-to-use machinery to enable the
> > usage of the new Cooccurrence Analysis code from MAHOUT-1464 with
> > datasets represented as follows in a CSV file with the schema
> > _timestamp, userId, itemId, action_, e.g.
> > {code}
> > timestamp1, userIdString1, itemIdString1, "view"
> > timestamp2, userIdString2, itemIdString1, "like"
> > {code}
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>
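
For reference, the five steps in the quoted comment can be walked through on a toy input. This is only a rough, in-memory sketch in plain Python, not Mahout code or the Scala DSL: the sample rows, the `Dictionary` class, and every function name here are made up for illustration, the "DRM" is just a dict of sets, and a raw pair count stands in for the cooccurrence code from MAHOUT-1464, whose actual scoring is not reproduced.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the issue's schema: timestamp, userId, itemId, action.
SAMPLE = """\
1, u1, iA, view
2, u1, iB, view
3, u2, iA, view
4, u2, iB, like
5, u3, iA, view
6, u3, iB, view
"""

# Column names and types as listed in step 1 of the comment.
SCHEMA = [("timestamp", int), ("userid", str), ("itemid", str), ("interaction", str)]


def load(text, schema):
    """Step 1: parse CSV text into a list of typed row dicts."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        {name: cast(cell.strip()) for (name, cast), cell in zip(schema, row)}
        for row in reader
        if row  # skip blank lines
    ]


def filter_rows(rows, column, value):
    """Step 2: keep only the rows whose column equals value."""
    return [r for r in rows if r[column] == value]


class Dictionary:
    """Bidirectional mapping between external string IDs and matrix indices."""

    def __init__(self):
        self.to_index = {}  # external ID -> index
        self.to_id = []     # index -> external ID

    def index(self, key):
        if key not in self.to_index:
            self.to_index[key] = len(self.to_id)
            self.to_id.append(key)
        return self.to_index[key]


def to_matrix(rows, row_col, col_col):
    """Step 3: sparse binary matrix {row index: set of column indices} plus two dictionaries."""
    row_dict, col_dict = Dictionary(), Dictionary()
    matrix = defaultdict(set)
    for r in rows:
        matrix[row_dict.index(r[row_col])].add(col_dict.index(r[col_col]))
    return dict(matrix), row_dict, col_dict


def cooccurrence(matrix):
    """Step 4 stand-in: count, for each ordered pair of distinct columns, the rows containing both."""
    counts = defaultdict(int)
    for cols in matrix.values():
        for i in cols:
            for j in cols:
                if i != j:
                    counts[(i, j)] += 1
    return dict(counts)


def to_rows(counts, col_dict):
    """Step 5: translate index pairs back to the original item IDs."""
    return sorted((col_dict.to_id[i], col_dict.to_id[j], c) for (i, j), c in counts.items())


views = filter_rows(load(SAMPLE, SCHEMA), "interaction", "view")
matrix, users, items = to_matrix(views, "userid", "itemid")
print(to_rows(cooccurrence(matrix), items))
# prints [('iA', 'iB', 2), ('iB', 'iA', 2)]
```

The point of the sketch is the shape of the round trip: string IDs are mapped to indices on the way in, and the same item dictionary is used to map the result back on the way out, which is what the "two bidirectional dictionaries" in step 3 are for.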
