Re: [jira] [Commented] (MAHOUT-1518) Preprocessing for collaborative filtering with the Scala DSL

Sebastian Schelter Mon, 28 Apr 2014 13:10:18 -0700

No, I think we need a generic dataframe, but we also need functionality to extract matrices from such a thing and convert matrices to such a thing. For the collaborative filtering code, the functionality should be pretty easy; I guess the problems start when more sophisticated vectorization is necessary.

One of the major failures of the old codebase was that every algorithm brought its own preprocessing, so we should work towards an integrated solution for the new modules.

I'm collecting ideas atm :) I talked to a guy from a Berlin-based company recently about the new directions of Mahout and he had some interesting ideas for the dataframe. He said ideally he would want to point it to HCatalog and have it directly load data from Hive (as far as possible, I don't see us supporting nested data for example). I'm planning to do a deeper read of the MLI paper and see how their ideas would fit with this.


--sebastian


On 04/28/2014 10:01 PM, Dmitriy Lyubimov wrote:

[~ssc] makes sense. Is this still thought to be a stop-gap?


On Mon, Apr 28, 2014 at 12:50 PM, Sebastian Schelter (JIRA) <[email protected]> wrote:


     [ https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983463#comment-13983463 ]

Sebastian Schelter commented on MAHOUT-1518:
--------------------------------------------

I thought about this issue and I think a generic solution could work as
follows:

# We have a generic dataframe that allows you to load your CSV file and
specify a schema for that: first column has name "timestamp" and type long,
second column has name "userid" and type string, third has name "itemid"
and type string, fourth column has name "interaction" and type "string" or
some enumeration type.
# the dataframe can be filtered by column values, so we could for example
create a new dataframe with all rows where interaction equals "view"
# we can extract a DRM from the dataframe, e.g. by specifying a
dataframe-column to use as matrix row index and a dataframe-column to use
as matrix column index. This would give us something similar to the
IndexedDataset: a DRM plus two bidirectional dictionaries
# we feed the DRM into the cooccurrence code and retrieve the result as DRM
# we have another method that converts the result DRM back to a generic
dataframe using the bidirectional dictionary
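
The five steps above could be sketched roughly as follows. This is plain Python used only to illustrate the data flow, not Mahout code: the names `load_csv`, `filter_rows`, `BiDict`, `extract_matrix`, `cooccurrence` and `to_rows` are hypothetical, a dict of sets stands in for the DRM, and a simple pair count stands in for the actual cooccurrence code.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# Assumed schema from step 1: (column name, type) in file order.
SCHEMA = [("timestamp", int), ("userid", str), ("itemid", str), ("interaction", str)]


def load_csv(text, schema=SCHEMA):
    """Step 1: parse CSV text into a list of typed row dicts (the 'dataframe')."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        {name: cast(value.strip()) for (name, cast), value in zip(schema, row)}
        for row in reader
        if row
    ]


def filter_rows(rows, column, value):
    """Step 2: keep only the rows whose column equals the given value."""
    return [r for r in rows if r[column] == value]


class BiDict:
    """Bidirectional dictionary between external string IDs and matrix indices."""

    def __init__(self):
        self.to_index = {}  # external ID -> index
        self.to_id = []     # index -> external ID

    def index_of(self, external_id):
        # Assign the next free index the first time an ID is seen.
        if external_id not in self.to_index:
            self.to_index[external_id] = len(self.to_id)
            self.to_id.append(external_id)
        return self.to_index[external_id]


def extract_matrix(rows, row_column, col_column):
    """Step 3: sparse binary matrix {row index: set of column indices} plus two dictionaries."""
    row_ids, col_ids = BiDict(), BiDict()
    matrix = defaultdict(set)
    for r in rows:
        matrix[row_ids.index_of(r[row_column])].add(col_ids.index_of(r[col_column]))
    return dict(matrix), row_ids, col_ids


def cooccurrence(matrix):
    """Step 4 (stand-in): count how many matrix rows contain each pair of columns."""
    counts = defaultdict(int)
    for cols in matrix.values():
        for a, b in combinations(sorted(cols), 2):
            counts[(a, b)] += 1
    return dict(counts)


def to_rows(counts, col_ids):
    """Step 5: translate matrix indices back to the original item IDs."""
    return [
        {"item_a": col_ids.to_id[a], "item_b": col_ids.to_id[b], "count": n}
        for (a, b), n in sorted(counts.items())
    ]


# Made-up sample data following the schema above.
data = """1, u1, i1, view
2, u1, i2, view
3, u2, i1, view
4, u2, i2, view
5, u2, i3, like
"""
views = filter_rows(load_csv(data), "interaction", "view")
matrix, users, items = extract_matrix(views, "userid", "itemid")
result = to_rows(cooccurrence(matrix), items)
print(result)
```

The two `BiDict` instances are what make the round trip possible: the matrix code only ever sees integer indices, and the original user and item IDs are restored at the very end.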

Does that make sense?

Preprocessing for collaborative filtering with the Scala DSL
------------------------------------------------------------

                 Key: MAHOUT-1518
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1518
             Project: Mahout
          Issue Type: New Feature
          Components: Collaborative Filtering
            Reporter: Sebastian Schelter
            Assignee: Sebastian Schelter
             Fix For: 1.0

         Attachments: MAHOUT-1518.patch


The aim here is to provide some easy-to-use machinery to enable the
usage of the new Cooccurrence Analysis code from MAHOUT-1464 with datasets
represented as follows in a CSV file with the schema _timestamp, userId,
itemId, action_, e.g.

{code}
timestamp1, userIdString1, itemIdString1, "view"
timestamp2, userIdString2, itemIdString1, "like"
{code}




--
This message was sent by Atlassian JIRA
(v6.2#6252)
