No, I think we need a generic dataframe, but we also need functionality to extract matrices from it and to convert matrices back into it. For the collaborative filtering code, that functionality should be pretty easy. I guess the problems start when more sophisticated vectorization is necessary.

One of the major failures of the old codebase was that every algorithm brought its own preprocessing, so we should work towards an integrated solution for the new modules.

I'm collecting ideas atm :) I talked to a guy from a Berlin-based company recently about the new directions of Mahout and he had some interesting ideas for the dataframe. He said ideally he would want to point it to HCatalog and have it directly load data from Hive (as far as possible, I don't see us supporting nested data for example). I'm planning to do a deeper read of the MLI paper and see how their ideas would fit with this.

--sebastian


On 04/28/2014 10:01 PM, Dmitriy Lyubimov wrote:
[~ssc] makes sense. Is this still thought to be a stop-gap?


On Mon, Apr 28, 2014 at 12:50 PM, Sebastian Schelter (JIRA) wrote:


     [ https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983463#comment-13983463 ]

Sebastian Schelter commented on MAHOUT-1518:
--------------------------------------------

I thought about this issue and I think a generic solution could work as
follows:

# We have a generic dataframe that allows you to load your CSV file and
specify a schema for that: first column has name "timestamp" and type long,
second column has name "userid" and type string, third has name "itemid"
and type string, fourth column has name "interaction" and type "string" or
some enumeration type.
# the dataframe can be filtered by column values, so we could for example
create a new dataframe with all rows where interaction equals "view"
# we can extract a DRM from the dataframe, e.g. by specifying a
dataframe-column to use as matrix row index and a dataframe-column to use
as matrix column index, this would give us something similar to the
IndexedDataset: a DRM plus two bidirectional dictionaries
# we feed the DRM into the cooccurrence code and retrieve the result as DRM
# we have another method that converts the result DRM back to a generic
dataframe using the bidirectional dictionary
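
To make the steps above concrete, here is a rough sketch of the flow in plain Python. It is not Mahout's Scala DSL and not the actual cooccurrence code: every function name (`filter_rows`, `build_dictionary`, `to_matrix`, `cooccurrence`, `to_rows`) and the sample rows are hypothetical, the "DRM" is just an in-memory dict of sets, and the cooccurrence step is a plain count of item pairs per user standing in for the real analysis.

```python
from collections import defaultdict

# Hypothetical rows following the schema (timestamp, userid, itemid, interaction).
rows = [
    (1, "u1", "iA", "view"),
    (2, "u1", "iB", "view"),
    (3, "u2", "iA", "view"),
    (4, "u2", "iB", "like"),
    (5, "u3", "iB", "view"),
]

def filter_rows(rows, interaction):
    """Keep only the rows whose interaction column equals the given value."""
    return [r for r in rows if r[3] == interaction]

def build_dictionary(values):
    """Assign dense integer indices in first-seen order.

    Returns both directions: (value -> index, index -> value).
    """
    to_index = {}
    for v in values:
        to_index.setdefault(v, len(to_index))
    to_value = {i: v for v, i in to_index.items()}
    return to_index, to_value

def to_matrix(rows):
    """Use userid as matrix row index and itemid as matrix column index."""
    user_idx, user_names = build_dictionary(r[1] for r in rows)
    item_idx, item_names = build_dictionary(r[2] for r in rows)
    # Sparse binary matrix: row index -> set of column indices.
    matrix = defaultdict(set)
    for _, user, item, _ in rows:
        matrix[user_idx[user]].add(item_idx[item])
    return matrix, (user_idx, user_names), (item_idx, item_names)

def cooccurrence(matrix):
    """Count, per item pair, how many users interacted with both items.

    A simple stand-in for the real cooccurrence analysis; the diagonal is dropped.
    """
    counts = defaultdict(int)
    for items in matrix.values():
        for i in items:
            for j in items:
                if i != j:
                    counts[(i, j)] += 1
    return counts

def to_rows(counts, item_names):
    """Translate integer indices back to the original item ids."""
    return sorted((item_names[i], item_names[j], c) for (i, j), c in counts.items())

views = filter_rows(rows, "view")
matrix, users, items = to_matrix(views)
result = to_rows(cooccurrence(matrix), items[1])
```

The point of keeping both directions of each dictionary is that the matrix code only ever sees integers, while the caller gets string ids back at the end.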

Does that make sense?

Preprocessing for collaborative filtering with the Scala DSL
------------------------------------------------------------

                 Key: MAHOUT-1518
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1518
             Project: Mahout
          Issue Type: New Feature
          Components: Collaborative Filtering
            Reporter: Sebastian Schelter
            Assignee: Sebastian Schelter
             Fix For: 1.0

         Attachments: MAHOUT-1518.patch


The aim here is to provide some easy-to-use machinery to enable the
usage of the new Cooccurrence Analysis code from MAHOUT-1464 with datasets
represented as follows in a CSV file with the schema _timestamp, userId,
itemId, action_, e.g.
{code}
timestamp1, userIdString1, itemIdString1, "view"
timestamp2, userIdString2, itemIdString1, "like"
{code}
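
As an illustration of reading such a file against a declared schema, a minimal sketch in plain Python (not the Mahout API) could look like the following. The `SCHEMA` list, the `load` function, and the sample values are assumptions made up for this example; in particular it assumes the timestamp is an integer and that fields are separated by a comma followed by a space.

```python
import csv
import io

# Hypothetical schema: (column name, converter) pairs in file order.
SCHEMA = [("timestamp", int), ("userId", str), ("itemId", str), ("action", str)]

def load(lines, schema=SCHEMA):
    """Parse CSV lines into a list of dicts typed according to the schema."""
    # skipinitialspace drops the blank after each comma so quoted fields parse.
    reader = csv.reader(lines, skipinitialspace=True)
    records = []
    for fields in reader:
        if not fields:
            continue  # ignore empty lines
        if len(fields) != len(schema):
            raise ValueError(f"expected {len(schema)} fields, got {len(fields)}")
        records.append(
            {name: convert(value) for (name, convert), value in zip(schema, fields)}
        )
    return records

# Made-up sample input in the layout described above.
sample = io.StringIO(
    '1398715200, user1, item1, "view"\n'
    '1398715260, user2, item1, "like"\n'
)
records = load(sample)
```

Rejecting rows with the wrong number of fields up front keeps malformed input from silently shifting columns.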



--
This message was sent by Atlassian JIRA
(v6.2#6252)


