[ 
https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982011#comment-13982011
 ] 

Pat Ferrel commented on MAHOUT-1518:
------------------------------------

Given the comments here and the separate effort in MAHOUT-1490 to mimic the R 
dataframe as closely as possible, it seems like IndexedDataset should be a 
wrapper that may contain (but not inherit from) DRMs or some future 
dataframe-like things. The IndexedDataset will probably be parameterized by the 
DRM-like type and the types of the things stored in the ID dictionaries. 

That way the concerns of the DRM-like thing and the IndexedDataset are kept 
separate. There will be overlap but the primary concern of IndexedDataset would 
be import/export and supplying a parameter to a Mahout 'job'. This would 
probably include iteration by row or column but leave 'head', 'tail', 'slice', 
and the matrix ops to the contained object. It would also include lookup by 
ordinal int key OR external key.
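
To make the containment idea concrete, here is a minimal sketch, not an actual 
Mahout API. Every name in it (IdDictionary, RowAccess, ToyMatrix, and the 
method names) is hypothetical, and a plain in-memory Vector stands in for a 
DRM. It only shows a wrapper parameterized by the matrix-like type and the ID 
types, with lookup by ordinal int key or by external key.

```scala
// Hypothetical sketch: every name below is illustrative, not Mahout API.
object IndexedDatasetSketch {

  // Two-way dictionary between external keys and ordinal Int keys.
  final class IdDictionary[K](keys: IndexedSeq[K]) {
    private val toOrdinal: Map[K, Int] = keys.zipWithIndex.toMap
    def ordinal(key: K): Option[Int] = toOrdinal.get(key)
    def external(i: Int): Option[K] = keys.lift(i)
  }

  // The only capability the wrapper asks of the contained matrix-like value.
  trait RowAccess[M] {
    def numRows(m: M): Int
    def row(m: M, i: Int): Map[Int, Double] // sparse row: column ordinal -> value
  }

  // Contains (does not inherit from) the matrix-like value M.
  final class IndexedDataset[M, R, C](
      val matrix: M,
      val rowIds: IdDictionary[R],
      val columnIds: IdDictionary[C])(implicit access: RowAccess[M]) {

    // Lookup by ordinal int key; None when the ordinal is out of range.
    def rowByOrdinal(i: Int): Option[Map[Int, Double]] =
      if (i >= 0 && i < access.numRows(matrix)) Some(access.row(matrix, i)) else None

    // Lookup by external key; column ordinals are translated back as well.
    def rowByExternalKey(key: R): Option[Map[C, Double]] =
      rowIds.ordinal(key).flatMap(rowByOrdinal).map { sparse =>
        sparse.toSeq.flatMap { case (j, v) =>
          columnIds.external(j).map(c => (c, v)).toList
        }.toMap
      }
  }

  // Toy in-memory stand-in for a DRM, used only to exercise the wrapper.
  type ToyMatrix = Vector[Map[Int, Double]]
  implicit val toyAccess: RowAccess[ToyMatrix] = new RowAccess[ToyMatrix] {
    def numRows(m: ToyMatrix): Int = m.size
    def row(m: ToyMatrix, i: Int): Map[Int, Double] = m(i)
  }

  val dataset = new IndexedDataset[ToyMatrix, String, String](
    Vector(Map(0 -> 1.0, 1 -> 1.0), Map(1 -> 1.0)),
    new IdDictionary(Vector("userA", "userB")),
    new IdDictionary(Vector("item1", "item2")))

  def main(args: Array[String]): Unit = {
    println(dataset.rowByOrdinal(1))
    println(dataset.rowByExternalKey("userA"))
  }
}
```

Head, tail, slice, and the matrix ops are deliberately absent from the wrapper 
in this sketch; a caller would reach them through the contained `matrix` value.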

The primary use of IndexedDataset will be to create, from a file, an RDD-backed 
thing used to do what the current Mahout CLI jobs do: ItemSimilarity, 
RowSimilarity, Recommenders, Transpose, Cluster, and whatever else makes 
sense. 

There will also probably be another set of classes to define the behavior of 
various sources/sinks like TextDelimited, SeqFile, and probably others, maybe 
even DBs.

Put the source/sink together with IndexedDataset inside a CLI and you have 
format- and language-agnostic use of Mahout.
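
A rough sketch of how a source, a sink, and a job could be wired together 
follows. It is an assumption about the shape, not existing Mahout code: the 
Source and Sink traits, the two TextDelimited classes, `run`, and the toy 
`itemCounts` job are all made-up names, and a plain Seq of (row key, column 
key) pairs stands in for the dataset.

```scala
// Hypothetical sketch: the traits and classes below are not Mahout API.
object SourceSinkSketch {

  // External (row key, column key) interactions, the simplest dataset stand-in.
  type Interactions = Seq[(String, String)]

  trait Source[A] { def read(): A }
  trait Sink[A] { def write(data: A): Unit }

  // Reads delimited text lines; rowCol and colCol pick the key columns.
  // Note that String.split treats the delimiter as a regular expression.
  final class TextDelimitedSource(
      lines: Iterable[String],
      delimiter: String,
      rowCol: Int,
      colCol: Int) extends Source[Interactions] {
    def read(): Interactions =
      lines.toSeq.map(_.split(delimiter).map(_.trim)).collect {
        case fields if fields.length > math.max(rowCol, colCol) =>
          (fields(rowCol), fields(colCol))
      }
  }

  // Writes key/count pairs as delimited text into an in-memory buffer.
  final class TextDelimitedSink(delimiter: String) extends Sink[Map[String, Int]] {
    val out = new StringBuilder
    def write(data: Map[String, Int]): Unit =
      data.toSeq.sortBy(_._1).foreach { case (key, n) =>
        out.append(s"$key$delimiter$n\n")
      }
  }

  // A "job" sees only the abstractions, never the file format.
  def run[A, B](source: Source[A], sink: Sink[B])(job: A => B): Unit =
    sink.write(job(source.read()))

  // Toy job: number of interactions per item.
  def itemCounts(interactions: Interactions): Map[String, Int] =
    interactions.groupBy(_._2).map { case (item, xs) => item -> xs.size }

  def demo(): String = {
    val source = new TextDelimitedSource(
      Seq("t1, userA, item1, view", "t2, userB, item1, like", "t3, userA, item2, view"),
      ",", rowCol = 1, colCol = 2)
    val sink = new TextDelimitedSink(",")
    run(source, sink)(itemCounts)
    sink.out.toString
  }

  def main(args: Array[String]): Unit = print(demo())
}
```

Swapping TextDelimited for a SeqFile or DB source would then mean supplying 
another Source implementation while `run` and the job stay unchanged.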

This is about all the Solr-recommender example does, so I'll start there since 
we now have some data. Its input is a directory structure filled with log 
files; its output is a couple of CSVs for Solr indexing.

Please comment if this doesn't make sense but remember the use case, which is 
on pretty solid ground.

> Preprocessing for collaborative filtering with the Scala DSL
> ------------------------------------------------------------
>
>                 Key: MAHOUT-1518
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1518
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1518.patch
>
>
> The aim here is to provide some easy-to-use machinery to enable the usage of 
> the new Cooccurrence Analysis code from MAHOUT-1464 with datasets represented 
> as follows in a CSV file with the schema _timestamp, userId, itemId, action_, 
> e.g.
> {code}
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> {code}
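
As an illustration of the preprocessing that schema implies, the sketch below 
parses such lines, assigns ordinal int IDs to the user and item strings, and 
groups the resulting (user, item) cells by action. It is only an assumption 
about one reasonable approach: Event, `parse`, `dictionary`, and 
`cellsByAction` are hypothetical names, and sharing one user dictionary and 
one item dictionary across all actions is a design choice of the sketch, not 
something stated in the issue.

```scala
// Hypothetical sketch: names and structure are illustrative, not Mahout API.
object ActionLogSketch {

  final case class Event(timestamp: String, userId: String, itemId: String, action: String)

  // Parses one "timestamp, userId, itemId, action" line; None if malformed.
  def parse(line: String): Option[Event] =
    line.split(",").map(_.trim.stripPrefix("\"").stripSuffix("\"")) match {
      case Array(ts, user, item, action) => Some(Event(ts, user, item, action))
      case _ => None
    }

  // Assigns ordinal Int IDs in order of first appearance.
  def dictionary(keys: Seq[String]): Map[String, Int] =
    keys.distinct.zipWithIndex.toMap

  // One set of (userOrdinal, itemOrdinal) cells per action,
  // with both dictionaries shared across actions.
  def cellsByAction(events: Seq[Event])
      : (Map[String, Int], Map[String, Int], Map[String, Set[(Int, Int)]]) = {
    val users = dictionary(events.map(_.userId))
    val items = dictionary(events.map(_.itemId))
    val cells = events.groupBy(_.action).map { case (action, es) =>
      action -> es.map(e => (users(e.userId), items(e.itemId))).toSet
    }
    (users, items, cells)
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "timestamp1, userIdString1, itemIdString1, \"view\"",
      "timestamp2, userIdString2, itemIdString1, \"like\"")
    println(cellsByAction(lines.flatMap(parse)))
  }
}
```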



