[
https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982011#comment-13982011
]
Pat Ferrel commented on MAHOUT-1518:
------------------------------------
Given the comments here and the separate effort in MAHOUT-1490 to mimic the R
dataframe as closely as possible, it seems like IndexedDataset should be a
wrapper that may contain (but not inherit from) DRMs or some future
dataframe-like things. The IndexedDataset will probably be parameterized by the
DRM-like type and the types of the things stored in the ID dictionaries.
That way the concerns of the DRM-like thing and the IndexedDataset are kept
separate. There will be overlap but the primary concern of IndexedDataset would
be import/export and supplying a parameter to a Mahout 'job'. This would
probably include iteration by row or column but leave 'head', 'tail', 'slice',
and the matrix ops to the contained object. It would also include lookup by
ordinal int key OR external key.
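As a rough sketch of the wrapper idea, assuming nothing about the final API: the names IdDictionary and IndexedDataset below are hypothetical, and a nested Vector stands in for the DRM-like thing. It only shows containment (not inheritance), parameterization by the contained type and the ID types, and lookup in both directions.

```scala
// Hypothetical sketch: these names are illustrative, not existing Mahout classes.

// Two-way dictionary between external keys and ordinal Int indices.
// Assumes the keys are unique; position in `keys` is the ordinal.
final class IdDictionary[K](keys: IndexedSeq[K]) {
  private val toOrdinal: Map[K, Int] = keys.zipWithIndex.toMap

  def ordinal(key: K): Option[Int] = toOrdinal.get(key)
  def external(index: Int): Option[K] = keys.lift(index)
  def size: Int = keys.size
}

// Wrapper that contains (does not inherit from) a DRM-like value of type M.
// R and C are the types of the external row and column IDs.
final case class IndexedDataset[M, R, C](
    matrix: M,
    rowIds: IdDictionary[R],
    columnIds: IdDictionary[C])

object IndexedDatasetExample {
  // A nested Vector stands in for the DRM here.
  val users = new IdDictionary(Vector("u1", "u2"))
  val items = new IdDictionary(Vector("iA", "iB", "iC"))
  val dataset = IndexedDataset(
    Vector(Vector(1.0, 0.0, 1.0), Vector(0.0, 1.0, 0.0)),
    users,
    items)
}
```

Matrix ops, 'head', 'tail', and 'slice' would stay on whatever type M turns out to be; the wrapper only carries the dictionaries alongside it.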
The primary use of IndexedDataset will be to create, from a file, an RDD-backed
object that can drive the kind of work the current Mahout CLI jobs do:
ItemSimilarity, RowSimilarity, Recommenders, Transpose, Cluster, and whatever
else makes sense.
There will also probably be another set of classes to define the behavior of
various source/sinks like TextDelimited, SeqFile, and probably others, maybe
even DBs.
Put the source/sink together with IndexedDataset inside a CLI and you have
format- and language-agnostic use of Mahout.
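To make the source/sink idea a bit more concrete, here is a minimal sketch under heavy assumptions: Source, Sink, TextDelimited, and transcode are hypothetical names, input is an iterator of lines rather than a real file or RDD, and no quoting or escaping is handled.

```scala
import java.util.regex.Pattern

// Hypothetical sketch: Source, Sink and TextDelimited are illustrative names.
object SourceSinkSketch {
  type Table = Vector[Vector[String]]

  trait Source[A] { def read(lines: Iterator[String]): A }
  trait Sink[A] { def write(data: A): Iterator[String] }

  // Plain delimited text; blank lines are skipped and fields are trimmed.
  final class TextDelimited(delimiter: String)
      extends Source[Table] with Sink[Table] {
    // Quote the delimiter so it is matched literally, not as a regex.
    private val splitter = Pattern.quote(delimiter)

    def read(lines: Iterator[String]): Table =
      lines.filter(_.trim.nonEmpty)
        .map(_.split(splitter, -1).map(_.trim).toVector)
        .toVector

    def write(data: Table): Iterator[String] =
      data.iterator.map(_.mkString(delimiter))
  }

  // A "job" only sees the abstract source and sink, so formats can be swapped.
  def transcode[A](in: Source[A], out: Sink[A],
                   lines: Iterator[String]): Iterator[String] =
    out.write(in.read(lines))
}
```

A CLI would then just pick a Source and a Sink from its arguments and hand them to the job, which is what keeps the job itself format agnostic.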
This is about all the Solr-recommender example does so I'll start there since
we now have some data. Its input is a directory structure filled with log
files; its output is a couple of CSVs for Solr indexing.
Please comment if this doesn't make sense but remember the use case, which is
on pretty solid ground.
> Preprocessing for collaborative filtering with the Scala DSL
> ------------------------------------------------------------
>
> Key: MAHOUT-1518
> URL: https://issues.apache.org/jira/browse/MAHOUT-1518
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Reporter: Sebastian Schelter
> Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1518.patch
>
>
> The aim here is to provide some easy-to-use machinery to enable the usage of
> the new Cooccurrence Analysis code from MAHOUT-1464 with datasets represented
> as follows in a CSV file with the schema _timestamp, userId, itemId, action_,
> e.g.
> {code}
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> {code}