[
https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975336#comment-13975336
]
Pat Ferrel commented on MAHOUT-1518:
------------------------------------
[~ssc] +100 (how many votes do I get? :-) To a Spark/Scala programmer
import/export becomes trivial. I'd hope we could appeal to anyone who might use
Solr, which has extremely broad appeal.
[~tdunning] A Web GUI is obviously a far better way to do the UI. Clearly this
works for a lot of Solr and Elasticsearch users.
I'd definitely like to help on this. Looking at Sebastian's full example with
import and cooccurrence you can also imagine the output phase. Everything
internal to Mahout would use in-memory and Scala APIs. If Mahout supported
something like the IndexedDataset then doing the imports and even building
import configs would be pretty easy indeed. Internal to Mahout, things like
transpose, cluster, cooccurrence, and recommenders could all support the data
type with almost zero overhead, since it contains a regular old DRM and the
indexes are only moved around or ignored except during import and export.
The IndexedDataset looks to me like a good first stab at the data frame idea.
It's exactly what I was proposing on the dev list in the CLI discussion (except
we need BiMaps so lookups work in both directions). How about the name
IndexedDistributedMatrix, since the row part of DRM is no longer appropriate?
It supports lookup by external index too, slick.
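To make the shape of the idea concrete, here is a minimal sketch of a matrix
that carries its ID dictionaries along. It is not the code from the attached
patch and not a Mahout API: every name in it (IndexedDatasetSketch,
BiDictionary, fromCsv, toTriples) is hypothetical, and a plain in-memory Map
stands in for the DRM. It only illustrates that the dictionaries are touched
at import, export, and external-ID lookup, while an operation like transpose
just swaps them.

```scala
// Hypothetical sketch: none of these names are Mahout APIs.
object IndexedDatasetSketch {

  // Two-way dictionary between external string IDs and dense Int indexes.
  final case class BiDictionary(toIndex: Map[String, Int], toId: Vector[String]) {
    def index(id: String): Option[Int] = toIndex.get(id)
    def id(index: Int): Option[String] = toId.lift(index)
    def size: Int = toId.size
  }

  object BiDictionary {
    // Indexes are assigned in order of first appearance.
    def fromIds(ids: Iterable[String]): BiDictionary = {
      val distinct = ids.toVector.distinct
      BiDictionary(distinct.zipWithIndex.toMap, distinct)
    }
  }

  // Stand-in for a DRM: row index -> (column index -> value).
  type SparseRows = Map[Int, Map[Int, Double]]

  final case class IndexedDataset(matrix: SparseRows,
                                  rowIds: BiDictionary,
                                  columnIds: BiDictionary) {

    // The dictionaries are not rebuilt here, only swapped.
    def transpose: IndexedDataset = {
      val cells = for {
        (r, cols) <- matrix.toSeq
        (c, v)    <- cols.toSeq
      } yield (c, r, v)
      val flipped = cells.groupBy(_._1).map { case (c, cs) =>
        c -> cs.map { case (_, r, v) => r -> v }.toMap
      }
      IndexedDataset(flipped, columnIds, rowIds)
    }

    // Lookup by external IDs.
    def get(rowId: String, columnId: String): Option[Double] =
      for {
        r <- rowIds.index(rowId)
        c <- columnIds.index(columnId)
        v <- matrix.get(r).flatMap(_.get(c))
      } yield v

    // Export: translate indexes back to external IDs.
    def toTriples: Seq[(String, String, Double)] =
      for {
        (r, cols) <- matrix.toSeq
        (c, v)    <- cols.toSeq
      } yield (rowIds.toId(r), columnIds.toId(c), v)
  }

  // Import: parse "timestamp, userId, itemId, action" lines and keep the
  // rows for one action. Naive comma split, no escaping handled.
  def fromCsv(lines: Seq[String], action: String): IndexedDataset = {
    def clean(s: String): String = s.trim.stripPrefix("\"").stripSuffix("\"")
    val events = lines.map(_.split(",").map(clean)).collect {
      case Array(_, user, item, act) if act == action => (user, item)
    }
    val users = BiDictionary.fromIds(events.map(_._1))
    val items = BiDictionary.fromIds(events.map(_._2))
    val rows = events.groupBy(_._1).map { case (user, es) =>
      users.toIndex(user) -> es.map { case (_, item) => items.toIndex(item) -> 1.0 }.toMap
    }
    IndexedDataset(rows, users, items)
  }
}
```

With something like this, IndexedDatasetSketch.fromCsv(lines, "view") gives a
user-by-item matrix, and .transpose gives the item-by-user one without
touching the ID mappings.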
> Preprocessing for collaborative filtering with the Scala DSL
> ------------------------------------------------------------
>
> Key: MAHOUT-1518
> URL: https://issues.apache.org/jira/browse/MAHOUT-1518
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Reporter: Sebastian Schelter
> Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1518.patch
>
>
> The aim here is to provide some easy-to-use machinery to enable the usage of
> the new Cooccurrence Analysis code from MAHOUT-1464 with datasets represented
> as follows in a CSV file with the schema _timestamp, userId, itemId, action_,
> e.g.
> {code}
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> {code}