[ 
https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975336#comment-13975336
 ] 

Pat Ferrel commented on MAHOUT-1518:
------------------------------------

[~ssc] +100 (how many votes do I get? :-)  To a Spark/Scala programmer 
import/export becomes trivial. I'd hope we could appeal to anyone who might use 
Solr, which has extremely broad appeal.

[~tdunning] A Web GUI is obviously a far better way to do the UI. Clearly this 
works for a lot of Solr and Elasticsearch users.

I'd definitely like to help on this. Looking at Sebastian's full example with 
import and cooccurrence you can also imagine the output phase. Everything 
internal to Mahout would use in-memory and Scala APIs. If Mahout supported 
something like the IndexedDataset then doing the imports and even building 
import configs would be pretty easy indeed. Inside Mahout, things like 
transpose, cluster, cooccurrence, and recommenders could all support the data 
type with almost zero overhead, since it wraps a regular old DRM and the 
indexes are only moved around or ignored except during import and export.
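
To make the "almost zero overhead" point concrete, here is a rough sketch of 
the idea, not the code from the patch. Every name in it (IndexedMatrix, 
fromPairs, toTriples) is made up for illustration, and a plain Map stands in 
for the DRM. The ID indexes are built on import, read on export, and merely 
swapped by transpose.

```scala
// Hypothetical sketch: none of these names are Mahout APIs.
// A plain Map of (row, col) -> value stands in for the DRM.
final case class IndexedMatrix(
    cells: Map[(Int, Int), Double], // stand-in for the DRM
    rowIds: Vector[String],         // internal row index -> external ID
    colIds: Vector[String]) {       // internal column index -> external ID

  // The matrix does the real work; the ID indexes are only swapped.
  def transpose: IndexedMatrix =
    IndexedMatrix(cells.map { case ((r, c), v) => ((c, r), v) }, colIds, rowIds)

  // Export: translate internal indexes back to external IDs.
  def toTriples: Seq[(String, String, Double)] =
    cells.toSeq
      .sortBy(_._1)
      .map { case ((r, c), v) => (rowIds(r), colIds(c), v) }
}

object IndexedMatrix {
  // Import: assign internal indexes in order of first appearance.
  def fromPairs(pairs: Seq[(String, String)]): IndexedMatrix = {
    val rowIds = pairs.map(_._1).distinct.toVector
    val colIds = pairs.map(_._2).distinct.toVector
    val rowIx = rowIds.zipWithIndex.toMap
    val colIx = colIds.zipWithIndex.toMap
    val cells = pairs.map { case (u, i) => ((rowIx(u), colIx(i)), 1.0) }.toMap
    IndexedMatrix(cells, rowIds, colIds)
  }
}
```

Anything between import and export (here only transpose) never has to look 
inside the ID vectors, which is why carrying them around should cost next to 
nothing.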

The IndexedDataset looks to me like a good first stab at the data frame idea. 
It's exactly what I was proposing on the dev list in the CLI discussion (except 
we need bi-directional BiMaps). How about the name IndexedDistributedMatrix, 
since the row part of DRM is no longer appropriate? It supports lookup by 
external index too, slick.
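
A minimal sketch of what a bi-directional index could look like, assuming 
string external IDs and dense Int internal indexes. IdDictionary is a made-up 
name, not a Mahout or Guava class; a real implementation might simply use a 
BiMap from an existing library.

```scala
// Hypothetical sketch: a minimal bidirectional dictionary,
// not a Mahout or Guava class.
final class IdDictionary private (
    private val toIndex: Map[String, Int], // external ID -> internal index
    private val toId: Vector[String]) {    // internal index -> external ID

  def indexOf(id: String): Option[Int] = toIndex.get(id)
  def idOf(index: Int): Option[String] = toId.lift(index)
  def size: Int = toId.size
}

object IdDictionary {
  // Duplicate IDs are dropped; indexes follow order of first appearance.
  def apply(ids: Iterable[String]): IdDictionary = {
    val unique = ids.toVector.distinct
    new IdDictionary(unique.zipWithIndex.toMap, unique)
  }
}
```

For example, IdDictionary(Seq("u1", "u2")).indexOf("u2") gives Some(1), and 
idOf(1) maps it back to Some("u2"). Unknown IDs and out-of-range indexes give 
None.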

> Preprocessing for collaborative filtering with the Scala DSL
> ------------------------------------------------------------
>
>                 Key: MAHOUT-1518
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1518
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1518.patch
>
>
> The aim here is to provide some easy-to-use machinery to enable the usage of 
> the new Cooccurrence Analysis code from MAHOUT-1464 with datasets represented 
> as follows in a CSV file with the schema _timestamp, userId, itemId, action_, 
> e.g.
> {code}
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)