[ 
https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975177#comment-13975177
 ] 

Pat Ferrel commented on MAHOUT-1518:
------------------------------------

[~ssc] Yes, this is really excellent. It has most of the pieces necessary. 
Some further generalizations might be good. For instance, I think a BiMap 
(Guava) is ultimately needed, since the index lookup will go in both 
directions: on input and later on output.
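
To make the two-way lookup concrete, here is a minimal sketch in plain Scala. 
The IdDictionary class and its method names are hypothetical and are not part 
of the patch. Guava's HashBiMap with its inverse() view would cover the same 
need; the version below only avoids the dependency for illustration.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of MAHOUT-1518.patch: a two-way dictionary
// between external string IDs and contiguous Int indices (matrix rows/columns).
final class IdDictionary {
  private val idToIndex = mutable.HashMap.empty[String, Int]
  private val indexToId = mutable.ArrayBuffer.empty[String]

  /** Returns the index already assigned to `id`, or assigns the next free one. */
  def indexOf(id: String): Int =
    idToIndex.getOrElseUpdate(id, {
      indexToId += id
      indexToId.size - 1
    })

  /** Reverse lookup, needed when results are written back with the original IDs. */
  def idOf(index: Int): Option[String] =
    if (index >= 0 && index < indexToId.size) Some(indexToId(index)) else None

  /** Number of distinct IDs seen so far. */
  def size: Int = indexToId.size
}
```

On input, indexOf("userIdString1") would hand out a row index; on output, 
idOf(index) would translate that index back to the user's own ID string.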

I had in mind adding a few things like a CLI and some kind of DSL for 
describing input and output. The DSL could be generated by an interactive tool 
that would ask the user a few questions. The DSL could be created by hand or 
bypassed completely by passing in the correct CLI options. So it works a little 
like property XML files but with an interactive generator. I wouldn't use XML, 
of course; more likely some simple, easily readable Scala definitions.

This would allow almost any user to easily integrate their data, without 
learning Scala, Spark, SequenceFiles, etc. and would allow them to fit into 
their existing dataflow and tools.
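
As a rough idea of what such "readable Scala definitions" might look like, 
here is a sketch that describes the CSV layout from the issue description 
(_timestamp, userId, itemId, action_) and parses one line with it. Every name 
in it (InputSchema, Interaction, InputParser) is an assumption made for 
illustration; nothing like this exists in Mahout or in the patch.

```scala
import java.util.regex.Pattern

// Hypothetical input description; defaults follow the CSV layout in the issue:
// timestamp, userId, itemId, action
final case class InputSchema(
    delimiter: String = ",",
    userIdColumn: Int = 1,
    itemIdColumn: Int = 2,
    actionColumn: Int = 3,
    actions: Set[String] = Set("view", "like"))

final case class Interaction(userId: String, itemId: String, action: String)

object InputParser {
  /** Parses one line; returns None for short lines or actions the schema does not list. */
  def parse(schema: InputSchema, line: String): Option[Interaction] = {
    // Quote the delimiter so characters such as "|" are not treated as regex syntax.
    val fields = line.split(Pattern.quote(schema.delimiter), -1).map(_.trim)
    val needed = Seq(schema.userIdColumn, schema.itemIdColumn, schema.actionColumn)
    if (needed.exists(i => i < 0 || i >= fields.length)) None
    else {
      // Drop optional surrounding double quotes around the action value.
      val action = fields(schema.actionColumn).stripPrefix("\"").stripSuffix("\"")
      if (schema.actions.contains(action))
        Some(Interaction(fields(schema.userIdColumn), fields(schema.itemIdColumn), action))
      else None
    }
  }
}
```

An interactive generator or a set of CLI options would then only need to 
produce the InputSchema values; the user never touches the parsing code.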

[~tdunning] "Processing programs should deal with those data structures and 
mostly should not read files at all." Well _someone_ has to deal with files and 
this example does what you asked for. It reads them into in-memory Spark 
constructs and then communicates with the rest of Mahout using them and the 
Scala API. I don't see your issue.

This is exactly what is missing from Mahout. Whether it belongs in Mahout or as 
a wrapper project can be debated. Machine Learning will become as ubiquitous as 
Search and target the same level of users without requiring them to preprocess 
their data by learning and writing Pig or Hive code or writing in an R-like 
Scala dialect. The first scalable ML project to have or attract support for 
this will be the most widely adopted--like Solr and ElasticSearch.

> Preprocessing for collaborative filtering with the Scala DSL
> ------------------------------------------------------------
>
>                 Key: MAHOUT-1518
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1518
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1518.patch
>
>
> The aim here is to provide some easy-to-use machinery to enable the usage of 
> the new Cooccurrence Analysis code from MAHOUT-1464 with datasets represented 
> as follows in a CSV file with the schema _timestamp, userId, itemId, action_, 
> e.g.
> {code}
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
