[ 
https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975164#comment-13975164
 ] 

Ted Dunning commented on MAHOUT-1518:
-------------------------------------

I would be dubious of file formatting utilities.  That is what Hive and Pig and 
Impala and Drill and Tajo and so many other projects already do.

What the new stuff needs is code that reads a few standard formats into an 
in-memory structure that is like a data frame in R (some of this is already 
there).  

Processing programs should deal with those data structures and mostly should 
not read files at all.

The preferred path for users with a new format would be to simply read their 
data into memory.  They might adapt the Mahout input code or reformat their 
on-disk data using a tool dedicated to the purpose.  But I really think that 
Mahout shouldn't be in the business of transforming data files at all.
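
As a rough illustration of that split, here is a minimal sketch in Python 
(not Mahout code). A small reader loads the CSV layout from the issue 
description into in-memory columns, and the processing step sees only that 
structure. The names read_interactions and items_by_user are hypothetical, 
and the plain dict of columns is an assumed stand-in for a real data frame.

```python
import csv
import io
from collections import defaultdict


def read_interactions(lines):
    """Parse 'timestamp, userId, itemId, action' rows into in-memory columns.

    Hypothetical helper: a dict of equal-length lists stands in for a
    data-frame-like structure.
    """
    columns = {"timestamp": [], "userId": [], "itemId": [], "action": []}
    # skipinitialspace lets the reader handle the space after each comma,
    # including the one in front of a quoted field such as "view".
    for row in csv.reader(lines, skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        timestamp, user_id, item_id, action = (field.strip() for field in row)
        columns["timestamp"].append(timestamp)
        columns["userId"].append(user_id)
        columns["itemId"].append(item_id)
        columns["action"].append(action)
    return columns


def items_by_user(columns, action):
    """Processing step: works on the in-memory columns and never opens a file."""
    result = defaultdict(set)
    rows = zip(columns["userId"], columns["itemId"], columns["action"])
    for user_id, item_id, row_action in rows:
        if row_action == action:
            result[user_id].add(item_id)
    return dict(result)


# Any line source works here: an open file, a socket, or an in-memory buffer.
sample = io.StringIO(
    'timestamp1, userIdString1, itemIdString1, "view"\n'
    'timestamp2, userIdString2, itemIdString1, "like"\n'
)
frame = read_interactions(sample)
views = items_by_user(frame, "view")
```

A user with a different on-disk format would replace only the reader. 
items_by_user stays unchanged because it depends on the column structure 
alone.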

> Preprocessing for collaborative filtering with the Scala DSL
> ------------------------------------------------------------
>
>                 Key: MAHOUT-1518
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1518
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1518.patch
>
>
> The aim here is to provide some easy-to-use machinery to enable the usage of 
> the new Cooccurrence Analysis code from MAHOUT-1464 with datasets represented 
> as follows in a CSV file with the schema _timestamp, userId, itemId, action_, 
> e.g.
> {code}
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> {code}


