[ https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975164#comment-13975164 ]
Ted Dunning commented on MAHOUT-1518:
-------------------------------------
I would be dubious of file formatting utilities. That is what Hive and Pig and
Impala and Drill and Tajo and so many other projects already do.

What the new stuff needs is code that sucks in a few standard formats into an
in-memory structure that is like a data frame in R (some of this is already
there).

Processing programs should deal with those data structures and mostly should
not read files at all.

The preferred path for users with a new format would be to simply read their
data into memory. They might adapt the Mahout input code or reformat their
on-disk data using a tool dedicated to the purpose. But I really think that
Mahout shouldn't be in the business of transforming data files at all.
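
To make the separation concrete, here is a minimal sketch in Python (chosen only for brevity; it is not Mahout code and not the Scala DSL). The names read_delimited and action_counts are hypothetical, and the column names are borrowed from the schema in the issue description. The reader is the only part that knows about the on-disk format; the processing function sees just the in-memory, column-oriented structure.

```python
import csv
import io
from collections import Counter


def read_delimited(source, columns, delimiter=","):
    """Read delimited text into a column-oriented dict: name -> list of values.

    `source` is any iterable of lines (an open file, a StringIO, a list).
    This is the only function that knows about the external format.
    """
    frame = {name: [] for name in columns}
    reader = csv.reader(source, delimiter=delimiter, skipinitialspace=True)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(columns):
            raise ValueError(
                f"expected {len(columns)} fields, got {len(row)}: {row}"
            )
        for name, value in zip(columns, row):
            frame[name].append(value.strip())
    return frame


def action_counts(frame):
    """Processing step: works on the in-memory frame only, never on files."""
    return Counter(frame["action"])


# An in-memory stand-in for a file; an open file handle would work the same way.
sample = io.StringIO(
    'timestamp1, userIdString1, itemIdString1, "view"\n'
    'timestamp2, userIdString2, itemIdString1, "like"\n'
)
frame = read_delimited(sample, ["timestamp", "userId", "itemId", "action"])
counts = action_counts(frame)
```

A user with a different format would only need to swap read_delimited for their own loader (or reformat the data with an external tool); action_counts would not change.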
> Preprocessing for collaborative filtering with the Scala DSL
> ------------------------------------------------------------
>
> Key: MAHOUT-1518
> URL: https://issues.apache.org/jira/browse/MAHOUT-1518
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Reporter: Sebastian Schelter
> Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1518.patch
>
>
> The aim here is to provide some easy-to-use machinery to enable the usage of
> the new Cooccurrence Analysis code from MAHOUT-1464 with datasets represented
> as follows in a CSV file with the schema _timestamp, userId, itemId, action_,
> e.g.
> {code}
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> {code}
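
For illustration only, one plausible shape for preprocessing rows with this schema is sketched below in Python. It is an assumption about what such machinery might do, not the contents of MAHOUT-1518.patch: string IDs are mapped to dense integer indices and the (user, item) pairs are grouped per action. The names index_of and interactions_by_action are hypothetical.

```python
from collections import defaultdict


def index_of(key, dictionary):
    """Return the dense integer index for a string ID, assigning the next
    free index the first time the ID is seen."""
    return dictionary.setdefault(key, len(dictionary))


def interactions_by_action(rows):
    """Group (user index, item index) pairs by action.

    `rows` is an iterable of (timestamp, userId, itemId, action) tuples that
    have already been read into memory. The timestamp is ignored here.
    """
    user_index, item_index = {}, {}
    by_action = defaultdict(set)
    for _timestamp, user, item, action in rows:
        u = index_of(user, user_index)
        i = index_of(item, item_index)
        by_action[action].add((u, i))
    return dict(by_action), user_index, item_index


rows = [
    ("timestamp1", "userIdString1", "itemIdString1", "view"),
    ("timestamp2", "userIdString2", "itemIdString1", "like"),
]
by_action, user_index, item_index = interactions_by_action(rows)
```

Each per-action set of index pairs could then be turned into a sparse user-by-item matrix, with the two dictionaries kept around to translate indices back to the original ID strings.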