[
https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983565#comment-13983565
]
Pat Ferrel commented on MAHOUT-1518:
------------------------------------
Yes. I did some of this in the Solr-recommender, but everything there was
file-backed; there's no need for that with Spark.
#1: Not sure a whole schema is needed for a text-delimited file; you only need
the data you'll extract, right? Maybe you are saying that some fields should be
extracted and preserved through to output even though they will not be needed
in the actual DRM? In which case I'd say interesting idea. The problem with
allowing fields that are not in the DRM is where to put the data. For instance
the timestamp might be nice to preserve through to the output of something like
clustering, but that would potentially mean keeping a lot of data in memory in
an RDD that is not part of the actual DRM. Although it could be done, I'm not
sure it's something I want to do in the first cut.
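To make the pass-through idea concrete, here's a rough sketch of one way it
could go: read the CSV once, build the DRM input from just the IDs, and park
the extra fields in a separate RDD keyed by the same (userID, itemID) pair so
they can be joined back at output time. Nothing like this exists in Mahout yet;
the names (Interaction, splitInteractions) are only illustrations, assuming the
CSV schema from the issue description.
{code}
// Rough sketch only: keep pass-through fields (e.g. the timestamp) out of the
// DRM but available at output time by holding them in a separate RDD keyed by
// the same (userID, itemID) pair. Names are illustrative, not an existing API.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

case class Interaction(timestamp: String, userID: String, itemID: String, action: String)

def splitInteractions(sc: SparkContext, path: String):
    (RDD[(String, String)], RDD[((String, String), String)]) = {
  val interactions = sc.textFile(path).map { line =>
    val Array(ts, user, item, action) = line.split(",").map(_.trim)
    Interaction(ts, user, item, action)
  }
  // Only the (userID, itemID) pairs go on to build the DRM.
  val drmInput = interactions.map(i => (i.userID, i.itemID))
  // Pass-through data is kept separately and joined back onto results at output time.
  val passThrough = interactions.map(i => ((i.userID, i.itemID), i.timestamp))
  (drmInput, passThrough)
}
{code}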
#2 yes
#3 yes, that's what I was planning
#4 exactly so
#5 ditto
I was thinking along these lines (there's a rough sketch of the pieces right after this list):
# there would be some schema class. Cascading calls these Fields in a Scheme
and has an impenetrable algebra for dealing with the flow of fields. I'm not
planning to imitate that; it should be simple since we are primarily talking
about import/export, not pipelines.
# there would be an object that describes the input type. At first this might
just specify which of the Spark file types to use, but there are some
limitations with them.
# there would be a location URI for input and output.
# there would be a traversal spec. In the Solr-recommender I allow a single
file, a dir of files, or a recursive walk of a tree matching a regex against
file names. The last two methods create lists of URIs. This would allow people
to point Mahout at a log dir or any dir structure the user already has, even a
Hadoop-generated dir of part-xxxx files. Unfortunately Spark seems to only
support a URI with wildcards for multiple input files, so I'm still
investigating that.
# there would be an output location URI, schema, and file type description. The
schema fields would have to match the input ones or map to generated data
fields. The type is text-delimited, sequence file, etc.; these would map to
supported Spark file types.
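To make the list above concrete, here's a rough sketch of what those pieces
might look like. None of these classes exist in Mahout; the names and fields
are just assumptions about a first cut.
{code}
// Illustrative sketch of the specs in the list above; names and fields are
// assumptions, not existing Mahout classes.
import java.io.File
import scala.util.matching.Regex

// 1. Schema: the delimiter and the fields to extract from a text-delimited file.
case class Schema(delimiter: String = ",", fields: Seq[String])

// 2. Input type: which of the Spark-supported file types to read.
sealed trait FileType
case object TextDelimited extends FileType
case object SequenceFileType extends FileType

// 3 + 4. Input location plus traversal spec: a single file, a dir of files, or a
// recursive walk matching a regex against file names, producing a list of URIs.
case class InputSpec(location: String,
                     recursive: Boolean = false,
                     filenamePattern: Regex = ".*".r) {

  def uris: Seq[String] = {
    val root = new File(location)

    def walk(dir: File): Seq[File] = {
      val children = Option(dir.listFiles).map(_.toSeq).getOrElse(Seq.empty[File])
      val files = children.filter(_.isFile)
      val subDirs = if (recursive) children.filter(_.isDirectory).flatMap(walk) else Seq.empty[File]
      files ++ subDirs
    }

    if (root.isFile) Seq(root.getPath)
    else walk(root)
      .filter(f => filenamePattern.findFirstIn(f.getName).isDefined)
      .map(_.getPath)
  }
}

// 5. Output location, schema, and file type description.
case class OutputSpec(location: String, schema: Schema, fileType: FileType = TextDelimited)
{code}
On the multiple-input-files question: if memory serves, sc.textFile will take a
comma-separated list of paths (it goes through Hadoop's FileInputFormat), so the
URI list from the traversal could be passed as uris.mkString(",") rather than
relying on wildcards alone. Worth verifying.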
I was hoping to ultimately allow for reading and writing from/to databases as
well. This is a can of worms, but there are already some ways to do it without
Mahout code knowing about the DB, like mongo-hadoop, which allows Mongo to look
like HDFS files. Anyway, I'm only thinking about this now, not implementing it.
What I did in the Solr-recommender was tack another job on after the output to
bulk-load the DB in parallel with DB-specific code, so a user could do this
themselves if they want. Coming up with a generic way to do this is worth
thinking about, though.
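For reference, a minimal sketch of what the mongo-hadoop route could look like
on the Spark side, assuming the connector's MongoInputFormat is on the
classpath; the URI is a placeholder and none of this is proposed Mahout code.
{code}
// Sketch only: read a Mongo collection as an RDD via the mongo-hadoop connector,
// so Mahout code never has to know about the DB itself. The URI is a placeholder.
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

def readFromMongo(sc: SparkContext, mongoUri: String): RDD[(Object, BSONObject)] = {
  val conf = new Configuration()
  // e.g. "mongodb://host:27017/db.collection"
  conf.set("mongo.input.uri", mongoUri)
  // Each record comes back as (document _id, BSON document).
  sc.newAPIHadoopRDD(conf, classOf[MongoInputFormat], classOf[Object], classOf[BSONObject])
}
{code}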
> Preprocessing for collaborative filtering with the Scala DSL
> ------------------------------------------------------------
>
> Key: MAHOUT-1518
> URL: https://issues.apache.org/jira/browse/MAHOUT-1518
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Reporter: Sebastian Schelter
> Assignee: Sebastian Schelter
> Fix For: 1.0
>
> Attachments: MAHOUT-1518.patch
>
>
> The aim here is to provide some easy-to-use machinery to enable the usage of
> the new Cooccurrence Analysis code from MAHOUT-1464 with datasets represented
> as follows in a CSV file with the schema _timestamp, userId, itemId, action_,
> e.g.
> {code}
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> {code}
--
This message was sent by Atlassian JIRA
(v6.2#6252)