[ 
https://issues.apache.org/jira/browse/MAHOUT-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983565#comment-13983565
 ] 

Pat Ferrel commented on MAHOUT-1518:
------------------------------------

Yes. I did some of this in the Solr-recommender, but everything there was 
file-backed; there's no need for that with Spark.

#1: Not sure a whole schema is needed for a text-delimited file; you only need 
the fields you'll extract, right? Or maybe you are saying that some fields 
should be extracted and preserved through to the output even though they won't 
be needed in the actual DRM? In that case I'd say it's an interesting idea. The 
problem with allowing fields that are not in the DRM is where to put the data. 
For instance, the timestamp might be nice to preserve through to the output of 
something like clustering, but that could mean a lot of data kept in memory in 
an RDD without being part of the actual DRM (see the sketch after these 
answers). It could be done, but I'm not sure it's something I want to do in the 
first cut.
#2 yes
#3 yes, that's what I was planning
#4 exactly so
#5 ditto 
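
To illustrate the preserved-fields point in #1, here is a minimal sketch using 
plain Spark RDDs. All names (Interaction, splitFields) are hypothetical, not 
existing Mahout classes: the extra field is kept in a side RDD keyed the same 
way as the DRM rows, so it can be joined back onto results at export time 
instead of being carried inside the DRM.

{code}
import org.apache.spark.rdd.RDD

// hypothetical record for a "timestamp, userId, itemId, action" line
case class Interaction(timestamp: Long, userId: String, itemId: String, action: String)

def splitFields(lines: RDD[String]): (RDD[(String, (String, String))], RDD[(String, Long)]) = {
  val parsed = lines.map { line =>
    val Array(ts, user, item, action) = line.split(",").map(_.trim)
    Interaction(ts.toLong, user, item, action)
  }
  // the (userId -> (itemId, action)) pairs that actually feed the DRM
  val drmInput = parsed.map(i => (i.userId, (i.itemId, i.action)))
  // timestamp kept only for output, keyed identically so it can be joined back later
  val preserved = parsed.map(i => (i.userId, i.timestamp))
  (drmInput, preserved)
}
// later, on export: results.join(preserved) re-attaches the timestamps
{code}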

I was thinking: 
# there would be some schema class (a rough sketch of these pieces follows the 
list). Cascading calls these Fields in a Scheme and has an impenetrable algebra 
for dealing with the flow of fields. I'm not planning to imitate that; it 
should be simple, since we are primarily talking about import/export, not 
pipelines. 
# there would be an object that describes the input type. At first this might 
just specify which of the Spark file types to use, but there are some 
limitations with them. 
# there would be a location URI for input and output.
# there would be a traversal spec. In the Solr-recommender I allow a single 
file, a dir of files, or recursively walking a tree matching a regex against 
file names. The last two methods create lists of URIs. This would allow people 
to point Mahout at the log dir or any dir structure the user already has, even 
a Hadoop-generated dir of part-xxxx files. Unfortunately Spark seems to only 
support a URI with wildcards for multiple input files, so I'm still 
investigating that.
# there would be an output location URI, schema, and file type description. The 
schema fields would have to match the input ones or map to generated data 
fields. The type would be text-delimited, sequence file, etc.; these would map 
to supported Spark file types.
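
A rough sketch of how these pieces might hang together; every name here 
(Schema, FileType, IOSpec, listFiles) is hypothetical rather than an existing 
Mahout class, and it is shown against the local file system for brevity (an 
HDFS version would use Hadoop's FileSystem API). On the traversal question: 
SparkContext.textFile also accepts a comma-separated list of paths in addition 
to wildcards, which may be enough to cover the recursive-walk case.

{code}
import java.io.File

// schema: just the field names and the delimiter for text-delimited files
case class Schema(fields: Seq[String], delimiter: String = ",")

// file type description, mapping onto supported Spark input/output formats
sealed trait FileType
case object TextDelimited extends FileType
case object SequenceFileType extends FileType

// one spec each for input and output: location URI, schema, file type
case class IOSpec(location: String, schema: Schema, fileType: FileType)

// traversal: recursively collect files whose names match a regex
def listFiles(root: File, pattern: String): Seq[String] = {
  val children = Option(root.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
  val (dirs, files) = children.partition(_.isDirectory)
  files.filter(_.getName.matches(pattern)).map(_.getPath) ++
    dirs.flatMap(listFiles(_, pattern))
}

// the result can be handed to Spark as a comma-separated path list, e.g.
//   sc.textFile(listFiles(new File("/var/logs"), "part-.*").mkString(","))
{code}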

I was hoping to ultimately allow reading from and writing to databases as well. 
This is a can of worms, but there are already some ways to do it without Mahout 
code knowing about the DB, like mongo-hadoop, which makes Mongo look like HDFS 
files. Anyway, I'm only thinking about this now, not implementing it. What I 
did in the Solr-recommender was to tack another job on after the output to 
blast-load the DB in parallel with DB-specific code, so a user could do this 
themselves if they want. Coming up with a generic way to do it is worth 
thinking about, though.


> Preprocessing for collaborative filtering with the Scala DSL
> ------------------------------------------------------------
>
>                 Key: MAHOUT-1518
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1518
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Assignee: Sebastian Schelter
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1518.patch
>
>
> The aim here is to provide some easy-to-use machinery to enable the usage of 
> the new Cooccurrence Analysis code from MAHOUT-1464 with datasets represented 
> as follows in a CSV file with the schema _timestamp, userId, itemId, action_, 
> e.g.
> {code}
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
