[jira] [Comment Edited] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis

Pat Ferrel (JIRA) Sun, 04 May 2014 09:56:36 -0700

    [ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989040#comment-13989040 ]


Pat Ferrel edited comment on MAHOUT-1541 at 5/4/14 4:54 PM:
------------------------------------------------------------

For something as complicated as an R-like dataframe, that's a good approach, 
and I did read it. 

The sole reason for IndexedDataset in my use is import/export. You'll see the 
code on my GitHub in a few days. If the needs match, I'll be happy to merge 
IndexedDataset and/or this driver and import code with whatever comes out of 
1490.

For now I have an actual need for this code in the solr-recommender running on 
the demo site, and the import/export code will have minimal impact on the 
internals of IndexedDataset, so I'm going with it purely for expediency. 
There is no need for slices by label or the like, so there should be little 
duplicated work.

The IndexedDataset is defined as:

{code:language=scala}
import com.google.common.collect.BiMap
// DrmLike comes from the Mahout Scala/Spark bindings.

/**
 * Wraps a Mahout DrmLike object and includes two BiMaps to store translation
 * dictionaries. This may be replaced with a Mahout DSL dataframe-like object
 * in the future. The primary use of this is for import and export, keeping
 * track of external IDs and preserving them all the way to output.
 *
 * Example: For a transpose job the 'matrix: DrmLike[Int]' is passed into the
 * DSL code that transposes the values, then the dictionaries are swapped and
 * a new IndexedDataset is returned from the job, which will be exported to
 * files using the reverse lookup on the labels (labels.inverse().get(id) on
 * a Guava BiMap), thereby preserving the external ID.
 *
 * @param matrix DrmLike[Int], representing the distributed matrix storing the
 *               actual data.
 * @param rowLabels BiMap[String, Int] storing a bidirectional mapping of
 *                  external String ID to and from the ordinal Mahout Int ID.
 *                  This one holds row labels.
 * @param columnLabels BiMap[String, Int] storing a bidirectional mapping of
 *                     external String ID to and from the ordinal Mahout Int
 *                     ID. This one holds column labels.
 */
case class IndexedDataset(
    matrix: DrmLike[Int],
    rowLabels: BiMap[String, Int],
    columnLabels: BiMap[String, Int])
{code}
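
The transpose example in the doc comment can be illustrated with a small local sketch. Everything here is an assumption made for illustration: `LabelDict` and `LocalIndexedDataset` are hypothetical stand-ins that use plain Scala maps in place of Guava BiMaps and an in-memory sparse map in place of a DrmLike. It is not the Mahout implementation.

```scala
// Hypothetical stand-in for a BiMap[String, Int]: a forward map plus its inverse.
final case class LabelDict(forward: Map[String, Int]) {
  require(forward.values.toSet.size == forward.size, "Int IDs must be unique")
  // Reverse lookup: ordinal Int ID -> external String ID.
  val inverse: Map[Int, String] = forward.map(_.swap)
}

// Hypothetical local stand-in for IndexedDataset; the matrix is a sparse
// (row, column) -> value map rather than a distributed DrmLike[Int].
final case class LocalIndexedDataset(
    matrix: Map[(Int, Int), Double],
    rowLabels: LabelDict,
    columnLabels: LabelDict) {

  // Transpose the values, then swap the dictionaries so that the labels
  // keep pointing at the right axis.
  def transpose: LocalIndexedDataset = {
    val flipped = matrix.map { case ((r, c), v) => ((c, r), v) }
    LocalIndexedDataset(flipped, columnLabels, rowLabels)
  }

  // Export as tab-separated text with external IDs reattached.
  def toLines: Seq[String] =
    matrix.toSeq.sortBy(_._1).map { case ((r, c), v) =>
      s"${rowLabels.inverse(r)}\t${columnLabels.inverse(c)}\t$v"
    }
}

val users = LabelDict(Map("u1" -> 0, "u2" -> 1))
val items = LabelDict(Map("iphone" -> 0, "ipad" -> 1))
val dataset = LocalIndexedDataset(
  Map((0, 0) -> 1.0, (0, 1) -> 1.0, (1, 1) -> 1.0), users, items)

// After the transpose, rows are items and columns are users.
val transposed = dataset.transpose
transposed.toLines.foreach(println)
```

Because the dictionaries travel with the matrix, the job itself only ever sees Int IDs; the external IDs reappear only at export time.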

Note that the BiMaps are actually Java HashBiMaps from Guava. That will be 
sufficient for my current needs.
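
To make the import/export role of the dictionaries concrete, here is a minimal sketch of assigning ordinal Int IDs on import and reattaching external IDs on export. It assumes a simple `user,item` text input, which is not specified in this comment. `IdDictionary` and `importPairs` are hypothetical names, and `IdDictionary` is a standard-library stand-in for a Guava HashBiMap so that the example runs without dependencies.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a HashBiMap[String, Int]: assigns ordinal IDs in
// first-seen order and supports lookup in both directions.
final class IdDictionary {
  private val toId = mutable.HashMap.empty[String, Int]
  private val toLabel = mutable.ArrayBuffer.empty[String]

  // Returns the existing ordinal for a label, or assigns the next one.
  def idFor(label: String): Int =
    toId.getOrElseUpdate(label, { toLabel += label; toLabel.size - 1 })

  // Reverse lookup used on export.
  def labelFor(id: Int): String = toLabel(id)

  def size: Int = toLabel.size
}

// Parses "user,item" lines (an assumed format) into Int-indexed cells while
// building the row and column dictionaries.
def importPairs(lines: Seq[String]): (Seq[(Int, Int)], IdDictionary, IdDictionary) = {
  val rowDict = new IdDictionary
  val colDict = new IdDictionary
  val cells = lines.map { line =>
    val fields = line.split(",").map(_.trim)
    (rowDict.idFor(fields(0)), colDict.idFor(fields(1)))
  }
  (cells, rowDict, colDict)
}

val input = Seq("u1,iphone", "u1,ipad", "u2,ipad")
val (cells, rowDict, colDict) = importPairs(input)

// Export: translate the Int IDs back to the external String IDs.
val exported = cells.map { case (r, c) =>
  s"${rowDict.labelFor(r)},${colDict.labelFor(c)}"
}
exported.foreach(println)
```

With Guava itself, the same round trip would go through `put` on the BiMap during import and `inverse()` during export.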

Note that the Cooccurrence driver is a proposed template for CLI drivers in 
general. The code is being designed to work for any CLI access to Mahout-Spark. 
I'll have it running on the demo site solr-recommender as soon as it's tested 
and before any official Mahout commit, so there is plenty of time to give 
opinions.


> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
>                 Key: MAHOUT-1541
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
>             Project: Mahout
>          Issue Type: Bug
>          Components: CLI
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an 
> IndexedDataset with BiMap ID translation dictionaries, call the Spark 
> CooccurrenceAnalysis with the appropriate params, then write output with 
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy MapReduce code 
> does, but it will also support reading externally defined IDs and flexible 
> formats. Output will be in the legacy format or in text files of the user's 
> specification with reattached item IDs. 
> Support for legacy formats is an open question; users can always use the 
> legacy code if they want this. Internal to the IndexedDataset is a Spark DRM, 
> so pipelining can be accomplished without writing to an actual file, and the 
> legacy sequence file output may not be needed.
> Opinions?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
