[
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989040#comment-13989040
]
Pat Ferrel edited comment on MAHOUT-1541 at 5/4/14 4:54 PM:
------------------------------------------------------------
For something as complicated as an R-like dataframe that's a good approach, and
I did read it.
The sole reason for IndexedDataset in my use is import/export. You'll see the
code in my GitHub in a few days. If the needs match I'll be happy to merge
IndexedDataset and/or this driver and import code with whatever comes out of
MAHOUT-1490.
For now I have an actual need for this code in the solr-recommender running on
the demo site, and the import/export code will have minimal impact on the
internals of IndexedDataset, so I'm going with it purely for expediency.
There is no need for slices by label or the like, so there should be little
duplicated work.
The IndexedDataset is defined as:
{code:language=scala}
/**
 * Wraps a Mahout DrmLike object and includes two BiMaps to store translation
 * dictionaries. This may be replaced with a Mahout DSL dataframe-like object
 * in the future. The primary use of this is for import and export, keeping
 * track of external IDs and preserving them all the way to output.
 *
 * Example: For a transpose job the 'matrix: DrmLike[Int]' is passed into the
 * DSL code that transposes the values, then the dictionaries are swapped and
 * a new IndexedDataset is returned from the job, which will be exported to
 * files using the inverse view of the labels (labels.inverse().get(ID)),
 * thereby preserving the external ID.
 *
 * @param matrix DrmLike[Int], representing the distributed matrix storing the
 *               actual data.
 * @param rowLabels BiMap[String, Int] storing a bidirectional mapping of
 *                  external String ID to and from the ordinal Mahout Int ID.
 *                  This one holds row labels.
 * @param columnLabels BiMap[String, Int] storing a bidirectional mapping of
 *                     external String ID to and from the ordinal Mahout Int
 *                     ID. This one holds column labels.
 */
case class IndexedDataset(
    matrix: DrmLike[Int],
    rowLabels: BiMap[String, Int],
    columnLabels: BiMap[String, Int])
{code}
Note that the BiMaps are actually Java HashBiMaps from Guava. That will be
sufficient for my current needs.
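To make the transpose example in the comment above concrete, here is a minimal
sketch of the dictionary swap and the export with external IDs. It is not the
code from the driver: SimpleBiMap, SparseMatrix and LabeledMatrix are
hypothetical stand-ins for Guava's BiMap, DrmLike[Int] and IndexedDataset, used
only so the sketch runs without Mahout, Spark or Guava on the classpath.

```scala
// Hypothetical stand-in for Guava's BiMap[String, Int] (assumption, not Mahout code).
final case class SimpleBiMap(forward: Map[String, Int]) {
  // Inverse view: ordinal Mahout-style Int ID -> external String ID.
  val inverse: Map[Int, String] = forward.map(_.swap)
  require(inverse.size == forward.size, "IDs must be unique in both directions")
}

// Hypothetical stand-in for DrmLike[Int]: a sparse matrix keyed by (row, column).
final case class SparseMatrix(cells: Map[(Int, Int), Double]) {
  // Transpose by swapping the row and column index of every cell.
  def t: SparseMatrix =
    SparseMatrix(cells.map { case ((r, c), v) => ((c, r), v) })
}

// Hypothetical stand-in for IndexedDataset.
final case class LabeledMatrix(
    matrix: SparseMatrix,
    rowLabels: SimpleBiMap,
    columnLabels: SimpleBiMap) {

  // Transpose the values and swap the dictionaries, as the comment describes.
  def transpose: LabeledMatrix =
    LabeledMatrix(matrix.t, columnLabels, rowLabels)

  // Export as "rowID<TAB>columnID<TAB>value" lines using the external IDs.
  def toLines: Seq[String] =
    matrix.cells.toSeq.sortBy(_._1).map { case ((r, c), v) =>
      s"${rowLabels.inverse(r)}\t${columnLabels.inverse(c)}\t$v"
    }
}

object TransposeExample {
  val users = SimpleBiMap(Map("u1" -> 0, "u2" -> 1))
  val items = SimpleBiMap(Map("iphone" -> 0, "nexus" -> 1))
  val data = LabeledMatrix(
    SparseMatrix(Map((0, 1) -> 1.0, (1, 0) -> 2.0)), users, items)
  // After the transpose, items label the rows and users label the columns.
  val flipped = data.transpose
}
```

The point of the design is that the matrix code only ever sees Int IDs; the
external String IDs travel alongside in the dictionaries and are reattached
only at export time.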
Note that the Cooccurrence driver is a proposed template for CLI drivers in
general. The code is being designed to work for any CLI access to Mahout-Spark.
I'll have it running on the demo site solr recommender as soon as it's tested
out, and before any official Mahout commit, so there is plenty of time to give
opinions.
> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
> Key: MAHOUT-1541
> URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> Project: Mahout
> Issue Type: Bug
> Components: CLI
> Reporter: Pat Ferrel
> Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an
> IndexedDataset with BiMap ID translation dictionaries, call the Spark
> CooccurrenceAnalysis with the appropriate params, then write output with
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy MR code does, but
> will also support reading externally defined IDs and flexible formats. Output
> will be of the legacy format or text files of the user's specification with
> reattached Item IDs.
> Support for legacy formats is an open question; users can always use the
> legacy code if they want this. Internal to the IndexedDataset is a Spark DRM,
> so pipelining can be accomplished without writing to an actual file, and the
> legacy sequence file output may not be needed.
> Opinions?
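
The import step in the description above can be sketched as follows. This is
only an illustration under stated assumptions, not the driver's code:
ImportSketch, Imported and readInteractions are hypothetical names, the input is
assumed to be delimited text with the row ID in the first field and the column
ID in the second, every interaction is stored as 1.0, and plain Scala Maps stand
in for the BiMap dictionaries and the DRM.

```scala
object ImportSketch {
  // External String ID -> ordinal Int ID (stand-in for a BiMap dictionary).
  type Dict = Map[String, Int]

  // Look up an external ID, assigning the next ordinal if it is new.
  private def idFor(dict: Dict, key: String): (Dict, Int) =
    dict.get(key) match {
      case Some(id) => (dict, id)
      case None     => (dict + (key -> dict.size), dict.size)
    }

  // Result of an import: sparse cells plus the two translation dictionaries.
  final case class Imported(cells: Map[(Int, Int), Double], rows: Dict, cols: Dict)

  // Parse "rowID<delimiter>columnID" lines; blank lines are skipped.
  def readInteractions(lines: Seq[String], delimiter: String = ","): Imported =
    lines.map(_.trim).filter(_.nonEmpty)
      .foldLeft(Imported(Map.empty, Map.empty, Map.empty)) { (acc, line) =>
        // Quote the delimiter so characters like "|" are not treated as regex.
        val fields = line.split(java.util.regex.Pattern.quote(delimiter)).map(_.trim)
        require(fields.length >= 2, s"expected at least two fields: $line")
        val (rows, r) = idFor(acc.rows, fields(0))
        val (cols, c) = idFor(acc.cols, fields(1))
        Imported(acc.cells + ((r, c) -> 1.0), rows, cols)
      }
}
```

Calling ImportSketch.readInteractions(Seq("u1,iphone", "u2,nexus")) assigns
ordinals in order of first appearance, so the dictionaries can later be used to
reattach the external IDs on output.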