[
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13991013#comment-13991013
]
Pat Ferrel commented on MAHOUT-1541:
------------------------------------
I have a partial implementation of the CLI driver, Importer, IndexedDataset and
some tests.
The design objective is to support all Mahout V2 CLIs for black-box-type jobs,
similar to the legacy CLI, but to allow more flexible text file import/export
that maintains user-specified IDs.
Anyone interested, please take a look at the GitHub repo and its wiki here:
https://github.com/pferrel/harness/wiki
I would greatly appreciate comments. This is a very early version and was
shamelessly stolen from some examples Sebastian provided. It does actually run
the cross-cooccurrence Spark code and display example output. It reads from a
text-delimited file but there is only console output at present. Most options
are not implemented yet because I'd like to get feedback now.
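
To make the ID handling concrete, here is a rough sketch of the idea of importing
delimited text while keeping user-specified IDs recoverable. It is plain Python,
not the harness code and not Mahout's API: the names IdDictionary,
import_interactions and export_interactions are made up for illustration, and
the input is assumed to be one "user,item" pair per line.

```python
from typing import Dict, Iterable, List, Tuple


class IdDictionary:
    """Hypothetical two-way map between external string IDs and dense indices."""

    def __init__(self) -> None:
        self._to_index: Dict[str, int] = {}
        self._to_id: List[str] = []

    def index_of(self, external_id: str) -> int:
        """Return the index for an ID, assigning the next free index if it is new."""
        index = self._to_index.get(external_id)
        if index is None:
            index = len(self._to_id)
            self._to_index[external_id] = index
            self._to_id.append(external_id)
        return index

    def id_of(self, index: int) -> str:
        """Return the external ID that was assigned the given index."""
        return self._to_id[index]

    def __len__(self) -> int:
        return len(self._to_id)


def import_interactions(
    lines: Iterable[str], delimiter: str = ","
) -> Tuple[List[Tuple[int, int]], IdDictionary, IdDictionary]:
    """Parse "user<delimiter>item" lines into (row, column) index pairs.

    Assumes every non-blank line has at least two fields; extra fields are ignored.
    """
    users, items = IdDictionary(), IdDictionary()
    cells: List[Tuple[int, int]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, item_id = line.split(delimiter)[:2]
        cells.append((users.index_of(user_id.strip()), items.index_of(item_id.strip())))
    return cells, users, items


def export_interactions(
    cells: Iterable[Tuple[int, int]],
    users: IdDictionary,
    items: IdDictionary,
    delimiter: str = ",",
) -> List[str]:
    """Reattach the external IDs to (row, column) index pairs."""
    return [f"{users.id_of(row)}{delimiter}{items.id_of(col)}" for row, col in cells]
```

The integer pairs stand in for the matrix the analysis would consume; the two
dictionaries are what allow output to be written with the original IDs
reattached.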
> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
> Key: MAHOUT-1541
> URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> Project: Mahout
> Issue Type: Bug
> Components: CLI
> Reporter: Pat Ferrel
> Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an
> IndexedDataset with BiMap ID translation dictionaries, call the Spark
> CooccurrenceAnalysis with the appropriate params, then write output with
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy MapReduce code does,
> but will also support reading externally defined IDs and flexible formats.
> Output will be in the legacy format or in text files of the user's
> specification with reattached item IDs.
> Support for legacy formats is an open question; users can always use the
> legacy code if they want this. Internal to the IndexedDataset is a Spark DRM,
> so pipelining can be accomplished without writing to an actual file, and the
> legacy sequence file output may not be needed.
> Opinions?
--
This message was sent by Atlassian JIRA
(v6.2#6252)