[
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988735#comment-13988735
]
Pat Ferrel edited comment on MAHOUT-1541 at 5/3/14 4:21 PM:
------------------------------------------------------------
Agreed (mostly); only the CLI is custom for each algo.
The preprocessor was a remnant of your old example patch and isn't meant to be
repeated. Not planning to have separate code for every algo at all; in fact it
should be quite the opposite. There will be a custom CLI for each algo and one
of a couple of customizable but general purpose importer/exporters (text
delimited, sequencefile?) with some method of specifying input and output
schema.
The IndexedDataset would be identical in structure in all cases. Should have
some of the IndexedDataset improvements (mostly BiMaps) today and I'm willing
to merge them with some other dataframe in the future.
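For illustration only (this is not the patch code): a minimal Python sketch of
the idea, assuming a schema that is just a delimiter plus two column positions,
and two-way ID dictionaries standing in for the BiMaps. All names here
(IdDictionary, read_delimited, row_col, item_col) are hypothetical and are not
Mahout APIs.

```python
from dataclasses import dataclass, field


@dataclass
class IdDictionary:
    """Two-way map between external string IDs and dense integer indices."""
    to_index: dict = field(default_factory=dict)
    to_external: list = field(default_factory=list)

    def index_of(self, external_id: str) -> int:
        # Assign the next free index the first time an ID is seen.
        if external_id not in self.to_index:
            self.to_index[external_id] = len(self.to_external)
            self.to_external.append(external_id)
        return self.to_index[external_id]

    def external_of(self, index: int) -> str:
        # Reverse lookup, used to reattach external IDs on output.
        return self.to_external[index]


def read_delimited(lines, delimiter=",", row_col=0, item_col=1):
    """Parse delimited text into (row, column) index pairs plus both dictionaries.

    The "schema" is assumed to be a delimiter and two column positions.
    """
    rows, cols = IdDictionary(), IdDictionary()
    cells = []
    for line in lines:
        fields = line.strip().split(delimiter)
        cells.append((rows.index_of(fields[row_col]),
                      cols.index_of(fields[item_col])))
    return cells, rows, cols


lines = ["u1,iphone", "u2,ipad", "u1,ipad"]
cells, users, items = read_delimited(lines)

# Reattach external IDs when writing results back out.
restored = [(users.external_of(r), items.external_of(c)) for r, c in cells]
```

The index pairs are what an algorithm would consume as matrix coordinates; the
dictionaries travel with them so that only the importer and exporter ever see
external IDs.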
What I am doing is exactly what we agreed to in MAHOUT-1518. There is another
Jira about dataframes but I wasn't aware of any progress made on it. Don't want
to "wait"; I only have limited windows of time, and if I wait I may get nothing
done. And I could use this right now to rebuild the solr recommender and the
other Mahout recommenders. This work seems at worst independent of some other
R-like dataframe, or at best can be integrated as that solidifies.
In the meantime any suggestions about using another effort, like some usable
dataframe-ish object, are fine. I had thought we'd convinced ourselves that the
needs of an R-like dataframe and an import/export IndexedDataset were too
different. Dmitriy certainly made strong arguments to that effect.
Just using the cooccurrence analysis to have an end to end example.
BTW do we really need to support sequencefiles where the legacy code does?
was (Author: pferrel):
Agreed (mostly); only the CLI is custom for each algo.
The preprocessor was a remnant of your old example patch. Not planning to have
separate code for every algo at all; in fact it should be quite the opposite.
There will be a custom CLI for each algo and one of a couple of customizable but
general purpose importer/exporters (text delimited, sequencefile?) with some
method of specifying input and output schema.
The IndexedDataset would be identical in structure in all cases. Should have
some of the IndexedDataset improvements (mostly BiMaps) today and I'm willing
to merge them with some other dataframe in the future.
What I am doing is exactly what we agreed to in MAHOUT-1518. There is another
Jira about dataframes but I wasn't aware of any progress made on it. Don't want
to "wait"; I only have limited windows of time, and if I wait I may get nothing
done. And I could use this right now to rebuild the solr recommender and the
other Mahout recommenders. This work seems at worst independent of some other
R-like dataframe, or at best can be integrated as that solidifies.
In the meantime any suggestions about using another effort, like some usable
dataframe-ish object, are fine. I had thought we'd convinced ourselves that the
needs of an R-like dataframe and an import/export IndexedDataset were too
different. Dmitriy certainly made strong arguments to that effect.
Just using the cooccurrence analysis to have an end to end example.
BTW do we really need to support sequencefiles where the legacy code does?
> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
> Key: MAHOUT-1541
> URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> Project: Mahout
> Issue Type: Bug
> Components: CLI
> Reporter: Pat Ferrel
> Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an
> IndexedDataset with BiMap ID translation dictionaries, call the Spark
> CooccurrenceAnalysis with the appropriate params, then write output with
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy MapReduce code does
> but will also support reading externally defined IDs and flexible formats.
> Output will be in the legacy format or in text files of the user's
> specification with reattached Item IDs.
> Support for legacy formats is an open question; users can always use the
> legacy code if they want this. Internal to the IndexedDataset is a Spark DRM,
> so pipelining can be accomplished without writing to an actual file, and the
> legacy sequence file output may not be needed.
> Opinions?