[jira] [Comment Edited] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis

Pat Ferrel (JIRA) Sat, 03 May 2014 09:24:06 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988735#comment-13988735
 ]


Pat Ferrel edited comment on MAHOUT-1541 at 5/3/14 4:21 PM:
------------------------------------------------------------

Agreed (mostly); only the CLI is custom for each algo.

The preprocessor was a remnant of your old example patch and isn't meant to be 
repeated. Not planning to have separate code for every algo at all; in fact it 
should be quite the opposite. There will be a custom CLI for each algo and one 
of a couple of customizable but general-purpose importer/exporters (text 
delimited, sequencefile?) with some method of specifying input and output 
schema. 
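
To make the "general purpose importer/exporter with a schema" idea concrete, 
here is a minimal sketch. It is plain Python standing in for the eventual 
Scala/Spark code, and every name in it (DelimitedSchema, read_interactions, 
format_interaction) is hypothetical, not an existing Mahout API. It assumes 
the schema only needs a delimiter and the positions of the two ID columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelimitedSchema:
    """Hypothetical schema: the delimiter and which columns hold the IDs."""
    delimiter: str = ","
    row_id_column: int = 0      # e.g. the user ID
    column_id_column: int = 1   # e.g. the item ID


def read_interactions(lines, schema):
    """Yield (row_id, column_id) pairs of external string IDs from text lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(schema.delimiter)
        yield (fields[schema.row_id_column].strip(),
               fields[schema.column_id_column].strip())


def format_interaction(row_id, column_id, schema):
    """Format one pair as a text line (row ID first) using the schema's delimiter."""
    return schema.delimiter.join((row_id, column_id))
```

The same reader then handles tab-delimited "user, item" files and comma-
delimited logs with extra columns just by passing a different schema, which 
is the sense in which only the CLI would need to be custom per algo.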

The IndexedDataset would be identical in structure in all cases. Should have 
some of the IndexedDataset improvements (mostly BiMaps) today and I'm willing 
to merge them with some other dataframe in the future. 
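
Roughly, the structure I mean looks like the sketch below. Again this is 
plain Python for illustration only: the class and method names are made up, 
the BiMap is a hand-rolled stand-in, and a set of (row, column) index pairs 
stands in for the Spark DRM that the real IndexedDataset would hold.

```python
class BiMap:
    """Minimal two-way dictionary: external string ID <-> internal int index."""

    def __init__(self):
        self._to_index = {}   # external ID -> index
        self._to_id = []      # index -> external ID

    def index_of(self, external_id):
        """Return the index for external_id, assigning the next free one if new."""
        index = self._to_index.get(external_id)
        if index is None:
            index = len(self._to_id)
            self._to_index[external_id] = index
            self._to_id.append(external_id)
        return index

    def id_of(self, index):
        """Return the external ID that was assigned the given index."""
        return self._to_id[index]

    def __len__(self):
        return len(self._to_id)


class IndexedDataset:
    """Sparse matrix cells plus row and column ID dictionaries (illustrative)."""

    def __init__(self):
        self.row_ids = BiMap()
        self.column_ids = BiMap()
        self.cells = set()    # (row_index, column_index); stand-in for the DRM

    @classmethod
    def from_pairs(cls, pairs):
        """Build a dataset from (row_id, column_id) pairs of external IDs."""
        dataset = cls()
        for row_id, column_id in pairs:
            dataset.cells.add((dataset.row_ids.index_of(row_id),
                               dataset.column_ids.index_of(column_id)))
        return dataset

    def external_cells(self):
        """Return the cells with external IDs reattached, sorted for stable output."""
        return sorted((self.row_ids.id_of(r), self.column_ids.id_of(c))
                      for r, c in self.cells)
```

The algorithm only ever sees the integer indices; the two dictionaries are 
used on the way in to translate IDs and on the way out to reattach them, so 
the structure is the same whatever the algo is.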

What I am doing is exactly what we agreed to in MAHOUT-1518. There is another 
Jira about dataframes but I wasn't aware of any progress made on it. I don't 
want to "wait": I only have limited windows of time, and if I wait I may get 
nothing done. And I could use this right now to rebuild the solr recommender 
and the other Mahout recommenders. This work seems at worst independent of 
some other R-like dataframe, or at best can be integrated as that solidifies.

In the meantime any suggestions about using another effort like some usable 
dataframe-ish object are fine. I had thought we'd convinced ourselves that the 
needs of an R-like dataframe and an import/export IndexedDataset were too 
different. Dmitriy certainly made strong arguments to that effect.

Just using the cooccurrence analysis to have an end-to-end example.

BTW do we really need to support sequencefiles where the legacy code does?



was (Author: pferrel):
Agreed (mostly), only the CLI is custom for each algo.

The preprocessor was a remnant of your old example patch. Not planning to have 
separate code for every algo at all, in fact it should be quite the opposite. 
There will be a custom CLI for each algo and one of a couple customizable but 
general purpose importer/exporters (text delimited, sequencefile?) with some 
method of specifying input and output schema. 

The IndexedDataset would be identical in structure in all cases. Should have 
some of the IndexedDataset improvements (mostly BiMaps) today and I'm willing 
to merge them with some other dataframe in the future. 

What I am doing is exactly what we agreed to in MAHOUT-1518. There is another 
Jira about dataframes but I wasn't aware of any progress made on it. Don't want 
to "wait" I only have limited time in windows, if I wait I may get nothing 
done. And I could use this right now to rebuild the solr recommender and the 
other Mahout recommenders. This work seems at worst independent of some other 
r-like dataframe, or at best can be integrated as that solidifies.

In the meantime any suggestions about using another effort like some usable 
dataframe-ish object is fine. I had thought we'd convinced ourselves that the 
needs of an r-like dataframe and an import/export IndexedDataset were too 
different. Dmitriy certainly made strong arguments to that effect.

Just using the cooccurrence analysis to have an end to end example.

BTW do we really need to support sequencefiles where the legacy code does?


> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
>                 Key: MAHOUT-1541
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
>             Project: Mahout
>          Issue Type: Bug
>          Components: CLI
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an 
> IndexedDataset with BiMap ID translation dictionaries, call the Spark 
> CooccurrenceAnalysis with the appropriate params, then write output with 
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy mr does but will 
> support reading externally defined IDs and flexible formats. Output will be 
> of the legacy format or text files of the user's specification with 
> reattached Item IDs. 
> Support for legacy formats is a question, users can always use the legacy 
> code if they want this. Internal to the IndexedDataset is a Spark DRM so 
> pipelining can be accomplished without any writing to an actual file so the 
> legacy sequence file output may not be needed.
> Opinions?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
