[jira] [Commented] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis

Saikat Kanjilal (JIRA) Sat, 03 May 2014 22:23:20 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988916#comment-13988916
 ]


Saikat Kanjilal commented on MAHOUT-1541:
-----------------------------------------

Pat,
Just one comment on the "no-progress around the dataframes JIRA": I assume you 
are referring to 1490. There is in fact quite a bit of progress on presenting 
APIs for a set of generic operations on a dataFrame. Based on Dmitry's 
recommendation, I took the path of writing a proposal first rather than 
blasting off and writing code that might be heavily criticized and fall short 
of committable expectations. This way the design will be in place and have 
general consensus before any coding begins. I'd love to get feedback from you 
and others to move 1490 along; please see the blog and comment on the JIRA if 
you'd like.

Regards

> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
>                 Key: MAHOUT-1541
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
>             Project: Mahout
>          Issue Type: Bug
>          Components: CLI
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an 
> IndexedDataset with BiMap ID translation dictionaries, call the Spark 
> CooccurrenceAnalysis with the appropriate params, then write output with 
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy MR code does, but 
> it will also support reading externally defined IDs and flexible formats. 
> Output will be in the legacy format or in text files of the user's 
> specification with reattached item IDs.
> Support for legacy formats is an open question; users can always use the 
> legacy code if they want this. Internal to the IndexedDataset is a Spark 
> DRM, so pipelining can be accomplished without writing to an actual file, 
> and the legacy sequence file output may not be needed.
> Opinions?
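
For readers unfamiliar with the ID-translation idea in the description above, here is a rough, single-machine sketch in Python. It is not the Mahout implementation: the names `BiDict`, `parse_interactions`, `cooccurrence_counts`, and `with_external_ids` are hypothetical, the input format (one `user,item` pair per line) is an assumption, and plain co-occurrence counting stands in for the Spark CooccurrenceAnalysis, which is not reproduced here. It only illustrates mapping external IDs to dense indices, computing on the indices, and reattaching the external IDs on output.

```python
from collections import defaultdict
from itertools import combinations


class BiDict:
    """Hypothetical two-way map between external string IDs and dense int indices."""

    def __init__(self):
        self._to_index = {}  # external ID -> index
        self._to_id = []     # index -> external ID

    def index_of(self, external_id):
        # Assign the next free index the first time an ID is seen.
        if external_id not in self._to_index:
            self._to_index[external_id] = len(self._to_id)
            self._to_id.append(external_id)
        return self._to_index[external_id]

    def id_of(self, index):
        return self._to_id[index]

    def __len__(self):
        return len(self._to_id)


def parse_interactions(lines, delimiter=","):
    """Read 'user,item' lines (assumed format) into index pairs plus both dictionaries."""
    users, items = BiDict(), BiDict()
    pairs = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user_id, item_id = line.split(delimiter)[:2]
        pairs.add((users.index_of(user_id.strip()), items.index_of(item_id.strip())))
    return pairs, users, items


def cooccurrence_counts(pairs):
    """Count, for each item pair, how many users interacted with both (raw counts only)."""
    items_by_user = defaultdict(set)
    for user, item in pairs:
        items_by_user[user].add(item)
    counts = defaultdict(int)
    for item_set in items_by_user.values():
        for a, b in combinations(sorted(item_set), 2):
            counts[(a, b)] += 1
    return counts


def with_external_ids(counts, items):
    """Reattach external item IDs to the index-keyed result."""
    return {(items.id_of(a), items.id_of(b)): n for (a, b), n in counts.items()}


lines = ["u1,iphone", "u1,ipad", "u2,iphone", "u2,ipad", "u2,nexus"]
pairs, users, items = parse_interactions(lines)
result = with_external_ids(cooccurrence_counts(pairs), items)
```

Keeping the dictionaries next to the index-keyed data is what lets the computation stay in memory end to end and only translate back to external IDs at the point of writing output.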



--
This message was sent by Atlassian JIRA
(v6.2#6252)
