[
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988916#comment-13988916
]
Saikat Kanjilal commented on MAHOUT-1541:
-----------------------------------------
Pat,
Just one comment on the "no-progress around the dataframes JIRA". I assume you
are referring to 1490. There is in fact quite a bit of progress: the proposal
presents APIs for a set of generic operations on a dataFrame. Based on Dmitry's
recommendation, I took the path of writing a proposal rather than blasting off
and writing code that might then be heavily criticized and fall short of
committable expectations. This way the design will be in place and have
general consensus before any coding efforts begin. I'd love to get feedback
from you and others to move 1490 along; please see the blog and comment on the
JIRA if you'd like.
Regards
> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
> Key: MAHOUT-1541
> URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> Project: Mahout
> Issue Type: Bug
> Components: CLI
> Reporter: Pat Ferrel
> Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an
> IndexedDataset with BiMap ID translation dictionaries, call the Spark
> CooccurrenceAnalysis with the appropriate params, then write output with
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy MapReduce (mr) code
> does, but it will also support externally defined IDs and flexible formats.
> Output will be in the legacy format or in text files of the user's
> specification with reattached item IDs.
> Whether to support legacy formats is an open question; users can always use
> the legacy code if they want them. Internal to the IndexedDataset is a Spark
> DRM, so pipelining can be accomplished without writing to an actual file,
> and the legacy sequence file output may not be needed.
> Opinions?
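
For readers following the description above, here is a rough sketch of the
flow it outlines: read delimited text with external IDs, translate them
through two-way dictionaries, run an analysis on integer indices, then
reattach the external item IDs on output. It is plain Python, not Mahout or
Spark code. Every name in it (BiDict, read_interactions, cooccurrence_counts,
write_with_external_ids) is hypothetical, and raw pair counting stands in for
the real CooccurrenceAnalysis, which this does not reproduce.

```python
from collections import defaultdict
from itertools import combinations


class BiDict:
    """Hypothetical two-way map between external string IDs and dense ints."""

    def __init__(self):
        self.to_index = {}
        self.to_external = []

    def index_of(self, external_id):
        # Assign the next free index the first time an ID is seen.
        if external_id not in self.to_index:
            self.to_index[external_id] = len(self.to_external)
            self.to_external.append(external_id)
        return self.to_index[external_id]

    def external_of(self, index):
        return self.to_external[index]


def read_interactions(lines, delimiter=","):
    """Parse 'user<delim>item' lines into index pairs plus both dictionaries."""
    users, items = BiDict(), BiDict()
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, item = line.split(delimiter)[:2]
        pairs.append((users.index_of(user), items.index_of(item)))
    return pairs, users, items


def cooccurrence_counts(pairs):
    """Raw item-item cooccurrence counts (a stand-in, not Mahout's algorithm)."""
    items_by_user = defaultdict(set)
    for user, item in pairs:
        items_by_user[user].add(item)
    counts = defaultdict(int)
    for seen in items_by_user.values():
        for a, b in combinations(sorted(seen), 2):
            counts[(a, b)] += 1
    return dict(counts)


def write_with_external_ids(counts, items):
    """Reattach external item IDs to each counted pair, as sorted text lines."""
    return sorted(
        f"{items.external_of(a)}\t{items.external_of(b)}\t{n}"
        for (a, b), n in counts.items()
    )


lines = ["u1,iphone", "u1,ipad", "u2,iphone", "u2,ipad", "u2,nexus"]
pairs, users, items = read_interactions(lines)
output = write_with_external_ids(cooccurrence_counts(pairs), items)
```

Because the dictionaries are kept alongside the index pairs, the intermediate
result never needs to be written to a file before the IDs are reattached,
which is the pipelining point raised in the description.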
--
This message was sent by Atlassian JIRA
(v6.2#6252)