[ https://issues.apache.org/jira/browse/MAHOUT-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389283#comment-14389283 ]
Suneel Marthi commented on MAHOUT-1663:
---------------------------------------
It's fine to port the existing seq2sparse to Spark for 0.10.1 so as to have
a complete pipeline. In the long term we need to rethink how we want to do this.
seq2sparse was the big bottleneck in the legacy MR pipeline, not to mention
that there was no way to incrementally update the term vectors for new
streaming documents.
There have been discussions in the past about maybe using finite state
automata (which have come with Lucene since 4.0), or Word2Vec, etc. See the
discussion in https://issues.apache.org/jira/browse/MAHOUT-1252
> Port seq2sparse to the Mahout spark-scala Environment
> -----------------------------------------------------
>
> Key: MAHOUT-1663
> URL: https://issues.apache.org/jira/browse/MAHOUT-1663
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.9
> Reporter: Andrew Palumbo
> Assignee: Gokhan Capan
> Labels: scala, spark
> Fix For: 0.10.1
>
>
> Implement a scala version of seq2sparse in the spark module. This effort is
> currently in progress.