[ 
https://issues.apache.org/jira/browse/MAHOUT-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389283#comment-14389283
 ] 

Suneel Marthi commented on MAHOUT-1663:
---------------------------------------

It's fine to port the existing seq2sparse to Spark for 0.10.1 so that we have 
a complete pipeline.  In the long term we need to rethink how we want to do 
this. seq2sparse was the big bottleneck in the legacy MR pipeline, not to 
mention that there was no way to incrementally update the term vectors for new 
streaming documents.

There have been discussions in the past about maybe using a Finite State 
Automaton (which has come with Lucene since 4.0), Word2Vec, etc. See the 
discussion in https://issues.apache.org/jira/browse/MAHOUT-1252



> Port seq2sparse to the Mahout spark-scala Environment
> -----------------------------------------------------
>
>                 Key: MAHOUT-1663
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1663
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.9
>            Reporter: Andrew Palumbo
>            Assignee: Gokhan Capan
>              Labels: scala, spark
>             Fix For: 0.10.1
>
>
> Implement a scala version of seq2sparse in the spark module.  This effort is 
> currently in progress.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
