[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632052#comment-14632052
 ] 

Joseph K. Bradley commented on SPARK-5571:
------------------------------------------

Stemmer: We'll need to be careful about adding dependencies on other libraries.
We strongly prefer avoiding that if possible.  If code can be copied and
modified (assuming the license is friendly to copying), that might be
preferable if the code is relatively simple.

Stopwords: Sounds good.

LDA.runText: I'd prefer that this handle everything automatically: a user gives
an unfiltered corpus and LDA handles it.  This probably requires a quick design
doc, since I have not thought through the complexities.

Pipeline: I agree this might work well under the Pipelines API.  Here's what I 
propose:
* For now, we focus on adding the necessary transformers individually: stemmer, 
stopwords filter.
* For the next release, we design a good way to provide this functionality 
under Pipelines.

If that sounds good, we can create & link JIRAs for those transformers, and 
I'll move the target version for this JIRA to 1.6.  What do you think?
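
To make the "individual transformers" idea concrete, here is a minimal sketch in plain Python. It is not Spark code and not a proposed API: the names remove_stopwords, naive_stem, chain, and the tiny STOPWORDS list are all hypothetical, and the stemmer is a crude suffix stripper standing in for a real algorithm. It only shows that each step can be a standalone tokens-to-tokens function that composes with the others.

```python
from typing import Callable, List

Tokens = List[str]
Transformer = Callable[[Tokens], Tokens]

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "is", "in"})


def remove_stopwords(tokens: Tokens) -> Tokens:
    """Drop tokens found in STOPWORDS, compared case-insensitively."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def naive_stem(tokens: Tokens) -> Tokens:
    """Placeholder stemmer: lowercases and strips one common suffix.

    This is NOT a real stemming algorithm such as Porter's.
    """
    suffixes = ("ing", "ed", "s")
    stemmed = []
    for token in tokens:
        word = token.lower()
        for suffix in suffixes:
            # Only strip when at least three characters of stem remain.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        stemmed.append(word)
    return stemmed


def chain(*steps: Transformer) -> Transformer:
    """Compose transformers left to right into a single transformer."""
    def run(tokens: Tokens) -> Tokens:
        for step in steps:
            tokens = step(tokens)
        return tokens
    return run


preprocess = chain(remove_stopwords, naive_stem)
```

Because each step has the same shape, moving them under a pipeline abstraction later should mostly be a matter of wrapping, not rewriting.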

> LDA should handle text as well
> ------------------------------
>
>                 Key: SPARK-5571
>                 URL: https://issues.apache.org/jira/browse/SPARK-5571
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently operates only on vectors of word 
> counts.  It should also support training and prediction using text 
> (Strings).
> This plan is sketched in the [original LDA design 
> doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing].
> There should be:
> * runWithText() method which takes an RDD with a collection of Strings (bags 
> of words).  This will also index terms and compute a dictionary.
> * dictionary parameter for when LDA is run with word count vectors
> * prediction/feedback methods returning Strings (such as 
> describeTopicsAsStrings, which is commented out in LDA currently)
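
As a rough illustration of the bullets above, the following plain-Python sketch indexes terms, builds a dictionary, converts bags of words into count vectors, and maps term indices back to Strings. It is an assumption-laden sketch, not the Spark implementation: the function names (build_dictionary, to_count_vector, describe_topic_as_strings) and the dense list representation are hypothetical, and the topic weights in any usage would come from a trained model.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def build_dictionary(docs: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Assign each distinct term an index (sorted for determinism)."""
    vocab = sorted({term for doc in docs for term in doc})
    return {term: index for index, term in enumerate(vocab)}


def to_count_vector(doc: Sequence[str], dictionary: Dict[str, int]) -> List[int]:
    """Turn one bag of words into a dense word-count vector.

    Terms missing from the dictionary are ignored.
    """
    vector = [0] * len(dictionary)
    for term, count in Counter(doc).items():
        index = dictionary.get(term)
        if index is not None:
            vector[index] += count
    return vector


def describe_topic_as_strings(
    term_weights: Sequence[Tuple[int, float]],
    dictionary: Dict[str, int],
) -> List[Tuple[str, float]]:
    """Replace term indices with their Strings using the dictionary."""
    index_to_term = {index: term for term, index in dictionary.items()}
    return [(index_to_term[index], weight) for index, weight in term_weights]
```

For example, with the documents [["spark", "lda", "spark"], ["topic", "lda"]], the dictionary maps lda to 0, spark to 1, and topic to 2, and the first document becomes the vector [1, 2, 0].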



