[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632052#comment-14632052 ]
Joseph K. Bradley commented on SPARK-5571:
------------------------------------------

Stemmer: We'll need to be careful about adding dependencies on other libraries. We strongly prefer avoiding that if possible. If the code is relatively simple and the license is friendly to copying, copying and modifying it might be preferable.

Stopwords: Sounds good.

LDA.runText: I'd prefer this handle everything automatically: a user gives an unfiltered corpus and LDA handles it. This probably requires a quick design doc, since I have not thought through the complexities.

Pipeline: I agree this might work well under the Pipelines API.

Here's what I propose:
* For now, we focus on adding the necessary transformers individually: stemmer, stopwords filter.
* For the next release, we design a good way to provide this functionality under Pipelines.

If that sounds good, we can create & link JIRAs for those transformers, and I'll move the target version for this JIRA to 1.6.

What do you think?

> LDA should handle text as well
> ------------------------------
>
>                 Key: SPARK-5571
>                 URL: https://issues.apache.org/jira/browse/SPARK-5571
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>   Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently operates only on vectors of word
> counts. It should also support training and prediction using text
> (Strings).
> This plan is sketched in the [original LDA design
> doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing].
> There should be:
> * runWithText() method which takes an RDD with a collection of Strings (bags
> of words). This will also index terms and compute a dictionary.
> * dictionary parameter for when LDA is run with word count vectors
> * prediction/feedback methods returning Strings (such as
> describeTopicsAsStrings, which is commented out in LDA currently)
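
To make the quoted proposal concrete, here is a rough sketch of what "index terms and compute a dictionary" could involve, together with the stopword filtering discussed in the comment. It is plain Python rather than Spark code, and it is not the MLlib API: the stopword list and the function names (tokenize, build_dictionary, to_count_vector, describe_topic_as_strings) are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "is", "on"}


def tokenize(doc):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in doc.lower().split() if t not in STOPWORDS]


def build_dictionary(docs):
    """Map each distinct term to an integer index (terms in sorted order)."""
    vocab = sorted({t for doc in docs for t in tokenize(doc)})
    return {term: i for i, term in enumerate(vocab)}


def to_count_vector(doc, dictionary):
    """Return a sparse word-count vector as {term_index: count}."""
    counts = Counter(tokenize(doc))
    # Terms missing from the dictionary are silently skipped.
    return {dictionary[t]: c for t, c in counts.items() if t in dictionary}


def describe_topic_as_strings(term_weights, dictionary, k=3):
    """Turn {term_index: weight} into the top-k (term, weight) pairs."""
    index_to_term = {i: t for t, i in dictionary.items()}
    top = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return [(index_to_term[i], w) for i, w in top]
```

The same dictionary serves both directions: it turns text into the word-count vectors LDA already accepts, and it lets topic output be reported as Strings instead of term indices.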