[
https://issues.apache.org/jira/browse/SPARK-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098737#comment-15098737
]
Crawdaddy commented on SPARK-10809:
-----------------------------------
With a 100K-document / 200K-feature model with K = 250, even this
single-document topicDistributions method takes 40s (!) on my 12-core X5680
Dell.
This is with Spark 1.6 compiled with netlib and the native BLAS library
(OpenBLAS compiled for the X56xx architecture).
The killer is in the first line:
{code:title=LDAModel.scala : topicDistribution|borderStyle=solid}
val expElogbeta =
exp(LDAUtils.dirichletExpectation(topicsMatrix.toBreeze.toDenseMatrix.t).t)
{code}
I don't see a reason expElogbeta can't be precomputed outside the method,
since it depends only on topicsMatrix and not on the input Vector. I made a
little method to do that:
{code}
def getExpElogbeta(): BDM[Double] = {
  exp(LDAUtils.dirichletExpectation(topicsMatrix.toBreeze.toDenseMatrix.t).t)
}
{code}
then modified topicDistribution to take it in as a method parameter:
{code}
def topicDistribution(document: Vector, expElogbeta: BDM[Double]): Vector = {
  ...
}
{code}
Now my predictions go from 40s to 150ms. That's more like it (though I hope I
can make it even faster - that's still slow in my world).
I'm new to Scala/Spark/MLlib so I didn't include a patch, but maybe [~yuhaoyan]
can review and suggest the most versatile implementation of this idea?
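One versatile shape for this might be a lazy val on the model, so callers don't have to thread the matrix through every call. Here is a minimal, self-contained sketch of that caching pattern - the object, toy topicsMatrix, and simplified scoring are all hypothetical stand-ins, not the real LocalLDAModel API or the actual variational inference:
{code:title=Sketch: cache the document-independent matrix|borderStyle=solid}
// Hypothetical sketch of the caching idea, not Spark code.
object TopicInferenceSketch {
  // Toy stand-in for topicsMatrix: vocabSize x k term-topic weights.
  val vocabSize = 4
  val k = 2
  val topicsMatrix: Array[Array[Double]] =
    Array(Array(1.0, 2.0), Array(3.0, 1.0), Array(2.0, 2.0), Array(1.0, 4.0))

  // Computed once, on first use, then reused by every query.
  // Placeholder for exp(LDAUtils.dirichletExpectation(...)).
  lazy val expElogbeta: Array[Array[Double]] =
    topicsMatrix.map(row => row.map(math.exp))

  // Toy per-document scoring: weight each topic by term counts, normalize.
  def topicDistribution(termCounts: Array[Double]): Array[Double] = {
    val scores = Array.tabulate(k) { j =>
      (0 until vocabSize).map(w => termCounts(w) * expElogbeta(w)(j)).sum
    }
    val total = scores.sum
    scores.map(_ / total)
  }
}
{code}
With a lazy val the expensive work runs at most once per model instance, which matches the observation above that only the per-document part of topicDistribution needs to run per query.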
> Single-document topicDistributions method for LocalLDAModel
> -----------------------------------------------------------
>
> Key: SPARK-10809
> URL: https://issues.apache.org/jira/browse/SPARK-10809
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Joseph K. Bradley
> Assignee: yuhao yang
> Priority: Minor
> Fix For: 2.0.0
>
>
> We could provide a single-document topicDistributions method for
> LocalLDAModel to allow for quick queries which avoid RDD operations.
> Currently, the user must use an RDD of documents.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)