[
https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738646#action_12738646
]
Yanen Li commented on MAHOUT-123:
---------------------------------
Now the index is created and the vectors are also created without a
problem, but there are still exceptions in the gson parser when running
LDA in standalone mode:
====================================================================================
[WARNING] While downloading easymock:easymockclassextension:2.2
This artifact has been relocated to org.easymock:easymockclassextension:2.2.
[INFO] [exec:java]
09/08/03 15:18:35 INFO lda.LDADriver: Iteration 0
09/08/03 15:18:35 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
09/08/03 15:18:35 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/08/03 15:18:35 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
09/08/03 15:18:35 INFO input.FileInputFormat: Total input paths to process : 1
09/08/03 15:18:36 INFO input.FileInputFormat: Total input paths to process : 1
09/08/03 15:18:36 INFO mapred.JobClient: Running job: job_local_0001
09/08/03 15:18:36 INFO mapred.MapTask: io.sort.mb = 100
09/08/03 15:18:36 INFO mapred.MapTask: data buffer = 79691776/99614720
09/08/03 15:18:36 INFO mapred.MapTask: record buffer = 262144/327680
09/08/03 15:18:37 INFO mapred.JobClient: map 0% reduce 0%
09/08/03 15:18:37 WARN mapred.LocalJobRunner: job_local_0001
com.google.gson.JsonParseException: Failed parsing JSON source: java.io.StringReader@4977fa9a to Json
at com.google.gson.JsonParser.parse(JsonParser.java:57)
at com.google.gson.Gson.fromJson(Gson.java:376)
at com.google.gson.Gson.fromJson(Gson.java:329)
at org.apache.mahout.matrix.AbstractVector.decodeVector(AbstractVector.java:358)
at org.apache.mahout.matrix.AbstractVector.decodeVector(AbstractVector.java:342)
at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:48)
at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:518)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:303)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
Caused by: com.google.gson.ParseException: Encountered "SEQ" at line 1, column 1.
Was expecting one of:
<DIGITS> ...
"null" ...
"NaN" ...
"Infinity" ...
<BOOLEAN> ...
<SINGLE_QUOTE_LITERAL> ...
<DOUBLE_QUOTE_LITERAL> ...
")]}\'\n" ...
"{" ...
"[" ...
"-" ...
====================================================================================
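For what it's worth, the "SEQ" at line 1, column 1 looks like the magic header
of a Hadoop SequenceFile, so the mapper seems to be handed raw file bytes
instead of a JSON-encoded vector string. A quick standalone check (just a
hypothetical inspection snippet, not part of any patch) could confirm what the
vector file actually holds:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

// Hypothetical helper: open the file LDADriver is pointed at and print the
// key/value classes recorded in the SequenceFile header.
public class InspectVectorInput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);   // the vectors file passed to LDADriver
    FileSystem fs = FileSystem.get(input.toUri(), conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
    try {
      // If this succeeds, the file really is a SequenceFile ("SEQ" header),
      // and any JSON vectors live inside the values, not in the raw bytes.
      System.out.println("key class:   " + reader.getKeyClassName());
      System.out.println("value class: " + reader.getValueClassName());
    } finally {
      reader.close();
    }
  }
}

If the reader opens the file and reports a Text value class, then the vectors
themselves are presumably fine and the failure is in how the job reads the
file before handing anything to AbstractVector.decodeVector.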
Yanen
> Implement Latent Dirichlet Allocation
> -------------------------------------
>
> Key: MAHOUT-123
> URL: https://issues.apache.org/jira/browse/MAHOUT-123
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.2
> Reporter: David Hall
> Assignee: Grant Ingersoll
> Fix For: 0.2
>
> Attachments: lda.patch, MAHOUT-123.patch, MAHOUT-123.patch,
> MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch, MAHOUT-123.patch,
> MAHOUT-123.patch, MAHOUT-123.patch
>
> Original Estimate: 504h
> Remaining Estimate: 504h
>
> (For GSoC)
> Abstract:
> Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
> algorithm for automatically and jointly clustering words into "topics"
> and documents into mixtures of topics, and it has been successfully
> applied to model change in scientific fields over time (Griffiths and
> Steyvers, 2004; Hall et al., 2008). In this project, I propose to
> implement a distributed variant of Latent Dirichlet Allocation using
> MapReduce, and, time permitting, to investigate extensions of LDA and
> possibly more efficient algorithms for distributed inference.
> Detailed Description:
> A topic model is, roughly, a hierarchical Bayesian model that
> associates with each document a probability distribution over
> "topics", which are in turn distributions over words. For instance, a
> topic in a collection of newswire might include words about "sports",
> such as "baseball", "home run", "player", and a document about steroid
> use in baseball might include "sports", "drugs", and "politics". Note
> that the labels "sports", "drugs", and "politics", are post-hoc labels
> assigned by a human, and that the algorithm itself only associates
> words with probabilities. The task of parameter estimation
> in these models is to learn both what these topics are, and which
> documents employ them in what proportions.
> One of the promises of unsupervised learning algorithms like Latent
> Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take a
> massive collection of documents and condense them down into a
> collection of easily understandable topics. However, all available
> open source implementations of LDA and related topic models are not
> distributed, which hampers their utility. This project seeks to
> correct this shortcoming.
> In the literature, there have been several proposals for parallelizing
> LDA. Newman, et al (2007) proposed to create an "approximate" LDA in
> which each processor gets its own subset of the documents to run
> Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
> its very nature, which is not advantageous for repeated runs. Instead,
> I propose to follow Nallapati, et al. (2007) and use a variational
> approximation that is fast and non-random.
> References:
> David M. Blei, J McAuliffe. Supervised Topic Models. NIPS, 2007.
> David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet
> Allocation. The Journal of Machine Learning Research, 3, pp. 993-1022,
> March 2003.
> T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl
> Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.
> David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying
> the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.
> Ramesh Nallapati, William Cohen, John Lafferty, Parallelized
> variational EM for Latent Dirichlet Allocation: An experimental
> evaluation of speed and scalability, ICDM workshop on high performance
> data mining, 2007.
> Newman, D., Asuncion, A., Smyth, P., & Welling, M. Distributed
> Inference for Latent Dirichlet Allocation. NIPS, 2007.
> Xuerui Wang, Andrew McCallum. Topics over time: a non-Markov
> continuous-time model of topical trends. KDD, 2006.
> Wolfe, J., Haghighi, A, and Klein, D. Fully distributed EM for very
> large datasets. ICML, 2008.
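A side note on the detailed description quoted above: the Blei et al. (2003)
model it refers to is the standard one in which each document d draws topic
proportions from a Dirichlet and each word from a topic-specific word
distribution, roughly

  \theta_d \sim \mathrm{Dirichlet}(\alpha), \quad
  z_{d,n} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d), \quad
  w_{d,n} \mid z_{d,n} = k \sim \mathrm{Multinomial}(\beta_k)

so each \beta_k is a distribution over the vocabulary (a "topic") and
\theta_d is the per-document mixture of topics that parameter estimation has
to recover.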
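And on the variational approximation mentioned for parallelization: the
per-document mean-field updates from Blei et al. (2003), which Nallapati et
al. (2007) parallelize, are

  \phi_{n,k} \propto \beta_{k,w_n} \exp(\Psi(\gamma_k)), \qquad
  \gamma_k = \alpha_k + \sum_n \phi_{n,k}

Each document's E-step depends only on its own words and the current \beta,
so a map step per document, with a reduce step that sums the \phi statistics
into new \beta estimates, is the natural MapReduce decomposition; presumably
that is the structure followed here.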
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.