Re: LDA topic Modeling spark + python

2016-02-29 Thread Bryan Cutler
The input to LDA.train needs to be an RDD of lists, each with an integer
(id) as the first element and a pyspark.mllib.linalg.Vector object
containing real numbers (term counts) as the second, i.e. an RDD of [doc_id,

From your example, it looks like your corpus is a list with a zero-based
id, with the second element a tuple of the user id and the list of lines from
the data that have that user_id, something like [doc_id, (user_id, [line0,

You need to make that element a Vector containing real numbers somehow.
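
One way to do that, as a minimal sketch rather than a definitive recipe: tokenize each user's lines, build a vocabulary over all documents, and count terms per document. The helper names (tokenize, build_vocab, to_term_counts), the whitespace tokenizer, and the sample data are assumptions for illustration and are not part of the original code. The Spark-specific wrapping is shown only in comments.

```python
def tokenize(line):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return line.lower().split()


def build_vocab(docs):
    """Map each distinct term across all documents to a column index."""
    terms = sorted({tok for lines in docs for line in lines for tok in tokenize(line)})
    return {term: idx for idx, term in enumerate(terms)}


def to_term_counts(lines, vocab):
    """Return a dense list of float counts, one slot per vocabulary term."""
    vec = [0.0] * len(vocab)
    for line in lines:
        for tok in tokenize(line):
            if tok in vocab:  # ignore terms outside the vocabulary
                vec[vocab[tok]] += 1.0
    return vec


# Hypothetical grouped data: one list of status lines per user.
docs = [
    ["where is my mind", "my mind is here"],
    ["ham and cheese"],
]
vocab = build_vocab(docs)
corpus = [[doc_id, to_term_counts(lines, vocab)] for doc_id, lines in enumerate(docs)]

# With Spark available, each count list would be wrapped as a Vector, e.g.:
#   from pyspark.mllib.linalg import Vectors
#   rdd = sc.parallelize(corpus).map(lambda d: [d[0], Vectors.dense(d[1])])
#   model = LDA.train(rdd, k=3)
```

For a realistic vocabulary most counts are zero, so Vectors.sparse would likely be a better fit than Vectors.dense. The dense form is used here only to keep the sketch short.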

On Sun, Feb 28, 2016 at 11:08 PM, Mishra, Abhishek wrote:

> Hello Bryan,
> Thank you for the update on Jira. I took your code and tried it with mine,
> but I get an error when the vector is created. Please see my code below
> and advise.
> My input file has some contents like this:
> "user_id","status"
> "0026c10bbbc7eeb55a61ab696ca93923","http:
> **bobsnewline**
> tiftakar, Trudy Darmanin  <3?"
> "0026c10bbbc7eeb55a61ab696ca93923","Brandon Cachia ,All I know is
> that,you're so nice."
> "0026c10bbbc7eeb55a61ab696ca93923","Melissa Zejtunija:HAM AND CHEESE BIEX
> INI??? **bobsnewline**  Kirr:bit tigieg mel **bobsnewline**  Melissa
> Zejtunija :jaq le mandix aptit tigieg **bobsnewline**  Kirr:int bis
> serjeta?"
> "0026c10bbbc7eeb55a61ab696ca93923",".Where is my mind?"
> And what I am doing in my code is like this:
> import string
> from pyspark.sql import SQLContext
> from pyspark import SparkConf, SparkContext
> from pyspark.sql import SQLContext
> from pyspark.mllib.clustering import LDA, LDAModel
> from nltk.tokenize import word_tokenize
> from stop_words import get_stop_words
> from nltk.stem.porter import PorterStemmer
> from gensim import corpora, models
> import gensim
> import textmining
> import pandas as pd
> conf = SparkConf().setAppName("building a warehouse")
> sc = SparkContext(conf=conf)
> sql_sc = SQLContext(sc)
> data = sc.textFile('file:///home/cloudera/LDA-Model/Pyspark/test1.csv')
> header = data.first() #extract header
> print header
> data = data.filter(lambda x:x !=header)#filter out header
> pairs = data.map(lambda x: (x.split(',')[0], x))#.collect()#generate pair rdd key value
> #data11=data.subtractByKey(header)
> #print pairs.collect()
> (x,y): (x, [y])).reduceByKey(lambda a, b: a + b)
> grouped=pairs.groupByKey()#grouping values as per key
> #print grouped.collectAsMap()
> grouped_val = grouped.map(lambda x: (list(x[1]))).collect()
> (x,y):(x,[y]))
> #df_grouped_val=sql_sc.createDataFrame(rr, ["user_id", "status"])
> #print list(enumerate(grouped_val))
> #corpus = grouped.zipWithIndex().map(lambda x: [x[1],
> x[0]]).cache()#.collect()
> corpus = grouped.zipWithIndex().map(lambda (term_counts, doc_id): [doc_id,
> term_counts]).cache()
> #corpus.cache()
> model = LDA.train(corpus, k=10, maxIterations=10, optimizer="online")
> #ldaModel = LDA.train(corpus, k=3)
> print corpus
> topics = model.describeTopics(3)
> print("\"topic\", \"termIndices\", \"termWeights\"")
> for i, t in enumerate(topics):
>     print("%d, %s, %s" % (i, str(t[0]), str(t[1])))
> sc.stop()
> Please help me with this.
> Abhishek

Re: LDA topic Modeling spark + python

2016-02-25 Thread Bryan Cutler
I'm not exactly sure how you would like to set up your LDA model, but I
noticed there was no Python example for LDA in Spark. I created this issue
to add it. Keep an eye on it in case it is of help.



RE: LDA topic Modeling spark + python

2016-02-24 Thread Mishra, Abhishek
Hello All,

If someone has any leads on this please help me.


From: Mishra, Abhishek
Sent: Wednesday, February 24, 2016 5:11 PM
Subject: LDA topic Modeling spark + python

Hello All,

I am building an LDA model and would appreciate some guidance.

I have a CSV file with two columns, "user_id" and "status". I have to
generate a word-topic distribution after aggregating by user_id; that is,
I need to model users on their grouped statuses. The topic length
is 2000 and the value of k, or number of words, is 3.
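
The aggregation step described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name group_statuses and the sample rows are hypothetical, and the standard csv module is used because the status field is quoted and may itself contain commas, which a plain split(',') would break apart.

```python
import csv
import io
from collections import defaultdict


def group_statuses(csv_text):
    """Parse quoted CSV text and collect every status under its user_id."""
    reader = csv.DictReader(io.StringIO(csv_text))
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["user_id"]].append(row["status"])
    return dict(grouped)


# Hypothetical sample in the same two-column layout as the file described.
sample = (
    '"user_id","status"\n'
    '"u1","Brandon ,All I know is that,you\'re so nice."\n'
    '"u1",".Where is my mind?"\n'
    '"u2","hello world"\n'
)
grouped = group_statuses(sample)
```

Each value in grouped is then one "document" per user, ready to be tokenized and counted. On a Spark RDD the same grouping would typically be a map to (user_id, status) pairs followed by groupByKey.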

Please, if you can point me to a link or some code base for Spark with
Python, I would be grateful.

Looking forward to a reply,

