[pylucene-dev] Clustering with PyLucene

Sebastian Steins Fri, 07 Mar 2008 07:28:19 -0800

Hi there,

First of all, I want to introduce myself, because I have not posted here before. My name is Sebastian and I am currently working as a web application developer. Search and information retrieval are not part of my current work, but I am interested in those fields as a hobby. I was inspired by the book "Programming Collective Intelligence", which describes very complex algorithms like clustering in a very accessible way and shows the solutions in plain Python code with SQLite bindings.

I was very taken with the solutions in the book, so I tried to modify them for some experiments, and I wanted to use Lucene (through PyLucene) instead of SQLite.

For now, I have a simple script which inserts articles from an RSS feed into a Lucene index using PyLucene.


An article also has outgoing links, which I store this way:
#### Code ####
for link in params['links']:
    # One stored, untokenized "linksto" field per outgoing link
    doc.add(self.Lucene.Field("linksto", link,
                              self.Lucene.Field.Store.YES,
                              self.Lucene.Field.Index.UN_TOKENIZED))

#### /Code ####

Is that a good way? Or is there another way in Lucene to store "relational" data? How would it be possible to retrieve the document with the most incoming links? Or the document with the greatest number of outgoing links?
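
To make the two link questions concrete, here is a minimal sketch in plain Python of the counts being asked for. It does not use Lucene or PyLucene at all. The function name link_counts, the dict-of-lists input and the sample URLs are assumptions made up for illustration, and only links that point at articles in the same collection are counted as incoming.

```python
from collections import Counter

def link_counts(articles):
    """Count incoming and outgoing links.

    articles: hypothetical dict mapping an article URL to the list of
    URLs it links to (the values stored in the "linksto" fields).
    """
    # Outgoing: number of distinct link targets per article
    outgoing = {url: len(set(links)) for url, links in articles.items()}

    # Incoming: how many other known articles link to each article
    incoming = Counter()
    for url, links in articles.items():
        for target in set(links):
            if target in articles and target != url:
                incoming[target] += 1
    return incoming, outgoing

# Made-up sample data
articles = {
    "a": ["b", "c", "http://example.org/external"],
    "b": ["c"],
    "c": ["a"],
}

incoming, outgoing = link_counts(articles)
most_linked = max(articles, key=lambda url: incoming[url])
most_linking = max(outgoing, key=outgoing.get)
print(most_linked, most_linking)
```

This walks over every article once, so it is the brute-force baseline rather than an index-side answer.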

Additionally, I want to calculate the similarity between documents with my script, using k-means, dendrograms and other techniques (mostly described in the book mentioned above). To do so, I would have to compare a newly crawled article, which is about to be written to the index, with all articles already in the Lucene index. How can that be achieved in a more elegant way than doing a for-loop from 0 to numDocs()? Is there a cheaper way (in terms of computing resources)?
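
For reference, the brute-force comparison described above amounts to roughly the following. This is only a sketch in plain Python with no Lucene calls. Cosine similarity over raw term counts is used here as a stand-in similarity measure, whitespace tokenization is an assumption, and the names term_vector, cosine and most_similar as well as the sample texts are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Term-frequency vector; lowercasing and whitespace splitting are assumed."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(new_text, indexed):
    """Compare one new article with every indexed article (one pass over all docs)."""
    new_vec = term_vector(new_text)
    scores = [(cosine(new_vec, term_vector(text)), doc_id)
              for doc_id, text in indexed.items()]
    return sorted(scores, reverse=True)

# Made-up stand-in for the contents of the index
indexed = {
    "doc1": "lucene search index",
    "doc2": "python clustering k-means",
    "doc3": "cooking pasta recipes",
}

ranking = most_similar("clustering documents with python", indexed)
print(ranking[0])
```

Each new article costs one similarity computation per indexed document, which is exactly the loop from 0 to numDocs() that the question hopes to avoid.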

Unfortunately, I am not very familiar with Java, so my research on the above questions on the sites around the Lucene community did not really help. I found Mahout, a Java program for k-means and other similar algorithms for Lucene. However, this didn't help, because I want to implement my experiments in Python, not Java.



Thank you very much for your help!



Sebastian
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
