Re: [pylucene-dev] Clustering with PyLucene

Andi Vajda Fri, 07 Mar 2008 08:49:32 -0800

On Mar 7, 2008, at 7:28, Sebastian Steins <[EMAIL PROTECTED]> wrote:

Hi there,
first of all, I want to introduce myself, because I have not posted here before. My name is Sebastian and I am currently working as a web application developer. Search and information retrieval are not part of my current work, but I am interested in those fields as a hobby. I was inspired by the book "Programming Collective Intelligence", which describes very complex algorithms like clustering in a very easy way and shows the solutions in plain Python code with SQLite bindings.
I was very impressed by the solutions in the book, so I tried to modify them for some experiments and I wanted to use Lucene instead of PyLucene.


Did you mean to say "Lucene instead of SQLite"?

For now, I have a simple script which inserts articles from an RSS feed into a Lucene index using PyLucene.
An article also has outgoing links, which I store this way:
#### Code ####
for link in params['links']:
    doc.add(self.Lucene.Field("linksto", link,
                              self.Lucene.Field.Store.YES,
                              self.Lucene.Field.Index.UN_TOKENIZED))
#### /Code ####
Is that a good way? Or is there another way in Lucene to store "relational" data? How would it be possible to retrieve the document with the most incoming links? Or the document with the greatest number of outgoing links?

Lucene is not a relational database but I assume you know that. It indexes text tokens and returns documents that contain them.
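
To make the "tokens in, documents out" point concrete, here is a minimal sketch in plain Python. It is not PyLucene's API: the sample data and the names build_postings and incoming_count are made up for illustration. It assumes each link is kept as one untokenized term, as in the "linksto" field above, so the number of documents containing a given link term is that link's incoming count.

```python
from collections import defaultdict

# Hypothetical sample data: document id -> outgoing links of that document.
docs = {
    "a": ["http://x.example/1", "http://x.example/2"],
    "b": ["http://x.example/1"],
    "c": ["http://x.example/1", "http://x.example/2", "http://x.example/3"],
}

def build_postings(docs):
    """Map each link (one untokenized term) to the set of documents containing it."""
    postings = defaultdict(set)
    for doc_id, links in docs.items():
        for link in links:
            postings[link].add(doc_id)
    return postings

postings = build_postings(docs)

def incoming_count(url):
    """Number of documents that link to url (the term's document frequency)."""
    return len(postings.get(url, ()))

# Link target with the most incoming links.
most_linked = max(postings, key=lambda url: len(postings[url]))

# Document with the greatest number of distinct outgoing links.
most_outgoing = max(docs, key=lambda doc_id: len(set(docs[doc_id])))
```

An index answers "which documents contain this term" directly. A ranking such as "most incoming links" still has to be computed on top of those per-term counts.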

Additionally, I want to calculate the similarity between documents with my script, using k-means, dendrograms and other things (mostly described in the book mentioned above). Therefore, I would have to compare a recently found (crawled) article, which is to be written to the index, with all articles in the Lucene index. How can that be achieved in a more elegant way than doing a for-loop from 0 to numDocs()? Is there a cheaper (in terms of computer resources) way?

Lucene can do that for you if you index your documents with term vectors. The "Lucene in Action" book (recommended reading) has an example on how to implement this. The sample code (called MoreLikeThis) is available in PyLucene in Python.
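
As a rough idea of why term vectors help, here is a small sketch in plain Python. It is not the MoreLikeThis sample and not PyLucene's API: TinyIndex, term_vector and cosine are hypothetical names, and whitespace splitting stands in for a real analyzer. It keeps a term-frequency vector per document plus an inverted index, so a new article is only compared against documents that share at least one term with it instead of against every document from 0 to numDocs().

```python
import math
from collections import Counter, defaultdict

def term_vector(text):
    """Term-frequency vector over lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class TinyIndex:
    """Toy index: stores a term vector per document and term -> doc ids."""

    def __init__(self):
        self.vectors = {}
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        vec = term_vector(text)
        self.vectors[doc_id] = vec
        for term in vec:
            self.postings[term].add(doc_id)

    def similar(self, text, top_n=3):
        """Score only the documents that share at least one term with text."""
        query = term_vector(text)
        candidates = set()
        for term in query:
            candidates |= self.postings.get(term, set())
        scored = [(cosine(query, self.vectors[d]), d) for d in candidates]
        scored.sort(reverse=True)
        return scored[:top_n]

index = TinyIndex()
index.add("d1", "lucene indexes text tokens")
index.add("d2", "python bindings for lucene")
index.add("d3", "cooking pasta at home")

results = index.similar("lucene text search")
```

Here "d3" is never scored because it shares no term with the new text. The same pairwise similarity could then feed a clustering step such as k-means.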

Unfortunately, I am not very familiar with Java, so my research for the above questions in the sites around the Lucene community did not really help.


Try harder.

The [EMAIL PROTECTED] list has a large community of users that can help you with Lucene how-to questions that are independent of the eventual implementation language.

Lucene is implemented in Java; PyLucene just wraps it with Python wrappers. Familiarity with Java code and docs can be helpful. The pylucene-dev list you just wrote to is about specific issues related to that, not general Lucene how-to topics.

I found Mahout, a Java program for k-means and other similar algorithms for Lucene. However, this didn't help, because I want to implement my experiments in Python, not Java.

Get the theory questions answered first with the book, the web and the java-user list. Then, take a look at the many Python samples that ship with PyLucene to see how you could apply the solutions you found with Python.


Andi..


Thank you very much for your help!



Sebastian
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
