[pylucene-dev] Lucene 2.3.0 JCC - Search Optimization Solutions

João Rodrigues Sat, 23 Feb 2008 14:24:22 -0800

Hello all!

I've finally got round to setting up Lucene 2.3.0 on my two production boxes
(Ubuntu 7.10 and Windows XP), after quite a bit of trouble with the JCC
compilation. Now I have my application up and running and... it's damn
slow :(


I'll be brief: I have a 6.6GB index with more than 5.000.000 biomedical
abstracts indexed. Each document has two fields: an integer, which I want
to retrieve upon search (the ID of the document, sort of), and a stored,
tokenized, 80-word string, which is the field that gets searched. When I
enter a query (say, foo bar), it is first turned into a sort of "boolean
query" with a format such as: 'foo' AND 'bar'. That query is then parsed and
the search spits out the results.
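
For illustration, the query-building step described above could look roughly
like the sketch below. The function name build_and_query is hypothetical and
is not taken from the pastebin code; the sketch only assumes that the input
is split on whitespace and that the terms are joined with AND.

```python
def build_and_query(user_input):
    """Turn free text such as "foo bar" into "'foo' AND 'bar'".

    Hypothetical helper: assumes terms are separated by whitespace and
    contain no quotes or other query-syntax characters.
    """
    terms = user_input.split()
    return " AND ".join("'%s'" % term for term in terms)


if __name__ == "__main__":
    print(build_and_query("foo bar"))
```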

The problem is that, unlike in most of the posts I've read, I don't want the
first 10 or 100 results. I want the first 10.000, or even all of them. I've
read that a HitCollector is meant for this task, but my first search on
Google turned up an emphatic "HitCollector is too slow on PyLucene", so I
more or less ruled out that option. As it is right now, it takes minutes to
get the results I need. I'll post the code on pastebin and link it for those
who feel in a good mood to read a n00b's code and help (see below). I've
tracked the problem down to the doc.get("PMID") call in the Searcher function.
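
One way to confirm where the time goes is to time the per-hit field lookup
separately from the search itself. The sketch below is plain Python with
stand-in objects: FakeDoc and the hits list are hypothetical placeholders for
whatever the real search returns, and only the doc.get("PMID") call mirrors
the code described above.

```python
import time


class FakeDoc(object):
    """Hypothetical stand-in for a stored document; not a Lucene class."""

    def __init__(self, pmid):
        self._fields = {"PMID": pmid}

    def get(self, name):
        # Same call shape as doc.get("PMID") in the post.
        return self._fields.get(name)


def collect_ids(hits, field="PMID"):
    """Fetch one stored field per hit and report how long the loop took."""
    start = time.perf_counter()
    ids = [doc.get(field) for doc in hits]
    elapsed = time.perf_counter() - start
    return ids, elapsed


if __name__ == "__main__":
    hits = [FakeDoc(str(n)) for n in range(10000)]
    ids, elapsed = collect_ids(hits)
    print("%d ids fetched in %.4f s" % (len(ids), elapsed))
```

Timing the search call the same way would show how the total splits between
searching and fetching the stored field.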

My question is: how can I make my search faster? My index wasn't optimized
because it was huge and it was built with GCC. By now it is probably
optimized (I left an optimizer running last night), so that is taken care
of. I've considered threading as well, since I perform three different
searches per "round". The thing is, I'm pretty green when it comes to
programming (I'm a biologist) and I've never really understood how
threading works. If someone can point me to the right tutorial or
documentation, I'd be glad to hack it up myself. Another option I've
been given is to use an implementation of Lucene written in either C# or
C++. However, Lucene.net isn't up to date, and neither is CLucene.
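
As a starting point for the threading idea, the three searches of one round
could be handed to a small thread pool. This is only a sketch in plain
Python: run_search is a hypothetical placeholder for the real search
function, and the example queries are made up. With a JCC-built PyLucene,
each worker thread would presumably also have to be attached to the Java VM
(lucene.getVMEnv().attachCurrentThread()) before touching Lucene objects;
that step is left out here so the sketch runs without Lucene.

```python
from concurrent.futures import ThreadPoolExecutor


def run_search(query):
    """Hypothetical placeholder: the real version would query the index."""
    return "results for %s" % query


def run_round(queries):
    """Run every search of one round in its own worker thread.

    Results come back in the same order as the queries.
    """
    workers = max(1, len(queries))  # the pool needs at least one worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_search, queries))


if __name__ == "__main__":
    for line in run_round(["'foo' AND 'bar'", "'foo'", "'bar'"]):
        print(line)
```

Whether this actually shortens a round depends on where the time is spent,
so it is worth timing a round before and after.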

So, if you can give me a tip on how to make my script run faster, I'd be
more than grateful. It would be a shame if my project failed because of
this technical handicap :(

LINKS:

http://pastebin.com/m6c384ede -> Main Code
http://pastebin.com/m3484ebfc -> Searcher Functions

Best regards to you all,

João Rodrigues

_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
