Re: [pylucene-dev] Lucene 2.3.0 JCC - Search Optimization Solutions

Alexandre Fiori Sun, 24 Feb 2008 04:24:20 -0800

Hi João,

The first thing to consider is optimizing the index. Lucene's FAQ also
recommends indexing fields without storing them, in order to get better
performance, as well as having only one process open the index for
reading/searching.


About the boolean query: it can slow down search(), but it has nothing to do
with the time spent retrieving documents from 'hits'.

I believe that your main problem is related to index optimization. To make
sure, I ran some tests against one of my indexes, which currently has 120
million records (~40GB), with 14 fields stored and indexed. It's not slow,
but not especially fast either: it's running on my laptop (Turion 64 X2,
2GB RAM).
Here are the results:

$ python search-list-10k.py
found 1283343 results in 0:00:00.871448
last 2000 took 0:00:00.620082 (2000 records read in 0:00:00.620082)
last 2000 took 0:00:00.220778 (4000 records read in 0:00:00.840860)
last 2000 took 0:00:00.092808 (6000 records read in 0:00:00.933668)
last 2000 took 0:00:00.232857 (8000 records read in 0:00:01.166525)
reading 10000 records took 0:00:01.269979

$ python search-list-100k.py
found 1283343 results in 0:00:00.873962
last 10000 took 0:00:01.252716 (10000 records read in 0:00:01.252716)
last 10000 took 0:00:00.564065 (20000 records read in 0:00:01.816781)
last 10000 took 0:00:00.604823 (30000 records read in 0:00:02.421604)
last 10000 took 0:00:00.424446 (40000 records read in 0:00:02.846050)
last 10000 took 0:00:00.422367 (50000 records read in 0:00:03.268417)
last 10000 took 0:00:00.783854 (60000 records read in 0:00:04.052271)
last 10000 took 0:00:00.386834 (70000 records read in 0:00:04.439105)
last 10000 took 0:00:00.374853 (80000 records read in 0:00:04.813958)
last 10000 took 0:00:00.394580 (90000 records read in 0:00:05.208538)
reading 100000 records took 0:00:05.608388

The script can be found at http://pastebin.ca/916177
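
The pastebin script is not reproduced here. For anyone who wants to take
similar measurements on their own index, below is a minimal sketch of a
timing harness that reports per-batch and running read times in roughly the
shape shown above. It is not the script behind those numbers: the function
name read_in_batches, its parameters, and the use of range() as a stand-in
for a Lucene 'hits' object are all assumptions made for illustration.

```python
from datetime import datetime


def read_in_batches(records, batch_size, now=datetime.now):
    """Iterate over `records` and time every full batch.

    Returns (total_count, total_elapsed, report), where report holds one
    (records_read_so_far, batch_elapsed, running_elapsed) tuple per batch.
    `now` is injectable so the clock can be replaced when testing.
    """
    start = now()
    batch_start = start
    report = []
    count = 0
    for count, _record in enumerate(records, 1):
        # The per-hit work (for example doc.get("PMID")) would go here.
        if count % batch_size == 0:
            t = now()
            report.append((count, t - batch_start, t - start))
            batch_start = t
    return count, now() - start, report


if __name__ == "__main__":
    # range() stands in for the search hits; swap in the real iterable.
    batch = 2000
    total, elapsed, report = read_in_batches(range(10000), batch)
    for count, batch_time, running in report:
        print("last %d took %s (%d records read in %s)"
              % (batch, batch_time, count, running))
    print("reading %d records took %s" % (total, elapsed))
```

Timing the search() call separately from this loop makes it easier to tell
whether the query itself or the document retrieval is the slow part.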

Hope it helps.


Boa sorte ;)


AF

On Sat, Feb 23, 2008 at 7:24 PM, João Rodrigues <[EMAIL PROTECTED]> wrote:

> Hello all!
>
> I've finally got round to setup Lucene 2.3.0 in my two production boxes
> (Ubuntu 7.10 and Windows XP), after quite a trouble with the JCC
> compilation methods. Now, I have my application all up and running and....
> it's damn slow :(
>
> I'll be brief: I have a 6.6GB index, with more than 5.000.000 biomedical
> abstracts indexed. Each document has two fields: an integer, which I will
> want to retrieve upon search (the ID of the document, sort of), and an 80
> words, stored, tokenized, string, which will be searched upon. So, I insert
> the query (say, foo bar), it builds previously sort of a "boolean query"
> with a format such as: 'foo' AND 'bar'. Then it parses it and spits out the
> results.
>
> Problem is, unlike most of the posts I've read, I don't want the first 10
> or 100 results. I want the first 10.000, or even all of them. I've read an
> HitCollector is due for this task, but my first search on google got me an
> expressive "HitCollector is too slow on PyLucene", so, I kind of sorted out
> that option. It takes minutes to get me the results I need, as it is right
> now. I'll post the code on pastebin and link it for those who feel in a good
> mood to read n00b's code and help (see below). I've tracked down the problem
> to the "doc.get("PMID")" method in the Searcher function.
>
> My question is: how can I make my search faster? My index wasn't optimized
> because it was huge and it was built with GCC. By now, it is probably
> optimized (I left an optimizer running last night) so, that is taken care
> of. I've considered threading as well, since I'll perform three different
> searches for "round". Thing is, I'm pretty green when it comes to
> programming (I'm a biologist) and I've never understood pretty much how
> threading works. If someone can point me to the right tutorial or
> documentation, I'd be glad enough to hack it up myself. Another option I've
> been given was to use an implementation of Lucene written in either C# or
> C++. However, Lucene.net isn't up to date, and neither is CLucene..
>
> So, if you think you can give out a tip on how to make my script run
> faster, I'd thank you more than a lot. It's a shame that my project fails
> because of this technical handicap :(
>
> LINKS:
>
> http://pastebin.com/m6c384ede -> Main Code
> http://pastebin.com/m3484ebfc --> Searcher Functions
>
> Best regards to you all,
>
> João Rodrigues
>
> _______________________________________________
> pylucene-dev mailing list
> [email protected]
> http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
>
>


-- 
Ship ahoy! Hast seen the White Whale?
 - Melville's Captain Ahab

