On Wednesday July 4 2007 12:52 am, Filip de Waard wrote:
> Hello,
>
> Until today, I've never had a single worry about performance in my
> short but exciting Python experience. However, now I'm trying to
> index over six million books from a MySQL database using PyLucene and
> I'd like to speed it up.

You're pretty much just using Python here to move bits between a pair of 
libraries.  There's not much optimization to be done there.  FWIW, this is 
the sort of thing Python is good for, being the 2nd Best Language For 
Everything.

> Tomorrow I'll start playing with a profiler, but in the meantime:
> does anyone have any recommendations as to how to be most efficient
> in regard to the Python code, database interaction and of course the
> PyLucene indexing process? Or maybe I'm doing something horribly
> wrong in my script?

Yup, several problems.

1. You need an ORDER BY clause with SELECT ... LIMIT, otherwise the row order 
is unspecified and successive queries can return overlapping or missing rows. 
Well, at least every other database on earth works like this, but then again, 
we're talking about MySQL.

2. You need to configure your IndexWriter for bulk loading.  See the 
setMaxBufferedDocs, setMaxMergeDocs & setMergeFactor methods at 
http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/index/IndexWriter.html
 
IIRC, these have changed slightly in 2.1.  Basically, you'll want to increase 
those values as much as possible without causing your script to swap, and run 
an optimize() at the end.

3.  Make sure you have sufficient hardware.  More RAM will allow you to bump 
the above values.  A fast disk / properly configured RAID array will help 
too, but RAM's the bottleneck.
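
To illustrate point 1, here is a minimal sketch of paging through a table in a 
stable order. It is not taken from your script: the `books` table, its `id` and 
`title` columns, and the `iter_batches` helper are made up for the example, and 
it uses the stdlib `sqlite3` module as a stand-in for MySQL so it runs 
anywhere. It also swaps LIMIT with an offset for "WHERE id > last seen id", 
which tends to hold up better on millions of rows, but the ORDER BY is the part 
that matters.

```python
import sqlite3


def iter_batches(conn, batch_size=1000):
    """Yield lists of (id, title) rows in primary-key order, one batch at a time.

    Assumes a hypothetical `books` table with a positive integer `id` column.
    """
    last_id = 0  # assumption: ids start above 0
    while True:
        rows = conn.execute(
            "SELECT id, title FROM books WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        # Resume after the last row we saw, so no row is skipped or repeated.
        last_id = rows[-1][0]


# Tiny in-memory demo standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO books (id, title) VALUES (?, ?)",
    [(i, "Book %d" % i) for i in range(1, 8)],
)
batches = list(iter_batches(conn, batch_size=3))
```

Each batch can then be handed to the indexer before the next one is fetched, 
which keeps memory use flat regardless of table size.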

> Any pointer would be most appreciated.

Good luck *searching* an index of this size.  I dunno what an acceptable 
response time is for your application, but you're gonna need a pretty beefy 
box with *lots* of RAM to query 6 million documents quickly.

--Pete

-- 
Peter Fein   ||   773-575-0694   ||   [EMAIL PROTECTED]
http://www.pobox.com/~pfein/   ||   PGP: 0xCCF6AE6B
irc: [EMAIL PROTECTED]   ||   jabber: [EMAIL PROTECTED]
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
