On Wednesday July 4 2007 12:52 am, Filip de Waard wrote:
> Hello,
>
> Until today, I've never had a single worry about performance in my
> short but exciting Python experience. However, now I'm trying to
> index over six million books from a MySQL database using PyLucene and
> I'd like to speed it up.

You're pretty much just using Python here to move bits between a pair of
libraries. There's not much optimization to be done there. FWIW, this is
the sort of thing Python is good for, being the 2nd Best Language For
Everything.

> Tomorrow I'll start playing with a profiler, but in the meantime:
> does anyone have any recommendations as to how to be most efficient
> in regard to the Python code, database interaction and of course the
> PyLucene indexing process? Or maybe I'm doing something horribly
> wrong in my script?

Yup, several problems.

1. You need an ORDER BY clause with SELECT ... LIMIT, otherwise you'll
   get an arbitrary set of rows each time. Well, at least every other
   database on earth works like this, but then again, we're talking
   about MySQL.

2. You need to configure your IndexWriter for bulk loading. See the
   setMaxBufferedDocs, setMaxMergeDocs & setMergeFactor methods at
   http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/index/IndexWriter.html
   IIRC, these have changed slightly in 2.1. Basically, you'll want to
   increase those values as much as possible without causing your
   script to swap, and run an optimize() at the end.

3. Make sure you have sufficient hardware. More RAM will allow you to
   bump the above values. A fast disk / properly configured RAID array
   will help too, but RAM's the bottleneck.

> Any pointer would be most appreciated.

Good luck *searching* an index of this size. I dunno what an acceptable
response time is for your application, but you're gonna need a pretty
beefy box with *lots* of RAM to query 6 million documents quickly.

--Pete

--
Peter Fein || 773-575-0694 || [EMAIL PROTECTED]
http://www.pobox.com/~pfein/ || PGP: 0xCCF6AE6B
irc: [EMAIL PROTECTED] || jabber: [EMAIL PROTECTED]
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
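
For what it's worth, the ORDER BY point (item 1 in the reply) can be sketched roughly as below. This is only an illustration under assumptions, not the original script: the `books` table and its `id` and `title` columns are hypothetical, `sqlite3` stands in for MySQL so the example runs on its own, and it pages with `WHERE id > ?` (keyset pagination) rather than a growing OFFSET, which is a swapped-in variant of the plain `ORDER BY ... LIMIT` query. It also assumes the ids are positive integers.

```python
import sqlite3


def iter_batches(conn, batch_size):
    """Yield lists of (id, title) rows in primary-key order.

    ORDER BY makes each batch deterministic, so no row is skipped or
    fetched twice between queries.
    """
    last_id = 0  # assumption: ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, title FROM books WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        # Resume after the last id seen instead of using OFFSET.
        last_id = rows[-1][0]


def make_demo_db(n):
    """Build a throwaway in-memory table (hypothetical schema) with n rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO books (id, title) VALUES (?, ?)",
        [(i, "Book %d" % i) for i in range(1, n + 1)],
    )
    return conn
```

Each batch would then be handed to the indexer; with a MySQL driver the query is the same apart from the parameter placeholder style.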
