Since you are trying this anyway and looking for ways to improve indexing times, could you try replacing the use of java.io.RandomAccessFile in the FSDirectory implementation with the attached implementation? It supposedly increases I/O throughput by orders of magnitude by using partial buffering.
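For anyone reading without the attachment, the partial-buffering idea can be sketched roughly as below. This is not the attached BufferedRandomAccessFile and makes no claim about how it works internally; the class name BufferedRandomAccessReader and its methods are hypothetical, and the sketch covers reads only. It keeps one block of the file in memory so that most single-byte reads are served from the buffer instead of going to the underlying java.io.RandomAccessFile.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Hypothetical read-only wrapper that buffers one block of a RandomAccessFile.
 * Illustrative only; not the implementation attached to this message.
 */
class BufferedRandomAccessReader implements AutoCloseable {
    private final RandomAccessFile file;
    private final byte[] buffer;
    private long bufferStart = 0; // file offset of buffer[0]
    private int bufferLength = 0; // number of valid bytes currently buffered
    private long position = 0;    // logical read position seen by the caller

    BufferedRandomAccessReader(File f, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        this.file = new RandomAccessFile(f, "r");
        this.buffer = new byte[bufferSize];
    }

    /** Moves the logical position; the file is not touched until the next read. */
    void seek(long pos) {
        if (pos < 0) {
            throw new IllegalArgumentException("pos must be non-negative");
        }
        position = pos;
    }

    long getFilePointer() {
        return position;
    }

    /** Returns the next byte as 0..255, or -1 at end of file. */
    int read() throws IOException {
        boolean inBuffer = position >= bufferStart
                && position < bufferStart + bufferLength;
        if (!inBuffer && !refill()) {
            return -1;
        }
        int value = buffer[(int) (position - bufferStart)] & 0xFF;
        position++;
        return value;
    }

    /** Loads the block starting at the current position; false at end of file. */
    private boolean refill() throws IOException {
        file.seek(position);
        int n = file.read(buffer, 0, buffer.length);
        bufferStart = position;
        bufferLength = Math.max(n, 0);
        return n > 0;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

Whether a wrapper like this helps in practice depends on the access pattern: sequential reads and short seeks benefit most, while widely scattered seeks force a refill on nearly every read.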
Terry Steichen wrote:
Mike,
By way of comparison, I've got a collection of about 50,000 XML files, each of which averages about 8K. It takes about 1.25 hours to index (on a 1.8 GHz machine). I use basically the standard configuration (mergeFactor, etc.) and I've got about 30 fields per document. I add about 200 new ones per day. I don't recall how long it takes to index the 200 (I do it through a background task), but it takes a couple of minutes to merge the new 200-document index with the master index.
HTH,
Terry
----- Original Message -----
From: "Michael Barry" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, February 24, 2003 2:00 PM
Subject: Indexing Tips and Hints
All,
I'm in need of some pointers, hints or tips on indexing large collections of data. I know I saw some tips on this list before, but when I tried searching the list, I came up blank. I have a large collection of XML files (336000 files, around 5K apiece) that I'm indexing, and it's taking quite a bit of time (27 hours). I've played around with the mergeFactor, RAMDirectories and multiple threads (X number of threads indexing a subset of the data and then merging the indexes at the end), but I cannot seem to bring the time down. I'm probably not doing these things properly, but from what I've read I believe I am. Maybe this is the best I can do with this data, but I would be really grateful to hear how others have tackled this same issue. As always, pointers to places in the mailing list archive or other places would be appreciated.
Thanks, Mike.
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
-- Best regards, Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
BufferedRandomAccessFile.zip
Description: Zip compressed data
