After two rather frustrating days, I find I need to apologize to Lucene. My last run of 225 messages averaged around 25 milliseconds per message--that's parsing the XML, creating the Document, and putting it in the index (2.5 GHz CPU, 1 GB RAM). It turns out the performance problem was Xerces SAX "helping me" by loading the DTD before it parsed each message, and the DTD wasn't local to our site. After seeing Terry's response, I knew there had to be more going on than what I was assuming.
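For anyone hitting the same symptom (slow parsing with a nearly idle CPU), one common way to stop a SAX parser from fetching a remote DTD is to supply an entity resolver that hands back an empty document. The following is only a minimal sketch using the standard JAXP/SAX interfaces, not the code actually used here: the class name DtdSkippingParser, the CountingHandler helper, and the example.invalid DTD URL are all made up for illustration, and it assumes the messages do not rely on entities or default attributes declared in the DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class DtdSkippingParser {

    /** Hypothetical handler: counts elements and never fetches an external DTD. */
    public static class CountingHandler extends DefaultHandler {
        public int elements = 0;
        public int resolveCalls = 0;

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            // Called when the parser wants an external entity such as the DTD.
            // Returning an empty source means no network or disk access happens.
            resolveCalls++;
            return new InputSource(new StringReader(""));
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attrs) {
            elements++;
        }
    }

    /** Parses the given XML text with a non-validating SAX parser. */
    public static CountingHandler parse(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false); // not validating, but the DTD would still be loaded
        SAXParser parser = factory.newSAXParser();
        CountingHandler handler = new CountingHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // The DOCTYPE points at a host that does not exist; without the
        // resolver above, the parser would try to fetch it.
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE story SYSTEM \"http://example.invalid/story.dtd\">\n"
                + "<story><title>t</title><body>b</body></story>";
        CountingHandler h = parse(xml);
        System.out.println(h.elements + " elements, "
                + h.resolveCalls + " external lookups intercepted");
    }
}
```

If the DTD is actually needed (for entity definitions, say), the same resolveEntity hook could instead return an InputSource over a local copy of the file, which avoids the remote fetch without discarding the declarations.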
Thanks for the suggestions. I wonder how much faster I can go if I implement some of those?

Regards

Scott

-----Original Message-----
From: Terry Steichen [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 06, 2004 5:48 AM
To: Lucene Users List
Subject: Re: Performance question

Scott,

Here are some figures to use for comparison. Using the latest Lucene release, I index about 200 similar-sized XML files at a time, on a Windows XP machine (2 GHz). First I create a new index, which adds the documents at a rate of about 8 per second (I don't recall what the CPU % is during this). Then I merge this new index with the master one (using, I think, the default merge factor), which takes about 4.5 minutes (during which time the CPU utilization stays near 100%). The master index currently holds about 115,000 such documents.

HTH,

Regards,

Terry

----- Original Message -----
From: "Scott Smith" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, January 05, 2004 10:26 PM
Subject: Performance question

> I have an application that is reading in XML files and indexing them.
> Each XML file is 3K-6K bytes. This application preloads a database
> that I will add to "on the fly" later. However, all I want it to do
> initially is take some existing files and create the initial index as
> quickly as I can.
>
> Since I want to index "on the fly" later, I set the merge factor to
> 10. I'm assuming that I can't create the index initially with one
> merge factor (e.g., 100) and then change the merge factor later
> (true?).
>
> What I see is that it takes 1-3 seconds per XML file to do the index.
> This means I'm indexing around 150k bytes per minute. I also notice
> that the CPU utilization rarely exceeds 5% (looking at Task Manager on
> a Windows box). I use Xerces to read in the files (SAX interface) and
> I don't close or optimize the index between stories nor do I sleep
> anyplace. I've looked at the page fault numbers and they aren't
> changing much. I guess I would have expected that I would have pretty
> much pegged the CPU and seen much faster indexing.
>
> Any ideas/suggestions?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
