Scott, FYI - I happen to be using dom4j. I might add, however, that I parse each document into about 18 fields. If, as Dror says, your document is simpler, (or if you increase the merge factor) you can probably better that.
Regards, Terry ----- Original Message ----- From: "Dror Matalon" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Wednesday, January 07, 2004 10:48 PM Subject: Re: Performance question > On Wed, Jan 07, 2004 at 07:24:22PM -0700, Scott Smith wrote: > > After two rather frustrating days, I find I need to apologize to Lucene. My > > last run of 225 messages averaged around 25 milliseconds per message--that's > > parsing the xml, creating the Document, and putting it in the index (2.5Ghz > > cpu, 1G ram). Turns out the performance problem was xerces sax "helping me" > > by loading the DTD before it parsed each message and the DTD wasn't local to > > our site. After seeing Terry's response, I knew there had to be more going > > on than what I was assuming. > > > > Thanks for the suggestions. I wonder how much faster I can go if I > > implement some of those? > > 25 msecs to insert a document is on the high side, but it depends of > course on the size of your document. You're probably spending 90% of > your time in the XML parsing. I believe that there are other parsers > that are faster than xerces, you might want to look at these. You might > want to look at http://dom4j.org/. > > Dror > > > > > > Regards > > > > Scott > > > > -----Original Message----- > > From: Terry Steichen [mailto:[EMAIL PROTECTED] > > Sent: Tuesday, January 06, 2004 5:48 AM > > To: Lucene Users List > > Subject: Re: Performance question > > > > > > Scott, > > > > Here are some figures to use for comparision. Using the latest Lucene > > release, I index about 200 similar-sized XML files at a time, on a Windows > > XP machine (2Ghz). First I create a new index, which adds the documents at > > a rate of about 8 per second (I don't recall what the cpu % is during this). > > Then I merge this new index with the master one (using, I think, the default > > merge factor), which takes about 4.5 minutes (during which time the cpu > > utilization stays near 100%). The master index currently holds about > > 115,000 such documents. > > > > HTH, > > > > Regards, > > > > Terry > > > > ----- Original Message ----- > > From: "Scott Smith" <[EMAIL PROTECTED]> > > To: <[EMAIL PROTECTED]> > > Sent: Monday, January 05, 2004 10:26 PM > > Subject: Performance question > > > > > > > I have an application that is reading in XML files and indexing them. > > Each > > > XML file is 3K-6K bytes. This application preloads a database that I > > > will add to "on the fly" later. However, all I want it to do > > > initially is take some existing files and create the initial index as > > > quick as I can. > > > > > > Since I want to index "on the fly" later, I set the merge factor to > > > 10. > > I'm > > > assuming that I can't create the index initially with one merge factor > > > (e.g., 100) and then change the merge factor later (true?). > > > > > > What I see is that it takes 1-3 seconds per xml file to do the index. > > This > > > means I'm indexing around 150k bytes per minute. I also notice that > > > the > > CPU > > > utilization rarely exceeds 5% (looking at task manager on a Windows > > > box). > > I > > > use Xerces to read in the files (SAX interface) and I don't close or > > > optimize the index between stories nor do I sleep anyplace. I've > > > looked > > at > > > the page fault numbers and they aren't changing much. I guess I would > > have > > > expected that I would have pretty much pegged the CPU and seen much > > > faster indexing. > > > > > > Any ideas/suggestions? > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > -- > Dror Matalon > Zapatec Inc > 1700 MLK Way > Berkeley, CA 94709 > http://www.fastbuzz.com > http://www.zapatec.com > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
