Hi Chuck:
Thanks for your help and the info.
Through some experimentation, I found that calling
FSWriter.addIndexes(ramDirectory) actually performs a merge
with the existing index. So with 2000 batches of 500, as the index
grows after each batch, the time to do each merge increases.
I guess in this implementation, doing it this way is not optimal.
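
For reference, the pattern I was using looks roughly like this (a
simplified sketch against the 1.4-style API; the path, the analyzer,
and the nextBatch() helper are stand-ins):

    IndexWriter fsWriter = new IndexWriter("/data/index", analyzer, false);
    for (int batch = 0; batch < 2000; batch++) {
        // Build each batch of 500 docs in memory first.
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
        Document[] docs = nextBatch(500);      // hypothetical helper
        for (int i = 0; i < docs.length; i++) {
            ramWriter.addDocument(docs[i]);
        }
        ramWriter.close();
        // Merges ramDir with everything already on disk, so it gets
        // slower as the on-disk index grows.
        fsWriter.addIndexes(new Directory[] { ramDir });
    }
    fsWriter.close();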
Thanks
-John
On Sat, 27 Nov 2004 13:14:31 -0800, Chuck Williams <[EMAIL PROTECTED]> wrote:
> Hi John,
>
> I don't use a RAMDirectory and so don't have the answer for you. There
> have been a number of messages about RAMDirectory performance on
> lucene-user, including some reported benchmarks. Some people have
> reported a significant benefit from RAMDirectories, but most others have
> seen little or no benefit. I'm not sure which factors determine the
> nature or magnitude of the impact. You sent the message below just to me
> -- you might want to post a question on lucene-user.
>
> I've included below a couple of messages on the subject that I saved.
>
> Chuck
>
> Included messages:
>
> -----Original Message-----
> From: Jonathan Hager [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, November 24, 2004 2:27 PM
> To: Lucene Users List
> Subject: Re: Index in RAM - is it really worthy?
>
> When comparing RAMDirectory and FSDirectory it is important to mention
> which OS you are using. Linux caches the most recent disk accesses in
> memory. Here is a good article that describes its strategy:
> http://forums.gentoo.org/viewtopic.php?t=175419
>
> The 2% difference you are seeing is the memory copy. With other OSes
> you may see a speed-up when using the RAMDirectory, because not all
> OSes keep a disk cache in memory; those that don't must access the
> disk to read the index.
>
> Another consideration is that there is currently a 2GB limitation on
> the size of the RAMDirectory. Indexes over 2GB cause an overflow in
> the int used to create the buffer. [see int len = (int) is.length();
> in RAMDirectory]
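>
> For example, the narrowing cast loses the high bits once a file passes
> Integer.MAX_VALUE (a toy illustration, not the actual RAMDirectory code):
>
>     long length = 3L * 1024 * 1024 * 1024;  // a 3GB index file
>     int len = (int) length;                 // overflows
>     System.out.println(len);                // prints -1073741824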
>
> I ended up using a RAMDirectory for a very different reason. The index
> is 1 to 2MB and is rebuilt every few hours. It takes 3 to 4 minutes
> to query the database and rebuild the index. But the search should be
> available 100% of the time. Since the index is so small I do the
> following:
>
> on server startup:
> - look for the semaphore; if it is there, delete the index
> - if there is no index, build it to FSDirectory
> - load the index from FSDirectory into RAMDirectory
>
> on reindex:
> - create the semaphore
> - rebuild the index to FSDirectory
> - delete the semaphore
> - load the index from FSDirectory into RAMDirectory
>
> to search:
> - search the RAMDirectory
>
> RAMDirectory could be replaced by a regular FSDirectory, but it seemed
> silly to copy the index from disk to disk, when it ultimately needs to
> be in memory.
>
> FSDirectory could be replaced by a RAMDirectory, but this means that
> it would take the server 3 to 4 minutes longer to start up every time.
> By persisting the index, this time would only be necessary if indexing
> was interrupted.
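>
> In code, the load step is roughly the following (a sketch against the
> 1.4-era API, assuming RAMDirectory's copying constructor; the paths
> and the rebuild/delete helpers are placeholders):
>
>     File semaphore = new File("/data/index.rebuilding");
>     if (semaphore.exists()) {
>         deleteIndexDir("/data/index");            // hypothetical helper
>     }
>     if (!IndexReader.indexExists("/data/index")) {
>         rebuildIndexFromDatabase("/data/index");  // hypothetical helper
>     }
>     // Copy the on-disk index into RAM and search against that.
>     Directory ramDir = new RAMDirectory("/data/index");
>     Searcher searcher = new IndexSearcher(ramDir);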
>
> Jonathan
>
> On Mon, 22 Nov 2004 12:39:07 -0800, Kevin A. Burton
> <[EMAIL PROTECTED]> wrote:
> > Otis Gospodnetic wrote:
> >
> > > For the Lucene book I wrote some test cases that compare FSDirectory
> > > and RAMDirectory. What I found was that with certain settings
> > > FSDirectory was almost as fast as RAMDirectory. Personally, I would
> > > push FSDirectory and hope that the OS and the filesystem do their share
> > > of work and caching for me before looking for ways to optimize my code.
> > >
> > >
> > Yes... I performed the same benchmark and in my situation RAMDirectory
> > for searches was about 2% slower.
> >
> > I'm willing to bet that it has to do with the fact that it's a
> > Hashtable and not a HashMap (which isn't synchronized).
> > Also adding a constructor for the term size could make loading a
> > RAMDirectory faster since you could prevent rehash.
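> >
> > Something like this, I mean (illustrative only, not the actual
> > RAMDirectory internals; presizing the map avoids rehashes while loading):
> >
> >     int expectedEntries = 100000;                   // known up front
> >     Map entries = new HashMap(2 * expectedEntries); // capacity hint, no rehash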
> >
> > If you're on a modern machine your filesystem cache will end up
> > buffering your disk anyway, which I'm sure was happening in my
> > situation.
> >
> > Kevin
> >
> -----Original Message-----
> From: John Wang [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 22, 2004 12:35 PM
> To: Lucene Users List
> Subject: Re: Index in RAM - is it really worthy?
>
> In my test, I have 12900 documents. Each document is small: a few
> discrete fields (Keyword type) and 1 Text field containing only 1
> sentence.
>
> with both mergeFactor and maxMergeDocs set to 1000:
>
> using RAMDirectory, the indexing job took about 9.2 seconds
>
> not using RAMDirectory, the indexing job took about 122 seconds.
>
> I am not calling optimize.
>
> This is on Windows XP running Java 1.5.
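>
> (For reference, the setup was roughly the following; a sketch using
> the 1.4-style public fields, with a placeholder path and analyzer:)
>
>     IndexWriter writer = new IndexWriter("/tmp/test-index",
>                                          new StandardAnalyzer(), true);
>     writer.mergeFactor = 1000;   // merge less often
>     writer.maxMergeDocs = 1000;  // cap docs per merged segment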
>
> Is there something very wrong or different in my setup that causes
> such a big difference?
>
> Thanks
>
> -John
>
> On Mon, 22 Nov 2004 09:23:40 -0800 (PST), Otis Gospodnetic
> <[EMAIL PROTECTED]> wrote:
> > For the Lucene book I wrote some test cases that compare FSDirectory
> > and RAMDirectory. What I found was that with certain settings
> > FSDirectory was almost as fast as RAMDirectory. Personally, I would
> > push FSDirectory and hope that the OS and the filesystem do their share
> > of work and caching for me before looking for ways to optimize my code.
> >
> > Otis
> >
> >
> >
> > --- [EMAIL PROTECTED] wrote:
> >
> > >
> > > I did the following test:
> > > I created the RAM folder on my Red Hat box and copied c. 1GB of
> > > indexes there.
> > > I expected the queries to run much quicker.
> > > In reality it was even sometimes slower (sic!).
> > >
> > > Lucene has its own RAM disk functionality. If I implement it,
> > > would it bring any benefits?
> > >
> > > Thanks in advance
> > > J.
> >
> > -----Original Message-----
> > From: John Wang [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, November 27, 2004 11:50 AM
> > To: Chuck Williams
> > Subject: Re: URGENT: Help indexing large document set
> >
> > I found the reason for the degradation. It is because I was writing to
> > a RAMDirectory and then adding to an FSWriter. I guess it makes sense,
> > since the addIndexes call would slow down as the index grows.
> >
> > I guess it is not a good idea to use a RAMDirectory if there are many
> > small batches. Are there some performance numbers that would tell me
> > when to/not to use a RAMDirectory?
> >
> > thanks
> >
> > -John
> >
> >
> > On Wed, 24 Nov 2004 15:23:49 -0800, John Wang <[EMAIL PROTECTED]>
> > wrote:
> > > Hi Chuck:
> > >
> > > The reason I am not using localReader.delete(term) is because I
> > > have some logic to check whether to delete the term based on a flag.
> > >
> > > I am testing with the keys sorted.
> > >
> > > I am not doing anything weird, just committing 2000 batches of 500
> > > documents each to the index. I don't know why it is having this
> > > linear slowdown...
> > >
> > >
> > >
> > > Thanks
> > >
> > > -John
> > >
> > > On Wed, 24 Nov 2004 12:32:52 -0800, Chuck Williams <[EMAIL PROTECTED]> wrote:
> > > > Does keyIter return the keys in sorted order? This should reduce
> > > > seeks, especially if the keys are dense.
> > > >
> > > > Also, you should be able to call localReader.delete(term) instead of
> > > > iterating over the docs (of which I presume there is only one doc,
> > > > since keys are unique). This won't improve performance, as
> > > > IndexReader.delete(Term) does exactly what your code does, but it
> > > > will be cleaner.
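> > > >
> > > > I.e., something like this (a sketch; delete(Term) removes every
> > > > matching doc in one call):
> > > >
> > > >     while (keyIter.hasNext()) {
> > > >         Term term = new Term("key", (String) keyIter.next());
> > > >         localReader.delete(term);
> > > >     }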
> > > >
> > > > A linear slowdown with the number of docs doesn't make sense, so
> > > > something else must be wrong. I'm not sure what the default buffer
> > > > size is (it appears it used to be 128 but is dynamic now, I think).
> > > > You might find the slowdown stops after a certain point, especially
> > > > if you increase your batch size.
> > > >
> > > >
> > > >
> > > > Chuck
> > > >
> > > > > -----Original Message-----
> > > > > From: John Wang [mailto:[EMAIL PROTECTED]
> > > > > Sent: Wednesday, November 24, 2004 12:21 PM
> > > > > To: Lucene Users List
> > > > > Subject: Re: URGENT: Help indexing large document set
> > > > >
> > > > > Thanks Paul!
> > > > >
> > > > > Using your suggestion, I have changed the update check code to
> > > > > use only the indexReader:
> > > > >
> > > > > IndexReader localReader = null;
> > > > > try {
> > > > >     localReader = IndexReader.open(path);
> > > > >     while (keyIter.hasNext()) {
> > > > >         key = (String) keyIter.next();
> > > > >         term = new Term("key", key);
> > > > >         // Delete every existing doc whose key matches.
> > > > >         TermDocs tDocs = localReader.termDocs(term);
> > > > >         if (tDocs != null) {
> > > > >             try {
> > > > >                 while (tDocs.next()) {
> > > > >                     localReader.delete(tDocs.doc());
> > > > >                 }
> > > > >             } finally {
> > > > >                 tDocs.close();
> > > > >             }
> > > > >         }
> > > > >     }
> > > > > } finally {
> > > > >     if (localReader != null) {
> > > > >         localReader.close();
> > > > >     }
> > > > > }
> > > > >
> > > > >
> > > > > Unfortunately it didn't seem to make any dramatic difference.
> > > > >
> > > > > I also see the CPU is only 30-50% busy, so I am guessing it's
> > > > > spending a lot of time in IO. Any way of making the CPU work harder?
> > > > >
> > > > > Is a batch size of 500 too small for 1 million documents?
> > > > >
> > > > > Currently I am seeing a linear speed degradation of 0.3
> > > > > milliseconds per document.
> > > > >
> > > > > Thanks
> > > > >
> > > > > -John
> > > > >
> > > > >
> > > > > On Wed, 24 Nov 2004 09:05:39 +0100, Paul Elschot <[EMAIL PROTECTED]> wrote:
> > > > > > On Wednesday 24 November 2004 00:37, John Wang wrote:
> > > > > >
> > > > > >
> > > > > > > Hi:
> > > > > > >
> > > > > > > I am trying to index 1M documents, in batches of 500 documents.
> > > > > > >
> > > > > > > Each document has a unique text key, which is added as a
> > > > > > > Field.Keyword(name, value).
> > > > > > >
> > > > > > > For each batch of 500, I need to make sure I am not adding a
> > > > > > > document with a key that is already in the current index.
> > > > > > >
> > > > > > > To do this, I am calling IndexSearcher.docFreq for each document
> > > > > > > and deleting the document currently in the index with the same key:
> > > > > > >
> > > > > > > while (keyIter.hasNext()) {
> > > > > > >     String objectID = (String) keyIter.next();
> > > > > > >     term = new Term("key", objectID);
> > > > > > >     int count = localSearcher.docFreq(term);
> > > > > >
> > > > > > To speed this up a bit, make sure that the iterator gives
> > > > > > the terms in sorted order. I'd use an index reader instead
> > > > > > of a searcher, but that will probably not make a difference.
> > > > > >
> > > > > > Adding the documents can be done with multiple threads.
> > > > > > Last time I checked that, there was a moderate speed up
> > > > > > using three threads instead of one on a single CPU machine.
> > > > > > Tuning the values of minMergeDocs and maxMergeDocs
> > > > > > may also help to increase performance of adding documents.
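> > > > > >
> > > > > > For example (a sketch using the 1.4-style public fields; the
> > > > > > values are only indicative, not tuned):
> > > > > >
> > > > > >     writer.minMergeDocs = 1000; // buffer more docs in RAM per segment
> > > > > >     writer.mergeFactor = 50;    // merge segments less often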
> > > > > >
> > > > > > Regards,
> > > > > > Paul Elschot