Thanks Otis!

What do you mean by building it in batches? Does that
mean I should close the IndexWriter every 1000 rows and
reopen it? Does that release the references to the
document objects so that they can be garbage-collected?
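Is something like the sketch below roughly what you have
in mind? (Just to check my understanding -- BatchSize and
IndexPath are placeholders, and I'm assuming the DotLucene
IndexWriter/StandardAnalyzer signatures match the Java
ones.)

// Sketch only: close and reopen the writer every BatchSize rows
// so the previous batch's Document objects can be collected.
const int BatchSize = 1000;
int count = 0;
IndexWriter iw = new IndexWriter(IndexPath, new StandardAnalyzer(), true);
while (rdr.Read()) {
    IndexRow(rdr, iw);          // the IndexRow method from my original mail
    if (++count % BatchSize == 0) {
        iw.Close();             // flush buffered docs and drop references
        iw = new IndexWriter(IndexPath, new StandardAnalyzer(), false);
    }
}
iw.Optimize();                  // only once, at the very end
iw.Close();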

I'm calling optimize() only at the end.

I agree that 1500 documents is very small. I'm building
the index on a PC with 512 MB of RAM, and the indexing
process quickly gobbles up around 400 MB by the time I've
indexed about 1800 documents, at which point the whole
machine grinds to a virtual halt. I'm using the latest
DotLucene .NET port, so maybe there's a memory leak in it.

I have experience with AltaVista Search (acquired by
FastSearch), where I used to call MakeStable() every
20,000 documents to flush the in-memory structures to
disk. There doesn't seem to be an equivalent in Lucene.
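The closest knob I can see is the minMergeDocs parameter
you point to in (4) below. If DotLucene keeps the same
public field as Java Lucene 1.4 (I haven't verified that),
I'd guess the usage is something like:

// Guess, assuming DotLucene exposes Java Lucene's public minMergeDocs
// field: buffer roughly this many documents in RAM before flushing a
// new segment to disk.
IndexWriter iw = new IndexWriter(IndexPath, new StandardAnalyzer(), true);
iw.minMergeDocs = 1000;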

-- Homam






--- Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:

> Hello,
> 
> There are a few things you can do:
> 
> 1) Don't just pull all rows from the DB at once.  Do
> that in batches.
> 
> 2) If you can get a Reader from your SqlDataReader,
> consider this:
> 
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)
> 
> 3) Give the JVM more memory to play with by using the
> -Xms and -Xmx JVM parameters.
> 
> 4) See IndexWriter's minMergeDocs parameter.
> 
> 5) Are you calling optimize() at some point by any
> chance?  Leave that
> call for the end.
> 
> 1500 documents with 30 columns of short
> String/number values is not a
> lot.  You may be doing something else not Lucene
> related that's slowing
> things down.
> 
> Otis
> 
> 
> --- "Homam S.A." <[EMAIL PROTECTED]> wrote:
> 
> > I'm trying to index a large number of records from
> > the DB (a few millions). Each record will be stored
> > as a document with about 30 fields, most of which are
> > UnStored and represent small strings or numbers. No
> > huge DB Text fields.
> > 
> > But I'm running out of memory very fast, and the
> > indexing is slowing down to a crawl once I hit
> > around 1500 records. The problem is that each
> > document holds references to the string objects
> > returned from ToString() on the DB fields, and the
> > IndexWriter holds references to all these document
> > objects in memory, so the garbage collector isn't
> > getting a chance to clean them up.
> > 
> > How do you guys go about indexing a large DB table?
> > Here's a snippet of my code (this method is called
> > for each record in the DB):
> > 
> > private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
> >     Document doc = new Document();
> >     for (int i = 0; i < BrowseFieldNames.Length; i++) {
> >         doc.Add(Field.UnStored(BrowseFieldNames[i],
> >                                rdr.GetValue(i).ToString()));
> >     }
> >     iw.AddDocument(doc);
> > }