Hi Pete. Many thanks for this advice. It seems that a cluster spread over a number of lower-end servers would best solve this. From what I have read on large indexing, this seems to be the usual approach (but with as much RAM as possible per server). I am watching costs, so the lower-end 2 GB RAM servers are attractive; I would just use more of them.

I have only used PyLucene for tests on smaller indexes. Is a cluster arrangement possible using PyLucene? I am not a Java programmer, so I would like to stay with what I know. Many thanks.

Regards,
David

Pete wrote:
On Thursday April 5 2007 9:33 am, David Pratt wrote:
I realize that the amount of RAM needed will be based on the size of the
index, how many documents and what you are storing in the index itself -
but some anecdotal information would be helpful. I am looking at an
index that could reach 20 - 50 million documents. Will a commodity
server with 2 GB be enough?

IIRC, it's more a function of how quickly you're adding data than of total size, though this may not hold when merging segments (aka optimizing). A fast disk helps quite a lot too. You'll want to configure the IndexWriter for bulk loading. The relevant items are setMergeFactor, which controls how often segments are merged on disk, and setMaxBufferedDocs, which controls how many docs are held in RAM before being written out. A higher value for both will be faster, though be aware that an index built with a high merge factor is slower to query, so you'd probably want to optimize() at the end. On our indexing server, with ~4 KB documents, setMaxBufferedDocs(200) uses about 700 MB of RAM. See the Javadocs & Lucene in Action for more details.
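
As a rough illustration of the bulk-loading settings described above, here is a minimal Python sketch. The helper names configure_for_bulk_load and bulk_index are made up for this example, and the merge factor of 50 is only an assumed placeholder to tune by testing; the 200 for setMaxBufferedDocs is the value mentioned above. The writer argument is assumed to be an already opened PyLucene IndexWriter, or any object exposing the same setMergeFactor, setMaxBufferedDocs, addDocument, optimize and close methods.

```python
def configure_for_bulk_load(writer, merge_factor=50, max_buffered_docs=200):
    """Apply bulk-loading settings to an IndexWriter-like object.

    merge_factor=50 is an assumed illustrative value, not a recommendation;
    max_buffered_docs=200 is the figure quoted in the message above.
    """
    # Higher merge factor: segments are merged on disk less often.
    writer.setMergeFactor(merge_factor)
    # More buffered docs: more documents held in RAM before a flush.
    writer.setMaxBufferedDocs(max_buffered_docs)
    return writer


def bulk_index(writer, documents):
    """Add every document, then optimize once at the end and close."""
    count = 0
    for doc in documents:
        writer.addDocument(doc)
        count += 1
    # A high merge factor leaves many segments behind, which slows queries,
    # so merge them down once the bulk load has finished.
    writer.optimize()
    writer.close()
    return count
```

Calling optimize() only once, after the last addDocument, keeps the expensive merge out of the loading loop.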

On the searching front, a dedicated commodity box w/ 2 GB can probably serve around 2 million documents (again, depending on document size). Multiple CPUs will let you serve more simultaneous queries.
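
To put the rough figure above next to the 20 - 50 million document range from the question, here is a back-of-the-envelope sketch. It assumes the "around 2 million documents per 2 GB box" estimate scales linearly across servers, which is only a guess until it is checked against a test index; servers_needed is a name made up for this example.

```python
def servers_needed(total_docs, docs_per_server=2000000):
    """Number of search boxes needed, rounding up to a whole server.

    docs_per_server defaults to the rough "around 2 million documents per
    2 GB box" estimate above; linear scaling is an assumption.
    """
    # Integer ceiling division: a partly filled server still counts as one.
    return -(-total_docs // docs_per_server)


for total in (20000000, 50000000):
    print(total, "documents ->", servers_needed(total), "servers")
```

Under that assumption the range works out to somewhere between 10 and 25 search boxes, before allowing for replication or spare capacity.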

I guess it is possible to build a test index with sample data to
determine this also. Many thanks.

You should probably ask the Lucene list, but please report any test results here as well (you could put them on the wiki too).

_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
