We're getting up there in terms of corpus size for our Lucene indexing application:

* 20 million documents
* all fields need to be stored
* 10 short fields / document
* 1 long free text field / document (analyzed with a custom shingle-based analyzer)
* 140GB total index size
* Optimized into a single segment
* Must run over NFS due to VMWare setup

I think I've already taken the most common steps to reduce memory requirements and increase performance on the searching side, including:

* omitting norms on all fields except two
* omitting term vectors
* indexing as few fields as possible
* reusing a single searcher
* splitting the index up into N shards for ParallelMultiSearcher

The application will run with 10G of -Xmx, but any less and it bails out. It seems happier if we feed it 12GB. The searches are starting to bog down a bit (5-10 seconds for some queries)...

Our next step was to deploy the shards as RemoteSearchables for the same ParallelMultiSearcher over RMI, but before I do that I'm curious:

* are there other ways to get that memory usage down?
* are there performance optimizations that I haven't thought of?

Thanks,
-Chris
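
P.S. For anyone less familiar with the sharded setup, the fan-out-and-merge pattern described above looks roughly like the sketch below. This is plain JDK code, not Lucene's API and not our actual implementation: the `Shard` interface, the `Hit` class, and `searchAll` are hypothetical names made up for illustration. Each shard is queried on its own thread and the per-shard hits are merged into one global top-N by score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShardFanOut {

    /** Hypothetical stand-in for one shard's searcher (not a Lucene type). */
    public interface Shard {
        List<Hit> search(String query, int topN) throws Exception;
    }

    /** Hypothetical scored hit: a document id plus its score. */
    public static final class Hit {
        public final String docId;
        public final float score;

        public Hit(String docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Query every shard concurrently, then merge into one global top-N by score. */
    public static List<Hit> searchAll(List<Shard> shards, String query, int topN)
            throws Exception {
        // One thread per shard, so a slow shard does not serialize the others.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Callable<List<Hit>>> tasks = new ArrayList<>();
            for (Shard shard : shards) {
                // Each shard only needs to return its own top-N candidates.
                tasks.add(() -> shard.search(query, topN));
            }

            // invokeAll blocks until every shard has answered (or failed).
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> result : pool.invokeAll(tasks)) {
                merged.addAll(result.get());
            }

            // Highest score first, then cut down to the global top-N.
            merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
        } finally {
            pool.shutdown();
        }
    }
}
```

The shape is the same whether a `Shard` wraps a local index or a remote call; only the latency of each `search` changes. Total query time is bounded by the slowest shard plus the merge.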
