On Sat, 2011-09-03 at 20:09 +0200, Michael Bell wrote:
> To be exact, there are about 300 million documents. This is running on a 64
> bit JVM/64 bit OS with 24 GB(!) RAM allocated.

How much memory is allocated to the JVM?

> Now, their searches are working fine IF you do not SORT the results. If you
> do SORT, you get stuff like
>
> 2011-08-30 13:01:31,489 [TP-Processor8] ERROR
> com.gwava.utils.ServerErrorHandlerStrategy - reportError: nastybadthing ::
> com.gwava.indexing.lucene.internal.LuceneSearchController.performSearchOperation:229
> :: EXCEPTION : java.lang.OutOfMemoryError: Requested array size exceeds VM
> limit java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at
> org.apache.lucene.search.FieldCacheImpl$StringIndexCache.createValue(FieldCacheImpl.java:624)
[...]
> Looking at the sort class, the api docs appear to say it would create an
> element of 1.2 billion items (4*300m).

The StringIndexCache in Lucene 3 keeps two arrays in memory: int[#docs]
and String[#docs+1]. With 300M documents that is 1.2 billion bytes for
the int-array, which should not be a problem for the machine.
Unfortunately the String-array is a big problem. Keeping in mind that a
String in Java takes up approximately 50 + 2 * length bytes, and setting
the average length of the terms to 10 chars, the Strings take up a
maximum of 300M * (50 + 2 * 10) bytes = 21,000 MByte, or about 20 GByte.
In reality it is not that bad, as duplicates only count once, but the
problem should be obvious.

> Is this correct? Is the issue going beyond signed int32 limits of an array (
> 2 billion items) or is it really a memory issue? How best to diagnose?

Open your index with Luke and count the number of unique terms for your
sort field. Using the formula above, you'll get an estimate of the
memory required for sorting on String in Lucene 3.

The int32 limit is only for the number of unique terms, and there is a
maximum of one term/document when sorting. With 300M documents there's a
lot of room before that will be a problem.

If your field is numeric, changing the sort type should solve your
problem. If you really are comparing Strings, it is not so easy.
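
For what it is worth, the estimate above can be written out as a few
lines of Java. This is only a sketch of the arithmetic: the class and
method names are made up for illustration, the 50-byte String overhead
is the rough approximation used in this mail rather than a measured
value, and the reference slots of the String[] itself are ignored.

```java
public class SortMemoryEstimate {
    // Assumed per-String overhead in bytes (rough figure from this mail, not measured).
    static final long STRING_OVERHEAD_BYTES = 50;
    // Java chars are 16 bit.
    static final long BYTES_PER_CHAR = 2;
    static final long BYTES_PER_INT = 4;

    /** Size of the int[#docs] array: one int per document. */
    public static long intArrayBytes(long docs) {
        return docs * BYTES_PER_INT;
    }

    /** Size of the String objects: uniqueTerms * (50 + 2 * avgTermLength). */
    public static long stringBytes(long uniqueTerms, long avgTermLength) {
        return uniqueTerms * (STRING_OVERHEAD_BYTES + BYTES_PER_CHAR * avgTermLength);
    }

    public static void main(String[] args) {
        long docs = 300000000L;
        long uniqueTerms = docs; // worst case: every document has its own term
        long avgTermLength = 10; // assumed average term length in chars

        System.out.println("int[]   : " + intArrayBytes(docs) / 1000000 + " MByte");
        System.out.println("Strings : " + stringBytes(uniqueTerms, avgTermLength) / 1000000 + " MByte");
    }
}
```

Plugging in the unique term count reported by Luke instead of the
document count gives a less pessimistic figure.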

Lucene 4 is unfortunately not ready for production, but it has huge
improvements with regard to memory usage on sorting. If you are feeling
adventurous, you can take a look at
https://issues.apache.org/jira/browse/LUCENE-2369
which drastically reduces the memory needed for sorting. An experiment
with 200M unique terms required 1.7 GByte, with the trade-off that it
took 8 minutes to open the index. One of the earlier patches works
against Lucene 3, while the later ones are Lucene 4 only.