Hi guys! I have noticed many questions on the list concerning Lucene's sorting memory consumption, and I hope my solution can help someone.

I faced a memory/time consumption problem with sorting in Lucene back in April. With the help of this list's experts I came to a solution I like: documents from the sorted result set (instead of the given field's values from the whole index) are lazily cached in a WeakHashMap, so the cached items are candidates for GC. I haven't created a patch yet (though I am going to), but the code is ready to be reused, and it has been in use for a while as part of my open-source project, sharehound (http://sharehound.sourceforge.net). The code can be reached with a CVS browser; it sits in 4 classes in subdirectories of http://sharehound.cvs.sourceforge.net/sharehound/jNetCrawler/src/java/org/apache/lucene/. Both the classes (as part of sharehound.jar) and the sources can also be downloaded with the latest (1.1.3 alpha) sharehound release zip file.

The LazyCachingSortFactory class has an example of use in its header comment, which I duplicate here:

/**
 * Creates a Sort that doesn't use FieldCache and therefore doesn't load all the
 * field values from the index while sorting. Instead it uses CachingIndexReader
 * (via CachingDocFieldComparatorSource) to cache the documents from which field
 * values are fetched. If the calling Searcher is itself based on a
 * CachingIndexReader, its document cache will be used here.
 *
 * An example of use:
 *
 *   ...
 *   hits = getQueryHits(query, getSort(listSorting.getSortFieldName(),
 *                                      listSorting.isSortDescending()));
 *   ...
 *   public Sort getSort(String sortFieldName, boolean sortDescending) {
 *       return LazyCachingSortFactory.create(sortFieldName, sortDescending);
 *   }
 *   ...
 *   protected Hits getQueryHits(Query query, Sort sort) throws IOException {
 *       IndexSearcher indexSearcher = new IndexSearcher(
 *           CachingIndexReader.decorateIfNeeded(IndexReader.open(getIndexDir())));
 *       return indexSearcher.search(query, sort);
 *   }
 */
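For those who would rather not check out the CVS tree, here is a minimal sketch of the idea. It is illustrative only, not the exact sharehound code: I assume the stock FilterIndexReader API, and the class and method names simply mirror the usage shown above.

// Sketch of the CachingIndexReader idea: a FilterIndexReader that lazily
// caches loaded documents in a WeakHashMap so the GC may reclaim entries.
import java.io.IOException;
import java.util.Map;
import java.util.WeakHashMap;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;

public class CachingIndexReader extends FilterIndexReader {

    // doc id -> Document; weakly referenced keys keep entries GC-collectable
    private final Map cache = new WeakHashMap();

    public CachingIndexReader(IndexReader in) {
        super(in);
    }

    // Wrap the reader unless it is already a caching one.
    public static IndexReader decorateIfNeeded(IndexReader reader) {
        return (reader instanceof CachingIndexReader)
                ? reader
                : new CachingIndexReader(reader);
    }

    public synchronized Document document(int n) throws IOException {
        Integer key = new Integer(n);
        Document doc = (Document) cache.get(key);
        if (doc == null) {
            doc = super.document(n); // loaded lazily, on first access only
            cache.put(key, doc);
        }
        return doc;
    }
}

The Sort itself can then be built over a comparator source that reads sort values from those cached documents instead of from FieldCache. Again a sketch, assuming every document stores a string value for the sort field:

// Sketch of the CachingDocFieldComparatorSource idea: compare hits by a
// stored field's value, fetched through the (caching) reader per document.
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.ScoreDocComparator;
import org.apache.lucene.search.SortComparatorSource;
import org.apache.lucene.search.SortField;

public class CachingDocFieldComparatorSource implements SortComparatorSource {

    public ScoreDocComparator newComparator(final IndexReader reader,
                                            final String fieldName)
            throws IOException {
        return new ScoreDocComparator() {

            private String value(ScoreDoc d) {
                try {
                    // goes through CachingIndexReader's WeakHashMap when
                    // the reader has been decorated as shown above
                    return reader.document(d.doc).get(fieldName);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }

            public int compare(ScoreDoc i, ScoreDoc j) {
                return value(i).compareTo(value(j));
            }

            public Comparable sortValue(ScoreDoc i) {
                return value(i);
            }

            public int sortType() {
                return SortField.STRING;
            }
        };
    }
}

With that in place, LazyCachingSortFactory.create(sortFieldName, sortDescending) amounts to new Sort(new SortField(sortFieldName, new CachingDocFieldComparatorSource(), sortDescending)).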
This solution does a great job of saving memory, and for relatively small search result sets it is almost as fast as the default implementation. Note that this solution currently supports single-field sorting only.

Best regards,
Artem

OG> This is unlikely to work well/fast. It will depend on the
OG> size of the index (not in terms of the number of docs, but its
OG> physical size), the number of queries/second and the desired query
OG> latency. If you can wait 10 seconds to get a query back and if only a
OG> few queries are hitting the server at any one time, then you may
OG> be OK. Having things be up to date with non-relevancy sorting
OG> will be quite tough. FieldCache will consume some RAM. Warming
OG> it up will take some number of seconds. Re-opening an
OG> IndexSearcher after index changes will also cost you a bit of time.
OG>
OG> Consider a 64-bit server with more RAM that allows larger
OG> Java heaps, and try to fit your index into RAM.
OG>
OG> Otis
OG>
OG> ----- Original Message ----
OG> From: Mark Miller <[EMAIL PROTECTED]>
OG> To: java-user@lucene.apache.org
OG> Sent: Saturday, August 12, 2006 7:45:15 PM
OG> Subject: Re: 30 million+ docs on a single server
OG>
OG> The single server is important because I think it will take a lot of
OG> work to scale it to multiple servers. The index must allow for close to
OG> real-time updates and additions. It must also remain searchable at all
OG> times (other than during the brief period of single updates and
OG> additions). If it is easy to scale this to multiple servers, please tell
OG> me how.
OG>
OG> - Mark

>> Why is a single server so important? I can scale horizontally much
>> cheaper than I can scale vertically.
>>
>> On 8/11/06, Mark Miller <[EMAIL PROTECTED]> wrote:
>>>
>>> I've made a nice little archive application with Lucene. I made it to
>>> handle our largest need: 2.5 million docs or so on a single server. Now
>>> the powers that be say: let's use it for a 30+ million document archive
>>> on a single server! (Each doc is maybe 10k max, and as small as 1 or
>>> 2k.) Please tell me why we are in trouble... please tell me why we are
>>> not. I have tested up to 2 million docs without much trouble, but 30
>>> million... The average search will include a sort on a field as well.
>>> Can I search 30+ million docs with a sort? Man, am I worried about
>>> that. Maybe the server will have 8 procs and 12 billion gigs of RAM.
>>> Maybe. Even still, Tomcat seems to be able to launch with a max of 1.5
>>> or 1.6 gigs of RAM on Windows. What do you think? 30 million+ sounds
>>> like too much of a load for a single server to me. Not that they care
>>> what I think... I only wrote the thing (man, I hate my job, offer me a
>>> new one :) )... please... comments?
>>>
>>> Cheers,
>>>
>>> Miserable Mark
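P.S. Regarding Otis's point above about FieldCache warm-up and reopening after index changes: a minimal sketch of that reopen-and-warm pattern, using only stock Lucene calls. The "date" field, the class name and the index path are illustrative placeholders, not part of any existing code.

// After index changes, open a fresh searcher and run one throwaway sorted
// query so FieldCache is populated before real traffic hits the searcher.
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class SearcherWarmer {

    public static IndexSearcher reopenAndWarm(String indexDir) throws Exception {
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(indexDir));
        // the first sorted search pays the FieldCache loading cost up front
        searcher.search(new MatchAllDocsQuery(),
                        new Sort(new SortField("date", SortField.STRING)));
        return searcher;
    }
}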