Hi chaps, Just looking for some ideas/experience as to how to improve our current architecture.
We have a single-index system containing approx. 2.5 million docs of about 1-3k each. The Lucene implementation is a daemon and it services requests on a port in multi-threaded manner, and it runs on a fairly new dual cpu box with 2G of ram. Although I have the jvm using ~1.5G, this system does fairly regularly crash with 'out of memory' errors. It's hard to see the exact conditions at that point as to cause, but I'm guessing it's simply a number of users executing queries which return large resultsets, and then require sorting (just about all queries are sorted by reverse date, using a field), so chewing up too much memory. This index is updated frequently, since it is a news site, so this makes the use of cacheing filters problematic. Typically about 1500 articles come in per day, and during working hours you'd see them popping in maybe every few seconds, with longer periods interspersed fairly randomly. Access to these new articles is expected to be 'immediate' for folks doing searches. The nature of this area is such that a great deal of activity focusses on 'recent' news, in particular the last 24 hours, then the last week, and perhaps the last month in that order. With that in mind I had the idea of creating a dual-index architecture "recent" and "archive", where the "recent" index holds approx. the most recent 30 days and the "archive" holds the rest. But there are several refinements on this, and I wondered if anyone else out there has already solved or at least tackled this problem and has any suggestions. For example, here is one idea for how the above might operate: At a defined point in time, the 30-day index is generated. For us this is easy. Our article bodies are all stored out on disk, timestamped, and we can simply generate a list newer than a certain date and index these to a brand new index. At the same time, the "archive" index is merged with the existing 30-day index, to make an updated "archive index. The system then operates by indexing to the 30-day index and directing searches to it where date-range is appropriate, otherwise to the archive index. We would then operate in this mode for a week or so before refreshing the indexes again. So searching and sorting would then mostly be done on an index which has around 45,000 docs in it rather than 2.5 million. I'm supposing that this will be massively faster to operate with both indexing and searching/sorting. Any comments from anyone on this would be very much appreciated. Cheers, Paul. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
