Hi all,
We use Lucene.net to index ~1.5 million documents per day (24hr cycle), a volume
that is expected to grow considerably. We have had to take a lot into consideration
regarding sharding our indexes and tiering the shards, but it's time to evolve
our design. Right now we are running into a few performance problems as we
try to keep each tier around 1-1.5 GB (or roughly 300K documents per
index). This is starting to hurt search performance, as poor
ParallelMultiSearcher is faced with, in some cases, 42 indexes for a given
shard (roughly 50 GB for that shard's indexes). Our "index nodes" only contain
the index and search components while the physical indexes live elsewhere
(which we will be changing). We have roughly 0.5 TB of indexes, and NRT indexing
is an imaginary fairytale. Depending on traffic, indexing can be delayed by
hours (the best delay we ever achieved was under 10 minutes). So we have a lot
of work to do ;) - I can at least say that searches themselves are still
sub-second once the searchers are warmed :)
Given that background, my question relates to dealing with indexing static vs
dynamic data in an environment similar to mine. Everything we have done so far
involves taking an immutable meta file and indexing it (these relate to various
content types like mail, docs, media, etc.). We would like to start adding
user-generated content to these meta files and indexing that, but the user isn't
going to want to wait an hour to be able to search on a comment they
made on a document. I've toyed with a few ideas, like separating indexes
based on static vs dynamic data (basically field partitioning) and then using a
SpanOrFilter against the main indexes, or possibly merging the hits from two
separate queries. The problem is that users comment on tens to hundreds of
thousands of documents at the same time (i.e. the same comment applied to 50K
documents), and in those scenarios applying a filter takes at best 12 seconds
(never mind how long it takes to index the dynamic data). I seriously don't
think adding the dynamic data to the main indexes is going to be a solution
either (even though searches would be fast).
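To make the "merge the hits from two separate queries" idea concrete, here is a
rough, language-agnostic sketch in Python. It is not our actual code and does not
use any Lucene API. It assumes each index returns a map of a shared document ID to
a score, and it assumes that summing the two scores is an acceptable way to
combine them; the function name and parameters are hypothetical.

```python
from typing import Dict, List, Tuple


def merge_hits(
    static_hits: Dict[str, float],
    dynamic_hits: Dict[str, float],
    require_both: bool = False,
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Combine hits from a static index and a dynamic index.

    Both inputs map a shared document ID to a score (an assumption; the
    two indexes must agree on the ID). Scores are summed, which is only
    one possible way to combine them.
    """
    if require_both:
        # Filter-like behaviour: keep only documents matched by both queries.
        doc_ids = static_hits.keys() & dynamic_hits.keys()
    else:
        # OR-like behaviour: keep documents matched by either query.
        doc_ids = static_hits.keys() | dynamic_hits.keys()

    merged = {
        doc_id: static_hits.get(doc_id, 0.0) + dynamic_hits.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest score first; ties are broken by document ID for stable output.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

The catch is the same one described above: when one comment applies to 50K
documents, the dynamic side of the merge carries 50K IDs, so the cost of the
merge grows with the size of that set.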
So, finally, the question: what are the best practices for partitioning large
volumes of index data that contain both static and dynamic data? Is anyone else
out there facing similar challenges, and would you mind sharing your approach?
Does anybody have some good article links (other than the Lucene FAQ,
ImprovingIndexSpeed, or ImprovingSearchSpeed)?
Thank You!