Hi all,
 We use Lucene.net to index ~1.5 million documents per day (24-hour cycle), a
volume that is expected to grow considerably. We have had to take a lot into
consideration regarding sharding our indexes and tiering the shards, but it's
time to evolve our design. Right now we are running into a few performance
problems as we try to keep each tier at around 1-1.5 GB (roughly 300K documents
per index). This is starting to hurt search performance, as poor
ParallelMultiSearcher is faced with, in some cases, 42 indexes for a given
shard (roughly 50 GB for that shard's indexes). Our "index nodes" only contain
the index and search components, while the physical indexes live elsewhere
(which we will be changing). We have roughly 0.5 TB of indexes, and NRT indexing
is an imaginary fairytale. Depending on traffic, indexing can be delayed by
hours (the best we ever ran was < 10 minutes). So we have a lot of work to do
;) - I can at least say that searches themselves are still sub-second post
warming :)
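
For a rough sense of scale, the figures above can be worked through as
follows. This is back-of-the-envelope arithmetic only: the variable names are
made up for illustration, and it assumes every index is filled to about 300K
documents before a new one is started.

```python
# Back-of-the-envelope arithmetic using the figures quoted above.
# Assumption: each index holds ~300K documents before a new one is started.
docs_per_day = 1_500_000     # ~1.5 million documents indexed per day
docs_per_index = 300_000     # roughly 300K documents per index
indexes_in_shard = 42        # worst case mentioned for a single shard
shard_size_gb = 50           # roughly 50 GB for that shard's indexes

new_indexes_per_day = docs_per_day / docs_per_index
docs_in_shard = indexes_in_shard * docs_per_index
avg_index_size_gb = shard_size_gb / indexes_in_shard

print(f"new indexes per day:      {new_indexes_per_day:.0f}")
print(f"documents in the shard:   {docs_in_shard:,}")
print(f"average index size (GB):  {avg_index_size_gb:.2f}")
```

The derived average index size falls inside the 1-1.5 GB target, so the quoted
numbers are at least consistent with each other.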
 
Given that background, my question relates to indexing static vs
dynamic data in an environment similar to mine. Everything we do thus far
involves taking an immutable meta file and indexing it (these relate to various
content types like mail, docs, media, etc.). We would like to start adding
user-generated content to these meta files and indexing that, but the user isn't
going to want to wait an hour to be able to search on some comment they
made on a document. I've toyed with a few ideas, like separating indexes
based on static vs dynamic data (basically field partitioning) and then using a
SpanOrFilter against the main indexes, or possibly merging the hits from two
separate queries. The problem is that users comment on tens to hundreds of
thousands of documents at the same time (i.e. the same comment applied to 50K
documents), and in those scenarios applying a filter takes 12 seconds at best
(never mind how long it takes to index the dynamic data). I seriously don't
think adding the dynamic data to the main indexes is going to be a solution
either (even though searches would be fast).
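
To make the "merging the hits from two separate queries" idea concrete, here
is a minimal sketch in plain Python rather than Lucene.net code. It assumes
both the static and the dynamic index expose a shared document id, and that
adding the two scores is an acceptable way to rank merged hits (scores from
separate indexes are not directly comparable, so treat that as a placeholder).
The function name merge_hits and the require_both flag are hypothetical.

```python
from typing import Iterable, List, Tuple

Hit = Tuple[str, float]  # (shared document id, score) -- assumed shape


def merge_hits(static_hits: Iterable[Hit],
               dynamic_hits: Iterable[Hit],
               require_both: bool = False) -> List[Hit]:
    """Merge hits from a static-index query and a dynamic-index query.

    require_both=True keeps only documents matched by both queries (AND);
    otherwise documents matched by either query are kept (OR).
    """
    static_scores = dict(static_hits)
    dynamic_scores = dict(dynamic_hits)

    if require_both:
        doc_ids = static_scores.keys() & dynamic_scores.keys()
    else:
        doc_ids = static_scores.keys() | dynamic_scores.keys()

    # Placeholder ranking: sum the per-index scores, missing side counts as 0.
    merged = [
        (doc_id, static_scores.get(doc_id, 0.0) + dynamic_scores.get(doc_id, 0.0))
        for doc_id in doc_ids
    ]
    # Highest score first; ties broken by document id for a stable order.
    merged.sort(key=lambda hit: (-hit[1], hit[0]))
    return merged


static_hits = [("doc1", 1.0), ("doc2", 0.5)]
dynamic_hits = [("doc2", 0.25), ("doc3", 0.125)]

either = merge_hits(static_hits, dynamic_hits)
both = merge_hits(static_hits, dynamic_hits, require_both=True)
print(either)
print(both)
```

The merge itself is cheap because it only touches the hits returned by each
query; it says nothing about the cost of indexing a comment that is applied to
50K documents, which remains the harder part of the problem.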
 
So, finally, the question: what are the best practices for partitioning lots of
1's and 0's that contain both static and dynamic data? Is anyone else out there
facing similar challenges, and would you mind sharing your approach? Has anybody
got some good article links (other than the Lucene FAQ, ImprovingIndexSpeed
or ImprovingSearchSpeed)?
 
Thank You!

 
