Grahem,

Your case seems quite extreme. I will share two approaches we have used
to solve a similar situation, where the time within which new information
needs to appear in the search results is shorter than the indexing cycle.
This may not be the final answer for you, but it can at least offer some
insight:

* First approach: we created a table in a standard SQL database that holds
an indexing backlog. When a search occurs, both sources are searched (the
Lucene indexes and the SQL database). The SQL search certainly has some
limitations compared to the Lucene one, but at least there was no
information gap for the user.
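
To illustrate the idea, here is a minimal sketch of that dual-source search. The names (Doc, merge, and the in-memory lists standing in for the Lucene hits and the SQL backlog rows) are illustrative, not a real API; the point is only that the backlog copy of a document wins over the stale indexed copy when both sources return it:

```java
import java.util.*;

// Sketch of the dual-source search: hits from the Lucene index are
// merged with rows from the SQL "indexing backlog" table, so documents
// that have not been indexed yet are still visible to the user.
public class DualSourceSearch {
    record Doc(String id, String text, long version) {}

    // Merge the two result sets, preferring the backlog copy when the
    // same id appears in both, since the backlog holds the most recent
    // (not-yet-indexed) version of the document.
    static List<Doc> merge(List<Doc> indexHits, List<Doc> backlogHits) {
        Map<String, Doc> byId = new LinkedHashMap<>();
        for (Doc d : indexHits) byId.put(d.id(), d);
        for (Doc d : backlogHits) byId.put(d.id(), d); // backlog wins
        return new ArrayList<>(byId.values());
    }

    public static void main(String[] args) {
        List<Doc> fromIndex = List.of(new Doc("a", "old comment", 1),
                                      new Doc("b", "report", 1));
        List<Doc> fromBacklog = List.of(new Doc("a", "new comment", 2));
        List<Doc> merged = merge(fromIndex, fromBacklog);
        System.out.println(merged.size());        // 2
        System.out.println(merged.get(0).text()); // new comment
    }
}
```

In practice the backlog table is drained into the Lucene index on each indexing cycle, so the SQL side stays small and its weaker query capabilities matter less.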

* Second approach: we used live incremental indexing. A continuously
running application checked for updates once a minute. When an update
occurred before the minute elapsed, the update source sent a signal to the
update-check application, triggering the refresh immediately. This produced
a new (refreshed) index, replacing only the new or changed documents. A
second phase optimized the index. Keeping these pieces synchronized was
challenging, but it worked.
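
The wake-up logic of that update-check loop can be sketched like this. The class and method names are hypothetical and the actual index rewriting is elided; the sketch only shows how a blocking queue gives you both behaviors at once, waking early when an update source signals, or on the timer otherwise:

```java
import java.util.concurrent.*;

// Sketch of the update-check loop: it runs once per interval, but an
// update source can signal it to run immediately instead of waiting.
public class UpdateChecker {
    private final BlockingQueue<String> signals = new LinkedBlockingQueue<>();
    volatile int checksRun = 0;

    // Called by an update source when a document changes before the
    // next scheduled check, so the refresh happens right away.
    public void signalUpdate(String docId) { signals.offer(docId); }

    // One cycle: wait up to intervalMs for a signal, then refresh the
    // index (elided here). Returns true if woken by a signal rather
    // than by the timer expiring.
    public boolean runOneCycle(long intervalMs) {
        String docId;
        try {
            docId = signals.poll(intervalMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        checksRun++; // in the real system: reindex changed docs here
        return docId != null;
    }

    public static void main(String[] args) {
        UpdateChecker checker = new UpdateChecker();
        checker.signalUpdate("doc-42");
        boolean signaled = checker.runOneCycle(60_000); // returns at once
        System.out.println(signaled); // true
    }
}
```

The synchronization Vitor mentions comes in when the refreshed segments are swapped in and the optimize phase runs, since searches must keep hitting a consistent index while the replacement happens.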

Regards,

Vitor

On Fri, Jun 5, 2009 at 2:10 PM, Grahem Cuthbertson <
[email protected]> wrote:

>
> Hi all,
>  We use Lucene.net to index ~1.5 million documents per day (24hr cycle)
> which is expected to grow considerably. We have had to take a lot into
> consideration regarding sharding our indexes and tiering the shards - but
> it's time to evolve our design. Right now we are running into a few problems
> with performance as we try to keep each tier around 1-1.5Gigs (or roughly
> 300K documents per index).....this is starting to cause problems with search
> performance as poor ParallelMultiSearcher is faced with, in some cases, 42
> indexes for a given shard (roughly 50gigs for that shards' indexes). Our
> "index nodes" only contain the index and search components while the
> physical indexes live elsewhere (which we will be changing). We have roughly
> .5TB of indexes and NRT indexing is an imaginary fairytale. Depending on
> traffic, indexing can be delayed by hours (the best we ever ran was < 10
> minutes). So we have a lot of work to do ;) - I can at least say that
> searches themselves are still sub-second post warming :)
>
> Given that background, my question relates to dealing with indexing static
> vs dynamic data in an environment similar to mine. Everything we do thus far
> involves taking an immutable meta file and indexing it (these relate to
> various content types like mail, docs, media, etc) . We would like to start
> adding user generated content to these meta files and indexing that, but the
> user isn't going to want to wait for an hour to be able to search on some
> comment they made to a document. I've toyed with a few ideas, like
> separating indexes based on static vs dynamic data (basically field
> partitioning) and then using a SpanOrFilter against the main indexes or
> possibly merging the hits from two separate queries, but the problem is that
> users comment on tens to hundreds of thousands of documents at the same time
> (i.e. the same comment applied to 50K documents) - and in those scenarios,
> applying a filter runs at best 12 seconds (never mind how long it takes to
> index the dynamic data). I seriously don't think adding the dynamic data to
> the main indexes is going to be a solution either (even though searches
> would be fast)...........
>
> so finally the question - What are the best practices on partitioning lots
> of 1's and 0's that contain both static and dynamic data? Is anyone else out
> there facing similar challenges and would you mind sharing your approach?
> Anybody got some good article links (that don't include the Lucene FAQ,
> ImprovingIndexSpeed or ImprovingSearchSpeed)?
>
> Thank You!
>
>
>
