Thanks, Aaron, for your prompt and detailed responses. You have been a real pleasure to collaborate with, and you always encourage me to ask more questions (however dumb they are). Below are a few comments.
On Mon, Apr 7, 2014 at 7:09 PM, Aaron McCurry <[email protected]> wrote:

> On Mon, Apr 7, 2014 at 8:35 PM, rahul challapalli <
> [email protected]> wrote:
>
> > Hi,
> >
> > I want to refresh my understanding, so here are a few imaginary
> > situations. Let's say I have a data ingestion rate of 25,000 docs per
> > second (average size 10 KB). In some situations I want to optimize for
> > indexing, and in others I want to optimize for search speed. How do I
> > control these individually?
>
> This is the biggest challenge of any search solution. I would say that
> Blur handles this by utilizing the hardware it's given to the best of
> its ability. That being said, 25K docs per second at 10 KB each as a
> constant input, day after day, is a big problem to deal with on the
> search side of the house. That's north of 2 billion docs a day at over
> 20 TB a day. Some questions I would ask: Are you expecting to keep all
> of the data online forever? What are the time-to-search requirements
> (visibility latency)? What kind of hardware are you expecting to run
> this on? All of this assumes that the 25K/second is a constant average.
> If the 25K/second is a burst from time to time throughout the day, that
> is likely a much easier question to answer.
>
> In either case, if you need near-real-time access to the documents
> (meaning you can't wait for a MapReduce job to run), then I would use
> the enqueueMutate call. It is similar to the NRT features of most
> search engines; it basically indexes as fast as it can without causing
> a large latency on the client.

As I said, it's just an imaginary situation, but my intention was that
we would have data-ingestion bursts lasting an hour and happening three
times a day. I just wanted to understand how quickly the new data would
become available for search, and how searches perform while the
ingestion is taking place. I had not really considered the hardware
(maybe a 10-gig network, with 128 GB of memory on each node?).

> > My understanding is that having fewer but bigger shards improves
> > search performance. Is this right?
>
> In general yes, but "fewer" is relative. I have run a table in Blur
> with over 1000 shards on more than 100 shard servers, with segments in
> the 4K-5K (total) range, and the search performance is very acceptable.
> Of course, the fewer the segments, the faster the search executes.
> However, as the segments grow in size, the merges will take longer to
> complete. Take a look at the TieredMergePolicy in Lucene; there are a
> few videos on YouTube that show how merges occur.
>
> > Also, does each shard correspond to one segment file (ignoring
> > snapshots)?
>
> No, each shard equals a Lucene index, which will contain one or more
> segments.
>
> > I am trying to understand what happens when a shard is being searched
> > and someone tries to write to the same shard. Would a new segment be
> > created?
>
> Yes, however Blur controls the way Lucene manages the segments within
> a given index. Basically, Blur creates a lightweight snapshot of the
> index, then executes the query and fetches the results using this
> lightweight snapshot.
>
> > (If so, how do we control merging of segments within a shard?)
>
> All merges are handled by Lucene, but Blur implements a shard merge
> policy globally per shard server, so that the resources that merging
> consumes can be managed per process instead of per index/shard.
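Just to make sure I am following where this hooks in: below is my
mental model of the seam on the plain-Lucene side. This is stock
Lucene 4.x as I understand it, not Blur's actual wiring, and the
policy settings, the scheduler choice, and the "directory" variable
are my own assumptions, so please correct me if Blur plugs in
somewhere else.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.ConcurrentMergeScheduler;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.util.Version;

    // Stock Lucene 4.x sketch -- my guess at where Blur would swap in
    // its per-shard-server merge policy/scheduler, not Blur's code.
    IndexWriterConfig conf = new IndexWriterConfig(
        Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43));

    TieredMergePolicy mergePolicy = new TieredMergePolicy();
    mergePolicy.setMaxMergedSegmentMB(5 * 1024); // cap the largest merged segment
    mergePolicy.setSegmentsPerTier(10.0);        // segments per tier before a merge fires
    conf.setMergePolicy(mergePolicy);

    // I assume this is the seam where a shared, per-process scheduler goes:
    conf.setMergeScheduler(new ConcurrentMergeScheduler());

    IndexWriter writer = new IndexWriter(directory, conf); // directory = the shard's Directory

If that is roughly right, I take "globally per shard server" to mean
the same scheduler instance is shared by every shard's IndexWriter in
the process?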
> Also, there is merge throttling built in, so that you can control how
> much bandwidth each server takes up during merging. Of course, this
> means that merging can fall behind the ingest process. That is OK;
> however, if the situation persists forever, the index will become
> slower and slower to search. Merging also uses the BlockCache for
> performance, but does not affect the contents of the BlockCache.

Can you point me to the code that takes care of this?
(SharedMergeScheduler?)

> > My apologies if this doesn't make a whole lot of sense.
>
> All good questions, let me know if you have more questions.
>
> Aaron
>
> > Thank You.
> >
> > - Rahul
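P.S. For my own notes on the enqueueMutate path, this is roughly how I
picture the client call, pieced together from the docs and the
generated Thrift classes. The host, table, and values are made up, and
I may have a setter or two wrong, so please treat it as a sketch:

    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.*;

    Blur.Iface client = BlurClient.getClient("controller1:40010"); // made-up host

    Record record = new Record();
    record.setRecordId("record-1");
    record.setFamily("fam0");
    record.addToColumns(new Column("col0", "value0"));

    RecordMutation recMut = new RecordMutation();
    recMut.setRecordMutationType(RecordMutationType.REPLACE_ENTIRE_RECORD);
    recMut.setRecord(record);

    RowMutation rowMut = new RowMutation();
    rowMut.setTable("mytable"); // made-up table
    rowMut.setRowId("row-1");
    rowMut.setRowMutationType(RowMutationType.REPLACE_ROW);
    rowMut.addToRecordMutations(recMut);

    // Queued, near-real-time indexing rather than a blocking mutate():
    client.enqueueMutate(rowMut);

If I read your answer correctly, the only client-visible difference
from mutate() is that this returns once the mutation is queued, and
the indexing latency is absorbed on the server side?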

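P.P.S. On the "lightweight snapshot" answer: the analogy I have in my
head is plain Lucene's SearcherManager acquire/release pattern,
sketched below. I am not claiming this is what Blur actually does
internally, and the writer and query variables are assumed to already
exist; it is just how I am picturing a point-in-time view that
concurrent writes cannot disturb.

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.SearcherFactory;
    import org.apache.lucene.search.SearcherManager;
    import org.apache.lucene.search.TopDocs;

    // writer is the shard's IndexWriter, query is any Lucene Query (assumed).
    SearcherManager manager = new SearcherManager(writer, true, new SearcherFactory());

    IndexSearcher searcher = manager.acquire(); // point-in-time view of the segments
    try {
        // Writes landing right now create new segments but cannot
        // change what this searcher sees.
        TopDocs hits = searcher.search(query, 10);
    } finally {
        manager.release(searcher); // drop the snapshot's reference
    }

    manager.maybeRefresh(); // later acquire() calls see newly flushed segments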