On Mon, Apr 7, 2014 at 8:35 PM, rahul challapalli <[email protected]> wrote:
> Hi,
>
> I want to refresh my understanding, so just a few imaginary situations.
> Let's say I have a data ingestion of 25000 docs per second (average size
> 10k). There could be situations where I want to optimize for indexing and
> in some cases I want to optimize for speed while searching. How do I
> control these individually?

This is the biggest challenge of any search solution. I would say that Blur handles this by utilizing the hardware it's given to the best of its ability. That being said, 25K docs per second at 10K in size as a constant input, day after day, is a big problem to deal with on the search side of the house. That's north of 2 billion docs a day (25,000 docs/sec x 86,400 sec/day = 2.16 billion) at over 20 TB a day. Some questions I would ask: Are you expecting to keep all of the data online forever? What are the time-to-search requirements (visibility latency)? What kind of hardware are you expecting to run this on?

All of this assumes that the 25K/second is a constant average. If the 25K/second is a burst from time to time throughout the day, that is likely a much easier question to answer. In either case, if you need near-real-time access to the documents (meaning you can't wait for a MapReduce job to run), then I would use the enqueueMutate call. It is similar to the NRT features of most search engines; it basically indexes as fast as it can without causing a large latency on the client.

> My understanding is that having fewer but bigger shards improves search
> performance. Is this right?

In general yes, but "fewer" is relative. I have run tables in Blur with over 1000 shards on more than 100 shard servers, with segments in the 4K-5K (total) range, and the search performance is very acceptable. Of course, the fewer the segments the faster the search executes. However, as the segments grow in size, the merges will take longer to complete. Take a look at the TieredMergePolicy in Lucene; there are a few videos on YouTube that show how merges occur.
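If it helps to see the idea, here's a toy Python model of tiered merging. This is NOT Blur or Lucene code (the real TieredMergePolicy also weighs segment byte sizes, deletes, and merge costs); the tier-by-digit-count heuristic and the segments-per-tier threshold below are simplifications I made up purely to illustrate why you end up with a few large segments instead of thousands of tiny ones:

```python
from collections import defaultdict

SEGS_PER_TIER = 10  # merge whenever a "tier" collects this many segments


def flush_and_merge(segments, new_seg_docs):
    """Append a freshly flushed segment, then merge any tier that is full.

    segments: list of segment sizes (doc counts).
    """
    segments.append(new_seg_docs)
    merged = True
    while merged:
        merged = False
        tiers = defaultdict(list)  # crude tier = number of digits in the size
        for i, size in enumerate(segments):
            tiers[len(str(size))].append(i)
        for idxs in tiers.values():
            if len(idxs) >= SEGS_PER_TIER:
                total = sum(segments[i] for i in idxs)
                keep = [s for i, s in enumerate(segments) if i not in idxs]
                segments[:] = keep + [total]  # replace the tier with one big segment
                merged = True
                break


index = []
for _ in range(100):
    flush_and_merge(index, 1000)  # 100 flushes of 1,000 docs each

print(len(index), sum(index))  # prints: 1 100000
```

Note how every tenth flush cascades: ten 1K segments merge into one 10K segment, and the tenth 10K segment merges into a single 100K segment. All 100,000 docs survive, but searches only have to visit a handful of segments; the price is that those bigger merges rewrite more and more data, which is exactly the merge-cost trade-off described above.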
> Also does each shard correspond to one segment file (ignoring snapshots)?

No, each shard equals a Lucene index, which will contain 1 or more segments.

> I am trying to understand what happens when a shard is being searched and
> someone tries to write to the same shard. Would a new segment be created?

Yes, however Blur controls the way Lucene manages the segments within a given index. Basically, Blur creates a lightweight snapshot of the index, then executes the query and fetches the results using this lightweight snapshot.

> (if so how do we control merging of segments within a shard?)

All merges are handled by Lucene, but Blur implements a shard merge policy globally per shard server so that the resources that merging consumes can be managed per process instead of per index/shard. There is also merge throttling built in, so you can control how much bandwidth each server takes up during merging. Of course, this means that merging can fall behind the ingest process. That is OK; however, if the situation persists indefinitely, the index will become slower and slower to search. Merging also uses the BlockCache for performance, but does not affect the contents of the BlockCache.

> My apologies if this doesn't make a whole lot of sense.

All good questions, let me know if you have more.

Aaron

> Thank You.
>
> - Rahul
