Thanks, Aaron, for your prompt and detailed responses. You have been a real pleasure to collaborate with, and you always encourage me to ask more questions (however dumb they are). Below are a few comments.
On Mon, Apr 7, 2014 at 7:09 PM, Aaron McCurry <[email protected]> wrote:

> On Mon, Apr 7, 2014 at 8:35 PM, rahul challapalli <
> [email protected]> wrote:
>
> > Hi,
> >
> > I want to refresh my understanding, so here are a few imaginary
> > situations. Let's say I have a data ingestion rate of 25,000 docs per
> > second (average size 10 KB). In some situations I want to optimize for
> > indexing, and in others I want to optimize for search speed. How do I
> > control these individually?
>
> This is the biggest challenge of any search solution. I would say that
> Blur handles this by utilizing the hardware it's given to the best of
> its ability. That being said, 25K docs per second at 10 KB each as a
> constant input, day after day, is a big problem to deal with on the
> search side of the house. That's north of 2 billion docs a day at over
> 20 TB a day. Some questions I would ask: Are you expecting to keep all
> of the data online forever? What are the time-to-search requirements
> (visibility latency)? What kind of hardware are you expecting to run
> this on? All of this assumes that the 25K/second is a constant average.
> If the 25K/second is a burst from time to time throughout the day, that
> is likely a much easier question to answer.
>
> In either case, if you need near-real-time access to the documents
> (meaning you can't wait for a MapReduce job to run), then I would use
> the enqueueMutate call. It is similar to the NRT features of most
> search engines; it basically indexes as fast as it can without causing
> a large latency on the client.

As I said, it's just an imaginary situation, but my intention was that
we would have data-ingestion bursts lasting an hour and happening three
times a day. I just wanted to understand how quickly the new data would
become available for search, and how searches perform while the
ingestion is taking place. I had not really considered the hardware
(maybe a 10-gig network, with 128 GB of memory on each node?).

> > My understanding is that having fewer but bigger shards improves
> > search performance. Is this right?
>
> In general yes, but "fewer" is relative. I have run a table in Blur
> with over 1000 shards on more than 100 shard servers, with segments in
> the 4K-5K (total) range, and the search performance is very acceptable.
> Of course, the fewer the segments, the faster the search executes.
> However, as the segments grow in size, the merges will take longer to
> complete. Take a look at the TieredMergePolicy in Lucene; there are a
> few videos on YouTube that show how merges occur.
>
> > Also, does each shard correspond to one segment file (ignoring
> > snapshots)?
>
> No, each shard equals a Lucene index, which will contain one or more
> segments.
>
> > I am trying to understand what happens when a shard is being searched
> > and someone tries to write to the same shard. Would a new segment be
> > created?
>
> Yes, however Blur controls the way Lucene manages the segments within
> a given index. Basically, Blur creates a lightweight snapshot of the
> index, then executes the query and fetches the results using this
> lightweight snapshot.
>
> > (If so, how do we control merging of segments within a shard?)
>
> All merges are handled by Lucene, but Blur implements a shard merge
> policy globally per shard server, so that the resources that merging
> consumes can be managed per process instead of per index/shard.
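Just to make sure I am following where this hooks in: below is my
mental model of the seam on the plain-Lucene side. This is stock
Lucene 4.x as I understand it, not Blur's actual wiring, and the
policy settings, the scheduler choice, and the "directory" variable
are my own assumptions, so please correct me if Blur plugs in
somewhere else.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.ConcurrentMergeScheduler;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.util.Version;

    // Stock Lucene 4.x sketch -- my guess at where Blur would swap in
    // its per-shard-server merge policy/scheduler, not Blur's code.
    IndexWriterConfig conf = new IndexWriterConfig(
        Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43));

    TieredMergePolicy mergePolicy = new TieredMergePolicy();
    mergePolicy.setMaxMergedSegmentMB(5 * 1024); // cap the largest merged segment
    mergePolicy.setSegmentsPerTier(10.0);        // segments per tier before a merge fires
    conf.setMergePolicy(mergePolicy);

    // I assume this is the seam where a shared, per-process scheduler goes:
    conf.setMergeScheduler(new ConcurrentMergeScheduler());

    IndexWriter writer = new IndexWriter(directory, conf); // directory = the shard's Directory

If that is roughly right, I take "globally per shard server" to mean
the same scheduler instance is shared by every shard's IndexWriter in
the process?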
> Also, there is merge throttling built in, so that you can control how
> much bandwidth each server takes up during merging. Of course, this
> means that merging can fall behind the ingest process. That is OK;
> however, if the situation persists forever, the index will become
> slower and slower to search. Merging also uses the BlockCache for
> performance, but does not affect the contents of the BlockCache.

Can you point me to the code that takes care of this?
(SharedMergeScheduler?)

> > My apologies if this doesn't make a whole lot of sense.
>
> All good questions, let me know if you have more questions.
>
> Aaron
>
> > Thank You.
> >
> > - Rahul
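P.S. For my own notes on the enqueueMutate path, this is roughly how I
picture the client call, pieced together from the docs and the
generated Thrift classes. The host, table, and values are made up, and
I may have a setter or two wrong, so please treat it as a sketch:

    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.*;

    Blur.Iface client = BlurClient.getClient("controller1:40010"); // made-up host

    Record record = new Record();
    record.setRecordId("record-1");
    record.setFamily("fam0");
    record.addToColumns(new Column("col0", "value0"));

    RecordMutation recMut = new RecordMutation();
    recMut.setRecordMutationType(RecordMutationType.REPLACE_ENTIRE_RECORD);
    recMut.setRecord(record);

    RowMutation rowMut = new RowMutation();
    rowMut.setTable("mytable"); // made-up table
    rowMut.setRowId("row-1");
    rowMut.setRowMutationType(RowMutationType.REPLACE_ROW);
    rowMut.addToRecordMutations(recMut);

    // Queued, near-real-time indexing rather than a blocking mutate():
    client.enqueueMutate(rowMut);

If I read your answer correctly, the only client-visible difference
from mutate() is that this returns once the mutation is queued, and
the indexing latency is absorbed on the server side?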

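P.P.S. On the "lightweight snapshot" answer: the analogy I have in my
head is plain Lucene's SearcherManager acquire/release pattern,
sketched below. I am not claiming this is what Blur actually does
internally, and the writer and query variables are assumed to already
exist; it is just how I am picturing a point-in-time view that
concurrent writes cannot disturb.

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.SearcherFactory;
    import org.apache.lucene.search.SearcherManager;
    import org.apache.lucene.search.TopDocs;

    // writer is the shard's IndexWriter, query is any Lucene Query (assumed).
    SearcherManager manager = new SearcherManager(writer, true, new SearcherFactory());

    IndexSearcher searcher = manager.acquire(); // point-in-time view of the segments
    try {
        // Writes landing right now create new segments but cannot
        // change what this searcher sees.
        TopDocs hits = searcher.search(query, 10);
    } finally {
        manager.release(searcher); // drop the snapshot's reference
    }

    manager.maybeRefresh(); // later acquire() calls see newly flushed segments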