Absolutely. First, let me describe a typical MapReduce (MR) + Blur environment. I would recommend that you run two clusters: one running MR + HDFS and the other running Blur + HDFS. Obviously, if you want to load data into Blur via Thrift, you do not need MR. However, if you are going to load data via MR, the goal is to move as much of the processing as possible to the MR cluster and away from the running Blur cluster.

So given that, the BlurOutputFormat will by default index the data locally on the MR cluster, as opposed to opening an HdfsDirectory on the remote Blur HDFS cluster. This minimizes the amount of network traffic to and from the Blur HDFS cluster. The reason is that when Lucene is indexing, it creates segments based on an internal buffer and then flushes them to disk. Once enough of these segments have been created, it combines them into a single larger segment, which is called merging. This merging occurs automatically during indexing with Lucene, but by default Lucene never fully optimizes, and it usually doesn't have to fully optimize down to a single segment to have excellent performance. But if the index locally setting is disabled, the act of creating and merging segments causes the same data to go across the wire multiple times. If the data is indexed locally, it only has to travel across the network once, at the end of the job. Hopefully that explains why we need index locally.

Now for the optimize in flight option. Given that we are indexing locally, the local index is a whole index by itself, which means it will likely have many segments. Once the new index created by the BlurOutputFormat is delivered to the Blur HDFS cluster, it is added to the shard's index by calling IndexWriter.addIndexes(Directory...). This means that all the segments from the new index are added to the existing index. So if the existing index has 12 segments and the new index has 14, the combined index will have 12 + 14 = 26 segments. This will likely kick off a merge cycle in the Blur shard server. If this happened in only a single shard of a table, it probably wouldn't cause any problems. However, it is very likely to occur in all the shards of a table at the same time, so the entire cluster would go into a massive merge cycle. And it gets worse if you use the setReducerMultiplier setting, because the number of new indexes added to each shard of a table goes up by that multiplier.
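
To make the per-shard arithmetic concrete, here is a small sketch in Python. It is not Blur or Lucene code; the function name and its parameters are made up for illustration, and it assumes every reducer output carries the same number of segments and that no merge has run yet.

```python
def segments_after_add(existing_segments: int,
                       segments_per_new_index: int,
                       reducer_multiplier: int = 1) -> int:
    """Segment count of one shard right after the new indexes are added,
    before any merge has had a chance to run.

    Hypothetical helper: each reducer output is assumed to contribute the
    same number of segments, and the shard receives `reducer_multiplier`
    such outputs.
    """
    if existing_segments < 0 or segments_per_new_index < 0:
        raise ValueError("segment counts must be non-negative")
    if reducer_multiplier < 1:
        raise ValueError("reducer_multiplier must be at least 1")
    return existing_segments + segments_per_new_index * reducer_multiplier


# One new index added to a shard: 12 existing + 14 new segments.
print(segments_after_add(12, 14))                        # 26
# Same shard, but with a reducer multiplier of 8: 12 + 14 * 8.
print(segments_after_add(12, 14, reducer_multiplier=8))  # 124
```

The second call shows why the multiplier matters: the existing segments stay fixed, while the added segments scale with the number of reducer outputs per shard.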
Each of those new indexes also has its own set of segments. Here is an example of the merge problem. Say you have a table with 1024 shards and you run an indexing job with a reducer multiplier of 8. The average number of segments per shard is 8, and the average number of segments per reducer output is 6. Then the total number of segments after the MR job would go from (1024 * 8 = 8,192) to (1024 * 8 + (1024 * 6 * 8) = 57,344). That's a big jump.

So optimize in flight basically makes the number of segments in the output of every reducer (BlurOutputFormat) always equal to 1. It does this by opening an IndexWriter on the remote HDFS cluster and adding the local index to it by calling IndexWriter.addIndexes(IndexReader...). This causes the multi-segment reader to be merged into a single segment while it is being written. So given the same scenario as above, the number of segments would go from (1024 * 8 = 8,192) to (1024 * 8 + (1024 * 1 * 8) = 16,384). That may or may not trigger merges, but the indexes will be far more optimized.

Sorry for the lengthy response, but there is a lot going on during indexing and I just wanted to make sure you had a good idea of how it all works together.

Aaron

On Tue, May 21, 2013 at 10:22 PM, Gagan Juneja <[email protected]> wrote:

> For just curiosity Could you explain bit why do we need this?
>
> Regards,
> Gagan
>
> On Wednesday, May 22, 2013, Aaron McCurry (JIRA) wrote:
>
> >
> >      [
> > https://issues.apache.org/jira/browse/BLUR-94?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
> >
> > Aaron McCurry closed BLUR-94.
> > -----------------------------
> >
> >     Resolution: Fixed
> >
> > https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=commit;h=0d23eddda70bbee34b2e9d38240b29825455448e
> >
> > > Optimize index in flight
> > > ------------------------
> > >
> > >                 Key: BLUR-94
> > >                 URL: https://issues.apache.org/jira/browse/BLUR-94
> > >             Project: Apache Blur
> > >          Issue Type: Improvement
> > >    Affects Versions: 0.1.5
> > >            Reporter: Aaron McCurry
> > >             Fix For: 0.1.5
> > >
> > >
> > > In the BlurOutputFormat, during the copy phase of the index (where the
> > > locally indexed index is being copied to the remote hdfs), and an option to
> > > optimize the index in flight.  This can be as simple as opening a writer on
> > > the remote destination and calling addIndex(indexReader) on that index
> > > writer.  This will require the local index to be opened by a
> > > DirectoryReader.
> >
> > --
> > This message is automatically generated by JIRA.
> > If you think it was sent incorrectly, please contact your JIRA
> > administrators
> > For more information on JIRA, see: http://www.atlassian.com/software/jira
> >
