Re: [jira] [Created] (BLUR-94) Optimize index in flight

Aaron McCurry Tue, 21 May 2013 20:06:15 -0700

Absolutely.

First let me describe a typical MapReduce (MR) + Blur environment.  I
would recommend that you run 2 clusters, one running MR + HDFS and the
other running Blur + HDFS.  Obviously, if you want to load data into Blur
via Thrift you do not need MR.  However, if you are going to load data via
MR, the goal is to move as much of the processing as possible to the MR
cluster and away from the running Blur cluster.

So given that, the BlurOutputFormat will by default index the data locally,
as opposed to opening an HdfsDirectory on the remote Blur HDFS cluster.
This minimizes the amount of network traffic to and from the Blur HDFS
cluster.  It does this because when Lucene is indexing it creates segments
based on an internal buffer and then flushes them to disk.  Once enough of
these segments have been created, it combines them into a single larger
segment, which is called merging.  This merging occurs automatically in the
process of indexing with Lucene, but by default Lucene never fully
optimizes.  And it usually doesn't have to fully optimize to a single
segment to have excellent performance.  But the act of creating and merging
segments causes the same data to go across the wire multiple times if the
index locally setting is disabled.  If the data is indexed locally, it
only has to travel across the network once, at the end of the job.
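
To put rough numbers on that, here is a minimal sketch of a toy model.  It
is not how Lucene's merge policy really behaves; it simply assumes that
every flushed segment has the same size and that every mergeFactor segments
of the same size are merged into one.  The class name, the method names and
all of the numbers are made up for illustration.

```java
// Toy model only: real Lucene merge policies are more nuanced, and every
// size used here is a made-up illustration value (think of it as MB).
final class WireTraffic {

    // Data crossing the network when the index lives on the remote cluster:
    // every flush is written remotely, and every merge reads its inputs from
    // the remote cluster and writes the merged segment back.
    // Assumes mergeFactor >= 2.
    static long remoteIndexingBytes(int flushes, long flushBytes, int mergeFactor) {
        long wire = 0;
        int[] segmentsAtLevel = new int[32];
        for (int i = 0; i < flushes; i++) {
            wire += flushBytes;            // the flushed segment itself
            segmentsAtLevel[0]++;
            long levelBytes = flushBytes;  // size of one segment at this level
            for (int level = 0; segmentsAtLevel[level] == mergeFactor; level++) {
                long mergedBytes = levelBytes * mergeFactor;
                wire += 2 * mergedBytes;   // read the inputs + write the result
                segmentsAtLevel[level] = 0;
                segmentsAtLevel[level + 1]++;
                levelBytes = mergedBytes;
            }
        }
        return wire;
    }

    // Data crossing the network when indexing locally: the finished index is
    // copied once at the end of the job.
    static long localIndexingBytes(int flushes, long flushBytes) {
        return flushes * flushBytes;
    }

    public static void main(String[] args) {
        System.out.println(remoteIndexingBytes(100, 10, 10)); // 5000
        System.out.println(localIndexingBytes(100, 10));      // 1000
    }
}
```

In this toy model, 100 flushes of 10 MB each with a merge factor of 10 move
5 times as much data across the network when indexing remotely as when
indexing locally and copying once.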

Hopefully that explains why we need to index locally.

Now for the optimize in flight option.

Given that we are indexing locally, the local index is a whole index by
itself, which means it will likely have many segments.  Once the new index
created by the BlurOutputFormat is delivered to the Blur HDFS cluster it
will be added into the shard's index by calling
IndexWriter.addIndexes(Directory).  This means that all the segments from
the new index will be added to the existing index.  So if the existing index
has 12 segments and the new index has 14, the combined index will have 12 +
14 = 26 segments.  This will likely kick off a merge cycle in the Blur shard
server.  If it happened in only a single shard of a table it probably
wouldn't cause any problems.  However, it is very likely that this will
occur in all the shards of a table at the same time.  So the entire cluster
would go into a massive merge cycle.

And it gets worse if you use the setReducerMultiplier setting, because the
number of new indexes added to each shard in a table goes up by that
multiplier.  And each one of those new indexes would have its own set of
segments.  So here is an example of the merge problem:

Say you have a table with 1024 shards and you run an indexing job with a
reducer multiplier of 8.  The average number of segments per shard is 8 and
the average number of segments per reducer output is 6.  Then the total
number of segments after the MR job would go from (1024 * 8 = 8,192) to
(1024 * 8 + (1024 * 6 * 8) = 57,344).  That's a big jump.

So the optimize in flight option basically makes the number of segments in
the output of every reducer (BlurOutputFormat) always equal 1.  It does this
by opening an IndexWriter on the remote HDFS cluster and adding the local
index to it by calling IndexWriter.addIndexes(IndexReader).  This actually
causes the multiple segment reader to be merged while it is being written.
So the result is that, given the same scenario as above, the number of
segments would go from (1024 * 8 = 8,192) to
(1024 * 8 + (1024 * 1 * 8) = 16,384).  This may or may not trigger merges,
but the indexes will be far more optimized.
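
For anyone who wants to plug in their own numbers, the arithmetic in the two
scenarios above can be sketched as a small calculation.  The class and
method names here are hypothetical and are not part of Blur; the formula is
just the one used in the examples.

```java
// Hypothetical helper, not part of Blur: reproduces the segment arithmetic
// from the two scenarios above.
final class SegmentCount {

    // Total segments in a table after an MR indexing job, assuming every
    // shard receives reducerMultiplier new indexes and each new index
    // contributes segmentsPerReducerOutput segments.
    static long segmentsAfterJob(long shards, long segmentsPerShard,
                                 long reducerMultiplier,
                                 long segmentsPerReducerOutput) {
        long existing = shards * segmentsPerShard;
        long added = shards * reducerMultiplier * segmentsPerReducerOutput;
        return existing + added;
    }

    public static void main(String[] args) {
        // Without optimize in flight: about 6 segments per reducer output.
        System.out.println(segmentsAfterJob(1024, 8, 8, 6)); // 57344
        // With optimize in flight: exactly 1 segment per reducer output.
        System.out.println(segmentsAfterJob(1024, 8, 8, 1)); // 16384
    }
}
```

The only thing that changes between the two calls is the last argument,
which is exactly what optimize in flight controls.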

Sorry for the lengthy response but there is a lot going on during indexing
and I just wanted to make sure you had a good idea of how it all works
together.

Aaron

On Tue, May 21, 2013 at 10:22 PM, Gagan Juneja <[email protected]> wrote:

> Just out of curiosity, could you explain a bit why we need this?
>
> Regards,
> Gagan
>
> On Wednesday, May 22, 2013, Aaron McCurry (JIRA) wrote:
>
> >
> >      [ https://issues.apache.org/jira/browse/BLUR-94?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> >
> > Aaron McCurry closed BLUR-94.
> > -----------------------------
> >
> >     Resolution: Fixed
> >
> >
> >
> > https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=commit;h=0d23eddda70bbee34b2e9d38240b29825455448e
> >
> > > Optimize index in flight
> > > ------------------------
> > >
> > >                 Key: BLUR-94
> > >                 URL: https://issues.apache.org/jira/browse/BLUR-94
> > >             Project: Apache Blur
> > >          Issue Type: Improvement
> > >    Affects Versions: 0.1.5
> > >            Reporter: Aaron McCurry
> > >             Fix For: 0.1.5
> > >
> > >
> > > In the BlurOutputFormat, during the copy phase of the index (where the
> > > locally indexed index is being copied to the remote HDFS), add an option
> > > to optimize the index in flight.  This can be as simple as opening a
> > > writer on the remote destination and calling addIndexes(indexReader) on
> > > that index writer.  This will require the local index to be opened by a
> > > DirectoryReader.
> >
> > --
> > This message is automatically generated by JIRA.
> > If you think it was sent incorrectly, please contact your JIRA
> > administrators
> > For more information on JIRA, see: http://www.atlassian.com/software/jira
> >
>
