On Wed, Dec 10, 2014 at 3:46 PM, Tom Burton-West <tburt...@umich.edu> wrote:
> Thanks Robert,
>
> With indexes close to 1 TB in size, I/O is usually our big bottleneck.
>
> Can you point me to where in the 4.x codebase and/or 5.x codebase I should
> look to get a feel for what you mean by i/o locality?  Or should I be
> looking at a JIRA issue?
> Is there a short explanation you might be able to supply?

Start at SegmentMerger in both places.

In 4.10.x you can see how it just validates every part of every reader
in a naive loop:
https://github.com/apache/lucene-solr/blob/lucene_solr_4_10/lucene/core/src/java/org/apache/lucene/index/SegmentMerger.java#L58

In 5.x this loop is gone; responsibility for the merge lies with the
codec API instead.
So verification is done in a "fine-grained" way for each part of the
index. For stored fields, for example, we verify each reader's stored
fields right before we merge that individual piece:
https://github.com/apache/lucene-solr/blob/branch_5x/lucene/core/src/java/org/apache/lucene/codecs/StoredFieldsWriter.java#L82
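
To make the ordering difference concrete, here is a toy sketch in plain
Java. It is not Lucene code: MergeOrderSketch, PARTS, and the segment
names are all hypothetical, and it only records the order in which
verify and merge steps would run. The first method mimics the 4.10.x
up-front loop; the second mimics the 5.x per-part approach, where each
verify is immediately followed by the merge of the same data:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: these names are illustrative and are NOT Lucene classes.
class MergeOrderSketch {
    // The index parts each reader exposes in this toy model (assumed set).
    static final String[] PARTS = {"storedFields", "termVectors", "postings"};

    // 4.10.x-style: verify every part of every reader first, then merge.
    static List<String> verifyUpFront(List<String> readers) {
        List<String> ops = new ArrayList<>();
        for (String reader : readers) {
            for (String part : PARTS) {
                ops.add("verify " + reader + "." + part);
            }
        }
        for (String part : PARTS) {
            for (String reader : readers) {
                ops.add("merge " + reader + "." + part);
            }
        }
        return ops;
    }

    // 5.x-style: verify each reader's part right before merging that part.
    static List<String> verifyPerPart(List<String> readers) {
        List<String> ops = new ArrayList<>();
        for (String part : PARTS) {
            for (String reader : readers) {
                ops.add("verify " + reader + "." + part);
                ops.add("merge " + reader + "." + part);
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        List<String> readers = Arrays.asList("seg0", "seg1");
        System.out.println("up front:");
        for (String op : verifyUpFront(readers)) {
            System.out.println("  " + op);
        }
        System.out.println("per part:");
        for (String op : verifyPerPart(readers)) {
            System.out.println("  " + op);
        }
    }
}
```

In the first ordering, all verification reads finish before any merge
read starts, so the same bytes are visited twice at widely separated
times. In the second, each verify sits next to the merge of the same
data, which is what gives the i/o locality.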

Note that the default codec optimizes merge() further for stored fields
and term vectors, using a bulk byte copy that verifies as it copies.
This bulk copy is the typical case: it applies when you aren't
"upgrading" old segments, using something like SortingMergePolicy, etc.:
https://github.com/apache/lucene-solr/blob/branch_5x/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressingStoredFieldsWriter.java#L355
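
The "verifies as it copies" idea can be sketched with the JDK alone.
This is only an illustration, not the CompressingStoredFieldsWriter
implementation: VerifyingCopySketch, copyAndVerify, and crcOf are
hypothetical names, and the sketch assumes the expected CRC32 of the
source bytes is already known (for example, stored alongside the data).
The point is that the checksum is computed over the same bytes, in the
same pass, as the copy:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

// Hypothetical sketch: single-pass copy that checksums the bytes it copies.
class VerifyingCopySketch {
    // Copies all bytes from in to out and returns the number of bytes copied.
    // Throws IOException if the CRC32 of the copied bytes != expectedCrc.
    static long copyAndVerify(InputStream in, OutputStream out, long expectedCrc)
            throws IOException {
        CheckedInputStream checked = new CheckedInputStream(in, new CRC32());
        byte[] buffer = new byte[8192];
        long copied = 0;
        int n;
        while ((n = checked.read(buffer)) != -1) {
            out.write(buffer, 0, n); // every byte read is also checksummed
            copied += n;
        }
        long actualCrc = checked.getChecksum().getValue();
        if (actualCrc != expectedCrc) {
            throw new IOException("checksum mismatch: expected=" + expectedCrc
                    + " actual=" + actualCrc);
        }
        return copied;
    }

    // Helper for the demo: CRC32 of a byte array.
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "stored fields bytes".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copyAndVerify(new ByteArrayInputStream(data), out, crcOf(data));
        System.out.println("copied and verified " + copied + " bytes");
    }
}
```

Note that in this sketch a mismatch is only detected after the bytes
have been written, so a caller would need to discard the output when
copyAndVerify throws.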

>
> Tom
>
>
>
> On Wed, Dec 10, 2014 at 3:31 PM, Robert Muir <rcm...@gmail.com> wrote:
>>
>> There are two costs: CPU and I/O.
>>
>> The CPU cost is small anyway and can be made basically trivial by
>> using Java 8.
>> The I/O cost arises because the check is not done with any I/O
>> locality to the data being merged, so it could be a performance hit
>> for an extremely large merge.
>>
>> In 5.0 the option is removed: we reworked this computation in merging
>> so that it always has locality, and the checking always happens.
>>
>> On Wed, Dec 10, 2014 at 2:51 PM, Tom Burton-West <tburt...@umich.edu>
>> wrote:
>> > Hello all,
>> >
>> > In the example solrconfig.xml file for Solr 4.10.2 there is the comment
>> > (appended below) that says that  setting checkIntegrityAtMerge to true
>> > reduces risk of index corruption at the expense of slower merging.
>> >
>> > Can someone please point me to any benchmarks or details about the
>> > trade-offs?   What kind of a slowdown occurs and what are the factors
>> > affecting the magnitude of the slowdown?
>> >
>> > I have huge indexes with huge merges, so I would really love to enable
>> > integrity checking.  On the other hand, we have very rarely had a
>> > problem with a corrupt index, and we always run CheckIndex at the end
>> > of the indexing process when we re-index the entire corpus.
>> >
>> > I'd like to get some understanding of how much this will cost us in
>> > merge speed, since re-indexing our corpus takes about 10 days and much
>> > of that time is spent on merging.
>> >
>> > We index 13 million books (nearly 4 billion pages) averaging about
>> > 100,000 tokens/book.  We now have about 1 million books per shard.
>> > Merging 30,000 volumes takes about 30 minutes, with larger merges
>> > taking longer.
>> >
>> >
>> >   <!--
>> >         Use true to enable this safety check, which can help
>> >         reduce the risk of propagating index corruption from older
>> >         segments into new ones, at the expense of slower merging.
>> >     -->
>> >      <checkIntegrityAtMerge>false</checkIntegrityAtMerge>
>> >
>> > Tom Burton-West
>> > http://www.hathitrust.org/blogs/Large-scale-Search
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>

