On Wed, Dec 10, 2014 at 3:46 PM, Tom Burton-West <tburt...@umich.edu> wrote:
> Thanks Robert,
>
> With indexes close to 1 TB in size, I/O is usually our big bottleneck.
>
> Can you point me to where in the 4.x codebase and/or 5.x codebase I should
> look to get a feel for what you mean by I/O locality? Or should I be
> looking at a JIRA issue? Is there a short explanation you might be able
> to supply?
Start at SegmentMerger in both places.

In 4.10.x you can see how it just validates every part of every reader in a
naive loop:
https://github.com/apache/lucene-solr/blob/lucene_solr_4_10/lucene/core/src/java/org/apache/lucene/index/SegmentMerger.java#L58

In 5.x it is not done with this loop; instead, responsibility for the merge
is in the codec API, so the check is done "fine-grained" for each part of
the index. For example, for stored fields we verify each reader's stored
fields portion right before we merge in that individual piece:
https://github.com/apache/lucene-solr/blob/branch_5x/lucene/core/src/java/org/apache/lucene/codecs/StoredFieldsWriter.java#L82

Note the default codec optimizes merge() further for stored fields and term
vectors with a bulk byte copy that verifies as it copies. This bulk-copy
case is the typical one, when you aren't "upgrading" old segments, using
something like SortingMergePolicy, etc.:
https://github.com/apache/lucene-solr/blob/branch_5x/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressingStoredFieldsWriter.java#L355

> Tom
>
> On Wed, Dec 10, 2014 at 3:31 PM, Robert Muir <rcm...@gmail.com> wrote:
>>
>> There are two costs: CPU and I/O.
>>
>> The CPU cost is not much anyway, and can be made basically trivial by
>> using Java 8.
>> The I/O cost is because the check is not done with any I/O locality to
>> the data being merged, so it could be a perf hit for an extremely
>> large merge.
>>
>> In 5.0 the option is removed: we reworked this computation in merging
>> to always have locality and so on, and the checking always happens.
>>
>> On Wed, Dec 10, 2014 at 2:51 PM, Tom Burton-West <tburt...@umich.edu>
>> wrote:
>> > Hello all,
>> >
>> > In the example solrconfig.xml file for Solr 4.10.2 there is a comment
>> > (appended below) that says that setting checkIntegrityAtMerge to true
>> > reduces the risk of index corruption at the expense of slower merging.
>> >
>> > Can someone please point me to any benchmarks or details about the
>> > trade-offs? What kind of a slowdown occurs, and what factors affect
>> > the magnitude of the slowdown?
>> >
>> > I have huge indexes with huge merges, so I would really love to enable
>> > integrity checking. On the other hand, we have very rarely ever had a
>> > problem with a corrupt index, and we always run CheckIndex at the end
>> > of the indexing process when we are re-indexing the entire corpus.
>> >
>> > I'd like to get some kind of understanding of how much this will cost
>> > us in merge speed, since re-indexing our corpus takes about 10 days
>> > and much of that time is spent on merging.
>> >
>> > We index 13 million books (nearly 4 billion pages), averaging about
>> > 100,000 tokens/book. We now have about 1 million books per shard.
>> > Merging 30,000 volumes takes about 30 minutes, with larger merges
>> > taking longer.
>> >
>> > <!--
>> >   Use true to enable this safety check, which can help
>> >   reduce the risk of propagating index corruption from older
>> >   segments into new ones, at the expense of slower merging.
>> > -->
>> > <checkIntegrityAtMerge>false</checkIntegrityAtMerge>
>> >
>> > Tom Burton-West
>> > http://www.hathitrust.org/blogs/Large-scale-Search
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
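The "verifies as it copies" idea Robert describes can be sketched in
miniature with plain Java and CRC32 (Lucene's codec footers do use CRC32
checksums). This is a hypothetical illustration of the pattern, not
Lucene's actual code: the checksum is accumulated during the same
sequential pass that performs the copy, so verification adds no extra
I/O pass over the data.

```java
import java.util.zip.CRC32;

// Hypothetical sketch of verify-while-copying: accumulate a CRC32 over
// the bytes in the same sequential pass that copies them, instead of
// making a separate up-front validation pass (which has no I/O locality
// with the copy that follows).
public class VerifyingCopy {

    // Copies src into a new array, checksumming each chunk as it goes;
    // throws if the accumulated checksum does not match the expected one.
    public static byte[] copyAndVerify(byte[] src, long expectedCrc) {
        CRC32 crc = new CRC32();
        byte[] dst = new byte[src.length];
        int chunk = 4096; // stream in blocks, as a bulk merge copy would
        for (int off = 0; off < src.length; off += chunk) {
            int len = Math.min(chunk, src.length - off);
            crc.update(src, off, len);               // verify...
            System.arraycopy(src, off, dst, off, len); // ...while copying
        }
        if (crc.getValue() != expectedCrc) {
            throw new RuntimeException("checksum mismatch: data is corrupt");
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] data = "some stored fields bytes".getBytes();
        CRC32 c = new CRC32();
        c.update(data, 0, data.length);
        byte[] copy = copyAndVerify(data, c.getValue());
        System.out.println(new String(copy));
    }
}
```

The contrast with the 4.10.x approach is the ordering: there, all readers
are checksummed up front in one loop, and only afterwards are their bytes
read again for the merge; here the two reads collapse into one.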