I indexed 200K docs with fields indexed as ANALYZED (which includes norms), but the fields were sparse. The "holes" I saw ran into the thousands (sometimes even 80K). Now that I understand this better, I realize that this particular indexing code is incorrect, and that I should have disabled norms. After I did that, performance really improved.

So, judging by the buggy indexing code, this fix is not needed. And I guess large "holes" really represent a bug rather than a common scenario. So I take this proposal back :).

The code I used is from benchmark: TrecContentSource, which takes all the <meta> tags from the HTML files and puts them as properties on DocData, and DocMaker later on adds them to the Document. That's what created the sparseness.

I think I'm going to add two things to benchmark:
1. Add a doc.tokenized.norms property; if set to false, it will use Index.ANALYZED_NO_NORMS or Index.NOT_ANALYZED_NO_NORMS.
2. Add to TrecContentSource a keep.properties attribute, which if set to false will set DocData.props to null. I think for TREC, it really doesn't make sense to index all the <meta> tags.

Shai

On Mon, Jun 22, 2009 at 5:10 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> This code isn't invoked that often, I believe. It only happens when
> there are "holes" in the norms between docs, i.e. you have a field that
> has norms enabled (at least one Document had this Field w/ norms
> enabled in the past), but then you had a series of Docs that had
> disabled norms for the field and so you must fill the hole (since
> norms aren't sparse).
>
> So I think in practice it won't help much? (And, writing long series
> of the same byte is something in general we shouldn't "try" to do ;)
> So I'm not sure I want a public API "inviting" it).
>
> Mike
>
> On Mon, Jun 22, 2009 at 9:04 AM, Shai Erera <ser...@gmail.com> wrote:
> > I'm testing the performance of some indexing code and noticed that
> > NormsWriter.flush() calls IndexOutput.writeByte(defaultNorm) in a loop,
> > writing the same norm every time (lines: 139-140, 157-158, 162-163).
> >
> > In the run where I spotted it, it occurs a few thousand times (I mean
> > a few thousand writeByte calls).
> >
> > I was thinking that if we had writeByte(byte b, int length) in IndexOutput,
> > we could call it once and handle it efficiently where possible. For
> > back-compat, the default impl would just loop and call writeByte(b),
> > but for others, like BufferedIndexOutput, this could fill the array
> > with b, length times. We won't use System.arraycopy, which is faster,
> > but we won't call writeByte thousands of times either.
> >
> > What do you think?
> >
> > Shai
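
For the record, here is roughly what the writeByte(byte b, int length) idea from the original message would look like. This is only a minimal sketch under assumptions: ByteSink and BufferedSink are hypothetical stand-ins written for illustration, not Lucene's IndexOutput or BufferedIndexOutput, and the in-memory target exists only so the sketch runs on its own. The base class loops for back-compat, and the buffered subclass fills its buffer in chunks with Arrays.fill.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical stand-in for IndexOutput (NOT the Lucene class). */
abstract class ByteSink {
    abstract void writeByte(byte b);

    /** Back-compat default: just loop over the single-byte write. */
    void writeByte(byte b, int length) {
        for (int i = 0; i < length; i++) {
            writeByte(b);
        }
    }
}

/** Hypothetical stand-in for a buffered output (NOT BufferedIndexOutput). */
class BufferedSink extends ByteSink {
    private final byte[] buffer;
    private int pos = 0;
    // In-memory target, assumed here only to keep the sketch self-contained.
    private final ByteArrayOutputStream target = new ByteArrayOutputStream();

    BufferedSink(int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        buffer = new byte[bufferSize];
    }

    @Override
    void writeByte(byte b) {
        if (pos == buffer.length) {
            flush();
        }
        buffer[pos++] = b;
    }

    /** Bulk variant: fill the free part of the buffer, flush, repeat. */
    @Override
    void writeByte(byte b, int length) {
        while (length > 0) {
            if (pos == buffer.length) {
                flush();
            }
            int chunk = Math.min(length, buffer.length - pos);
            Arrays.fill(buffer, pos, pos + chunk, b);
            pos += chunk;
            length -= chunk;
        }
    }

    /** Push buffered bytes to the target and reset the buffer position. */
    void flush() {
        target.write(buffer, 0, pos);
        pos = 0;
    }

    /** Flush and return everything written so far. */
    byte[] toByteArray() {
        flush();
        return target.toByteArray();
    }
}
```

With this shape, filling a hole of n docs would be a single writeByte(defaultNorm, n) call instead of n separate calls. As discussed above, though, large holes most likely point to a bug in the indexing code, so this is probably not worth a public API.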