I finally buckled down and made the change to CheckIndex to verify that I do
not, in fact, have any fields with norms in this index. The result is below;
the largest segment is currently #3, with 300,000+ fields but no norms.
-Mark
Segments file=segments_acew numSegments=9 version=FORMAT_DIAGNOSTICS [Lucene
2.9]
1 of 9: name=_bfkv docCount=8642
compound=true
hasProx=true
numFiles=2
size (MB)=7.921
diagnostics = {optimize=false, mergeFactor=1,
os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_bfkv_1.del]
test: open reader.........OK [77 deleted docs]
test: fields..............OK [114226 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [103996 terms; 926779 terms/docs pairs; 919464
tokens]
test: stored fields.......OK [202850 total field count; avg 23.684 fields
per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector
fields per doc]
2 of 9: name=_1gi5 docCount=0
compound=true
hasProx=true
numFiles=1
size (MB)=0.001
diagnostics = {optimize=false, mergeFactor=1,
os.version=2.6.18-128.7.1.el5, os=Linux, mergeDocStores=true,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: fields..............OK [28 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [0 terms; 0 terms/docs pairs; 0 tokens]
test: stored fields.......OK [0 total field count; avg NaN fields per doc]
test: term vectors........OK [0 total vector count; avg NaN term/freq vector
fields per doc]
3 of 9: name=_bfkw docCount=6433351
compound=true
hasProx=true
numFiles=2
size (MB)=3,969.392
diagnostics = {optimize=false, mergeFactor=10,
os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_bfkw_s7.del]
test: open reader.........OK [89111 deleted docs]
test: fields..............OK [308832 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [47362222 terms; 733184933 terms/docs pairs;
720556927 tokens]
test: stored fields.......OK [186735038 total field count; avg 29.434
fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector
fields per doc]
4 of 9: name=_bglk docCount=100296
compound=true
hasProx=true
numFiles=2
size (MB)=83.448
diagnostics = {optimize=false, mergeFactor=10,
os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_bglk_1p.del]
test: open reader.........OK [7027 deleted docs]
test: fields..............OK [19192 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [1342162 terms; 13987377 terms/docs pairs;
13126384 tokens]
test: stored fields.......OK [3713794 total field count; avg 39.818 fields
per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector
fields per doc]
5 of 9: name=_bglt docCount=3123
compound=true
hasProx=true
numFiles=2
size (MB)=1.999
diagnostics = {optimize=false, mergeFactor=10,
os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_bglt_q.del]
test: open reader.........OK [878 deleted docs]
test: fields..............OK [911 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [28803 terms; 345429 terms/docs pairs; 218626
tokens]
test: stored fields.......OK [73229 total field count; avg 32.619 fields
per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector
fields per doc]
6 of 9: name=_bgme docCount=2339
compound=true
hasProx=true
numFiles=2
size (MB)=1.704
diagnostics = {optimize=false, mergeFactor=10,
os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_bgme_h.del]
test: open reader.........OK [329 deleted docs]
test: fields..............OK [1122 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [30846 terms; 316451 terms/docs pairs; 272709
tokens]
test: stored fields.......OK [69847 total field count; avg 34.75 fields per
doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector
fields per doc]
7 of 9: name=_bgnj docCount=2941
compound=true
hasProx=true
numFiles=2
size (MB)=2.2
diagnostics = {optimize=false, mergeFactor=10,
os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_bgnj_d.del]
test: open reader.........OK [527 deleted docs]
test: fields..............OK [1846 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [42379 terms; 412630 terms/docs pairs; 341300
tokens]
test: stored fields.......OK [83805 total field count; avg 34.716 fields
per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector
fields per doc]
8 of 9: name=_bgo4 docCount=3899
compound=true
hasProx=true
numFiles=1
size (MB)=2.988
diagnostics = {optimize=false, mergeFactor=10,
os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: fields..............OK [1367 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [52773 terms; 505461 terms/docs pairs; 505461
tokens]
test: stored fields.......OK [160630 total field count; avg 41.198 fields
per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector
fields per doc]
9 of 9: name=_bgo5 docCount=4
compound=true
hasProx=true
numFiles=1
size (MB)=0.007
diagnostics = {os.version=2.6.18-194.26.1.el5, os=Linux,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=flush, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: fields..............OK [87 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [298 terms; 440 terms/docs pairs; 440 tokens]
test: stored fields.......OK [95 total field count; avg 23.75 fields per
doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector
fields per doc]
On Nov 17, 2010, at 1:51 PM, Michael McCandless wrote:
> Lucene interns field names... since you have a truly enormous number
> of unique fields, it's expected that intern will be called a lot.
>
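> As a minimal illustration of what interning does (plain java.lang.String,
> nothing Lucene-specific):
>
>   String a = new String("title");
>   String b = "title";
>   System.out.println(a == b);          // false: two distinct objects
>   System.out.println(a.intern() == b); // true: intern() returns the canonical copy
>
> Since Lucene canonicalizes field names this way, hundreds of thousands of
> unique names mean a correspondingly large number of intern calls.
>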
> But that said it's odd that it's this costly.
>
> Can you post the stack traces that call intern?
>
> Mike
>
> On Fri, Nov 5, 2010 at 1:53 PM, Michael McCandless
> <[email protected]> wrote:
>> Hmm...
>>
>> So, I was going on this output from your CheckIndex:
>>
>> test: field norms.........OK [296713 fields]
>>
>> But in fact I just looked, and that number is bogus -- it's always
>> equal to the total number of fields, not the number of fields with norms
>> enabled. I'll open an issue to fix this, but in the meantime can you
>> apply this patch to your CheckIndex and run it again?
>>
>> Index: src/java/org/apache/lucene/index/CheckIndex.java
>> ===================================================================
>> --- src/java/org/apache/lucene/index/CheckIndex.java (revision 1031678)
>> +++ src/java/org/apache/lucene/index/CheckIndex.java (working copy)
>> @@ -570,8 +570,10 @@
>> }
>> final byte[] b = new byte[reader.maxDoc()];
>> for (final String fieldName : fieldNames) {
>> - reader.norms(fieldName, b, 0);
>> - ++status.totFields;
>> + if (reader.hasNorms(fieldName)) {
>> + reader.norms(fieldName, b, 0);
>> + ++status.totFields;
>> + }
>> }
>>
>> msg("OK [" + status.totFields + " fields]");
>>
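>> (For reference, the patched checker can be run standalone against the
>> index directory; the classpath entries and index path below are
>> placeholders for your own setup:
>>
>>   java -cp patched-classes:lucene-core-3.0.0.jar \
>>     org.apache.lucene.index.CheckIndex /path/to/index
>> )
>>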
>> So if in fact you have already disabled norms, then something else is
>> the source of the sudden slowness. Though such a huge number of
>> unique field names is not an area of Lucene that's very well tested...
>> perhaps there's something silly somewhere. Maybe you can try
>> profiling just the init of your IndexReader? (E.g., run java with
>> -agentlib:hprof=cpu=samples,depth=16,interval=1.)
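>>
>> A minimal harness for that (a sketch assuming the Lucene 3.0 API; the
>> class name and index path are placeholders) could look like:
>>
>>   import java.io.File;
>>   import org.apache.lucene.index.IndexReader;
>>   import org.apache.lucene.store.FSDirectory;
>>
>>   public class OpenTimer {
>>     public static void main(String[] args) throws Exception {
>>       long t0 = System.nanoTime();
>>       // open read-only; this is the step that loads per-segment state
>>       IndexReader r = IndexReader.open(FSDirectory.open(new File(args[0])), true);
>>       System.out.println("open took " + ((System.nanoTime() - t0) / 1000000) + " ms");
>>       r.close();
>>     }
>>   }
>>
>> Run it under the hprof options above and the hot stacks during init
>> should stand out.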
>>
>> Yes, both Index.NOT_ANALYZED_NO_NORMS and Index.NO will disable norms
>> as long as no document in the index ever had norms on (yes it does
>> "infect" heh).
>>
>> Mike
>>
>> On Fri, Nov 5, 2010 at 1:37 PM, Mark Kristensson
>> <[email protected]> wrote:
>>> While most of our Lucene indexes are used for more traditional searching,
>>> this index in particular is used more like a reporting repository. Thus, we
>>> really do need to have that many fields indexed and they do need to be
>>> broken out into separate fields. There may be another way to structure the
>>> index to reduce the number of fields, but I'm hoping we can optimize the
>>> current design and avoid (yet another) index redesign.
>>>
>>> I'll look into tweaking the merge policy, but I'm more interested in
>>> disabling norms because scoring really doesn't matter for us. Basically, we
>>> need nothing more than a binary answer from Lucene: either a record meets
>>> the provided criteria (which can be a rather complex boolean query with
>>> many subqueries) or it doesn't. If the record does match, then we get the
>>> IDs from Lucene and run off to get the live data from our primary data
>>> store and sort it (in Java) based upon criteria provided by the user, not
>>> by score.
>>>
>>> After our initial design mushroomed in size, we redesigned, and now (I
>>> thought) we do not have norms on any of the fields in this index. So I'm
>>> wondering if there was something in the CheckIndex results I provided
>>> that indicates to you that we may still have norms enabled? I
>>> know that if you have norms on any one document's field, then any other
>>> document with that same field will get "infected" with norms as well.
>>>
>>> My understanding is that any field that uses the constants
>>> Index.NOT_ANALYZED_NO_NORMS or Index.NO will not have norms on it,
>>> regardless of whether or not the field is stored. Is that not correct?
>>>
>>> Thanks,
>>> Mark
>>>
>>>
>>>
>>> On Nov 4, 2010, at 2:56 AM, Michael McCandless wrote:
>>>
>>>> Likely what happened is you had a bunch of smaller segments, and then
>>>> suddenly they got merged into that one big segment (_aiaz) in your
>>>> index.
>>>>
>>>> The representation for norms in particular is not sparse, so this
>>>> means the size of the norms file for a given segment will be
>>>> number-of-unique-indexed-fields X number-of-documents.
>>>>
>>>> So this count grows quadratically on merge.
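>>>>
>>>> To put illustrative numbers on that: at one norms byte per field per
>>>> document, a merged segment with 300,000 unique indexed fields and
>>>> 6 million documents would need 300,000 x 6,000,000 bytes, i.e. roughly
>>>> 1.8 TB, for norms alone.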
>>>>
>>>> Do these fields really need to be indexed? If so, it'd be better to
>>>> use a single field for all users for the indexable text if you can.
>>>>
>>>> Failing that, a simple workaround is to set the maxMergeMB/Docs on the
>>>> merge policy; this'd prevent big segments from being produced.
>>>> Disabling norms should also workaround this, though that will affect
>>>> hit scores...
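>>>>
>>>> A sketch of the merge-policy workaround (3.0 API; the size/doc caps
>>>> are arbitrary examples, so tune them to your segment sizes):
>>>>
>>>>   // assuming dir (Directory) and analyzer (Analyzer) already exist
>>>>   IndexWriter writer = new IndexWriter(dir, analyzer,
>>>>       IndexWriter.MaxFieldLength.UNLIMITED);
>>>>   LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy(writer);
>>>>   mp.setMaxMergeMB(512.0);      // segments larger than this are never merged
>>>>   mp.setMaxMergeDocs(1000000);  // likewise for doc count
>>>>   writer.setMergePolicy(mp);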
>>>>
>>>> Mike
>>>>
>>>> On Wed, Nov 3, 2010 at 7:37 PM, Mark Kristensson
>>>> <[email protected]> wrote:
>>>>> Yes, we do have a large number of unique field names in that index,
>>>>> because they are driven by user-named fields in our application (with
>>>>> some cleaning to remove illegal chars).
>>>>>
>>>>> This slowness problem has appeared very suddenly in the last couple of
>>>>> weeks, and the number of unique field names has not spiked in that time.
>>>>> Have we crept over some threshold with our linear growth in the number
>>>>> of unique field names? Perhaps we are violating some limit driven by the
>>>>> amount of RAM in the machine? Are there any guidelines for the maximum,
>>>>> or suggested, number of unique field names in an index or segment? Any
>>>>> suggestions for potentially mitigating the problem?
>>>>>
>>>>> Thanks,
>>>>> Mark
>>>>>
>>>>>
>>>>> On Nov 3, 2010, at 2:02 PM, Michael McCandless wrote:
>>>>>
>>>>>> On Wed, Nov 3, 2010 at 4:27 PM, Mark Kristensson
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> I've run CheckIndex against the index and the results are below. The
>>>>>>> net is that it's telling me nothing is wrong with the index.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>> I did not have any instrumentation around the opening of the
>>>>>>> IndexSearcher (we don't use an IndexReader), just around the actual
>>>>>>> query execution, so I had to add some additional logging. What I found
>>>>>>> surprised me: opening a searcher against this index takes the same 6
>>>>>>> to 8 seconds that closing the IndexWriter takes.
>>>>>>
>>>>>> IndexWriter opens a SegmentReader for each segment in the index, to
>>>>>> apply deletions, so I think this is the source of the slowness.
>>>>>>
>>>>>> From the CheckIndex output, it looks like you have many (296,713)
>>>>>> unique field names in that one large segment -- does that sound
>>>>>> right? I suspect such a very high field count is the source of the
>>>>>> slowness...
>>>>>>
>>>>>> Mike
>>>>>>
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]