I finally buckled down and made the change to CheckIndex to verify that I do
not, in fact, have any fields with norms in this index. The result is below;
the largest segment is currently #3, with 300,000+ fields but no norms.
-Mark
Segments file=segments_acew numSegments=9 version=FORMAT_DIAGNOSTICS [Lucene
2.9]
1 of 9: name=_bfkv docCount=8642
compound=true
hasProx=true
numFiles=2
size (MB)=7.921
diagnostics = {optimize=false, mergeFactor=1,
os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_bfkv_1.del]
test: open reader.........OK [77 deleted docs]
test: fields..............OK [114226 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [103996 terms; 926779 terms/docs pairs; 919464
tokens]
test: stored fields.......OK [202850 total field count; avg 23.684 fields
per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector
fields per doc]
2 of 9: name=_1gi5 docCount=0
compound=true
hasProx=true
numFiles=1
size (MB)=0.001
diagnostics = {optimize=false, mergeFactor=1,
os.version=2.6.18-128.7.1.el5, os=Linux, mergeDocStores=true,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: fields..............OK [28 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [0 terms; 0 terms/docs pairs; 0 tokens]
test: stored fields.......OK [0 total field count; avg NaN fields per doc]
test: term vectors........OK [0 total vector count; avg NaN term/freq vector
fields per doc]
3 of 9: name=_bfkw docCount=6433351
compound=true
hasProx=true
numFiles=2
size (MB)=3,969.392
diagnostics = {optimize=false, mergeFactor=10,
os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_bfkw_s7.del]
test: open reader.........OK [89111 deleted docs]
test: fields..............OK [308832 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [47362222 terms; 733184933 terms/docs pairs;
720556927 tokens]
test: stored fields.......OK [186735038 total field count; avg 29.434
fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector
fields per doc]
4 of 9: name=_bglk docCount=100296
compound=true
hasProx=true
numFiles=2
size (MB)=83.448
diagnostics = {optimize=false, mergeFactor=10,
os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_bglk_1p.del]
test: open reader.........OK [7027 deleted docs]
test: fields..............OK [19192 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [1342162 terms; 13987377 terms/docs pairs;
13126384 tokens]
test: stored fields.......OK [3713794 total field count; avg 39.818 fields
per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector
fields per doc]
5 of 9: name=_bglt docCount=3123
compound=true
hasProx=true
numFiles=2
size (MB)=1.999
diagnostics = {optimize=false, mergeFactor=10,
os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_bglt_q.del]
test: open reader.........OK [878 deleted docs]
test: fields..............OK [911 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [28803 terms; 345429 terms/docs pairs; 218626
tokens]
test: stored fields.......OK [73229 total field count; avg 32.619 fields
per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector
fields per doc]
6 of 9: name=_bgme docCount=2339
compound=true
hasProx=true
numFiles=2
size (MB)=1.704
diagnostics = {optimize=false, mergeFactor=10,
os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_bgme_h.del]
test: open reader.........OK [329 deleted docs]
test: fields..............OK [1122 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [30846 terms; 316451 terms/docs pairs; 272709
tokens]
test: stored fields.......OK [69847 total field count; avg 34.75 fields per
doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector
fields per doc]
7 of 9: name=_bgnj docCount=2941
compound=true
hasProx=true
numFiles=2
size (MB)=2.2
diagnostics = {optimize=false, mergeFactor=10,
os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_bgnj_d.del]
test: open reader.........OK [527 deleted docs]
test: fields..............OK [1846 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [42379 terms; 412630 terms/docs pairs; 341300
tokens]
test: stored fields.......OK [83805 total field count; avg 34.716 fields
per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector
fields per doc]
8 of 9: name=_bgo4 docCount=3899
compound=true
hasProx=true
numFiles=1
size (MB)=2.988
diagnostics = {optimize=false, mergeFactor=10,
os.version=2.6.18-194.26.1.el5, os=Linux, mergeDocStores=true,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=merge, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: fields..............OK [1367 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [52773 terms; 505461 terms/docs pairs; 505461
tokens]
test: stored fields.......OK [160630 total field count; avg 41.198 fields
per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector
fields per doc]
9 of 9: name=_bgo5 docCount=4
compound=true
hasProx=true
numFiles=1
size (MB)=0.007
diagnostics = {os.version=2.6.18-194.26.1.el5, os=Linux,
lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=flush, os.arch=amd64,
java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: fields..............OK [87 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [298 terms; 440 terms/docs pairs; 440 tokens]
test: stored fields.......OK [95 total field count; avg 23.75 fields per
doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector
fields per doc]
On Nov 17, 2010, at 1:51 PM, Michael McCandless wrote:
> Lucene interns field names... since you have a truly enormous number
> of unique fields, it's expected that intern will be called a lot.
>
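> As a minimal illustration of what interning does (plain java.lang.String,
> nothing Lucene-specific):
>
>   String a = new String("title");
>   String b = "title";
>   System.out.println(a == b);          // false: two distinct objects
>   System.out.println(a.intern() == b); // true: intern() returns the canonical copy
>
> Since Lucene canonicalizes field names this way, hundreds of thousands of
> unique names mean a correspondingly large number of intern calls.
>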
> But that said it's odd that it's this costly.
>
> Can you post the stack traces that call intern?
>
> Mike
>
> On Fri, Nov 5, 2010 at 1:53 PM, Michael McCandless
> <[email protected]> wrote:
>> Hmm...
>>
>> So, I was going on this output from your CheckIndex:
>>
>> test: field norms.........OK [296713 fields]
>>
>> But in fact I just looked, and that number is bogus -- it's always
>> equal to the total number of fields, not the number of fields with norms
>> enabled. I'll open an issue to fix this, but in the meantime can you
>> apply this patch to your CheckIndex and run it again?
>>
>> Index: src/java/org/apache/lucene/index/CheckIndex.java
>> ===================================================================
>> --- src/java/org/apache/lucene/index/CheckIndex.java (revision 1031678)
>> +++ src/java/org/apache/lucene/index/CheckIndex.java (working copy)
>> @@ -570,8 +570,10 @@
>> }
>> final byte[] b = new byte[reader.maxDoc()];
>> for (final String fieldName : fieldNames) {
>> - reader.norms(fieldName, b, 0);
>> - ++status.totFields;
>> + if (reader.hasNorms(fieldName)) {
>> + reader.norms(fieldName, b, 0);
>> + ++status.totFields;
>> + }
>> }
>>
>> msg("OK [" + status.totFields + " fields]");
>>
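>> (For reference, the patched checker can be run standalone against the
>> index directory; the classpath entries and index path below are
>> placeholders for your own setup:
>>
>>   java -cp patched-classes:lucene-core-3.0.0.jar \
>>     org.apache.lucene.index.CheckIndex /path/to/index
>> )
>>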
>> So if in fact you have already disabled norms, then something else is
>> the source of the sudden slowness. Though such a huge number of
>> unique field names is not an area of Lucene that's very well tested...
>> perhaps there's something silly somewhere. Maybe you can try
>> profiling just the init of your IndexReader? (E.g., run java with
>> -agentlib:hprof=cpu=samples,depth=16,interval=1.)
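>>
>> A minimal harness for that (a sketch assuming the Lucene 3.0 API; the
>> class name and index path are placeholders) could look like:
>>
>>   import java.io.File;
>>   import org.apache.lucene.index.IndexReader;
>>   import org.apache.lucene.store.FSDirectory;
>>
>>   public class OpenTimer {
>>     public static void main(String[] args) throws Exception {
>>       long t0 = System.nanoTime();
>>       // open read-only; this is the step that loads per-segment state
>>       IndexReader r = IndexReader.open(FSDirectory.open(new File(args[0])), true);
>>       System.out.println("open took " + ((System.nanoTime() - t0) / 1000000) + " ms");
>>       r.close();
>>     }
>>   }
>>
>> Run it under the hprof options above and the hot stacks during init
>> should stand out.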
>>
>> Yes, both Index.NOT_ANALYZED_NO_NORMS and Index.NO will disable norms
>> as long as no document in the index ever had norms on (yes it does
>> "infect" heh).
>>
>> Mike
>>
>> On Fri, Nov 5, 2010 at 1:37 PM, Mark Kristensson
>> <[email protected]> wrote:
>>> While most of our Lucene indexes are used for more traditional searching,
>>> this index in particular is used more like a reporting repository. Thus, we
>>> really do need to have that many fields indexed and they do need to be
>>> broken out into separate fields. There may be another way to structure the
>>> index to reduce the number of fields, but I'm hoping we can optimize the
>>> current design and avoid (yet another) index redesign.
>>>
>>> I'll look into tweaking the merge policy, but I'm more interested in
>>> disabling norms because scoring really doesn't matter for us. Basically, we
>>> need nothing more than a binary answer from Lucene: either a record meets
>>> the provided criteria (which can be a rather complex boolean query with
>>> many subqueries) or it doesn't. If the record does match, then we get the
>>> IDs from Lucene and run off to get the live data from our primary data
>>> store and sort it (in Java) based upon criteria provided by the user, not
>>> by score.
>>>
>>> After our initial design mushroomed in size, we redesigned, and now (I
>>> thought) we do not have norms on any of the fields in this index. So I'm
>>> wondering if there was something in the CheckIndex results I provided
>>> that indicates to you that we may still have norms enabled? I
>>> know that if you have norms on any one document's field, then any other
>>> document with that same field will get "infected" with norms as well.
>>>
>>> My understanding is that any field that uses the constants
>>> Index.NOT_ANALYZED_NO_NORMS or Index.NO will not have norms on it,
>>> regardless of whether or not the field is stored. Is that not correct?
>>>
>>> Thanks,
>>> Mark
>>>
>>>
>>>
>>> On Nov 4, 2010, at 2:56 AM, Michael McCandless wrote:
>>>
>>>> Likely what happened is you had a bunch of smaller segments, and then
>>>> suddenly they got merged into that one big segment (_aiaz) in your
>>>> index.
>>>>
>>>> The representation for norms in particular is not sparse, so this
>>>> means the size of the norms file for a given segment will be
>>>> number-of-unique-indexed-fields X number-of-documents.
>>>>
>>>> So this count grows quadratically on merge.
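>>>>
>>>> To put illustrative numbers on that: at one norms byte per field per
>>>> document, a merged segment with 300,000 unique indexed fields and
>>>> 6 million documents would need 300,000 x 6,000,000 bytes, i.e. roughly
>>>> 1.8 TB, for norms alone.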
>>>>
>>>> Do these fields really need to be indexed? If so, it'd be better to
>>>> use a single field for all users for the indexable text if you can.
>>>>
>>>> Failing that, a simple workaround is to set the maxMergeMB/Docs on the
>>>> merge policy; this'd prevent big segments from being produced.
>>>> Disabling norms should also workaround this, though that will affect
>>>> hit scores...
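>>>>
>>>> A sketch of the merge-policy workaround (3.0 API; the size/doc caps
>>>> are arbitrary examples, so tune them to your segment sizes):
>>>>
>>>>   // assuming dir (Directory) and analyzer (Analyzer) already exist
>>>>   IndexWriter writer = new IndexWriter(dir, analyzer,
>>>>       IndexWriter.MaxFieldLength.UNLIMITED);
>>>>   LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy(writer);
>>>>   mp.setMaxMergeMB(512.0);      // segments larger than this are never merged
>>>>   mp.setMaxMergeDocs(1000000);  // likewise for doc count
>>>>   writer.setMergePolicy(mp);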
>>>>
>>>> Mike
>>>>
>>>> On Wed, Nov 3, 2010 at 7:37 PM, Mark Kristensson
>>>> <[email protected]> wrote:
>>>>> Yes, we do have a large number of unique field names in that index,
>>>>> because they are driven by user-named fields in our application (with
>>>>> some cleaning to remove illegal chars).
>>>>>
>>>>> This slowness problem has appeared very suddenly in the last couple of
>>>>> weeks, and the number of unique field names has not spiked in that time.
>>>>> Have we crept over some threshold with our linear growth in the number
>>>>> of unique field names? Perhaps we are violating some limit driven by the
>>>>> amount of RAM in the machine? Are there any guidelines for the maximum,
>>>>> or suggested, number of unique field names in an index or segment? Any
>>>>> suggestions for potentially mitigating the problem?
>>>>>
>>>>> Thanks,
>>>>> Mark
>>>>>
>>>>>
>>>>> On Nov 3, 2010, at 2:02 PM, Michael McCandless wrote:
>>>>>
>>>>>> On Wed, Nov 3, 2010 at 4:27 PM, Mark Kristensson
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> I've run CheckIndex against the index and the results are below. The
>>>>>>> net is that it's telling me nothing is wrong with the index.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>> I did not have any instrumentation around the opening of the
>>>>>>> IndexSearcher (we don't use an IndexReader), just around the actual
>>>>>>> query execution, so I had to add some additional logging. What I found
>>>>>>> surprised me: opening a searcher against this index takes the same 6
>>>>>>> to 8 seconds that closing the IndexWriter takes.
>>>>>>
>>>>>> IndexWriter opens a SegmentReader for each segment in the index, to
>>>>>> apply deletions, so I think this is the source of the slowness.
>>>>>>
>>>>>> From the CheckIndex output, it looks like you have many (296,713)
>>>>>> unique field names in that one large segment -- does that sound
>>>>>> right? I suspect such a very high field count is the source of the
>>>>>> slowness...
>>>>>>
>>>>>> Mike
>>>>>>
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]