[jira] [Commented] (LUCENE-4930) Lucene's use of WeakHashMap at index time prevents full use of cores on some multi-core machines, due to contention

Uwe Schindler (JIRA) Fri, 12 Apr 2013 11:48:17 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630472#comment-13630472 ]


Uwe Schindler commented on LUCENE-4930:
---------------------------------------

Hi,

{quote}
However I'm not really sure if I understand the reason for lucene to actually 
use a WeakKeyHashMap here:
I may be wrong but wouldn't that reap actually only happen when the Interface 
class itself is unloaded? That should be an extremely rare thing, or? If I 
understand the purpose of that code correctly it is meant to prevent a memory 
wasting for cases where the user does incremental indexing from time to time. 
In that case the attribute source would prevent the interface class and 
implementation class from being garbage collected in the mean time. But is that 
case actually really worth the effort (I don't know how big the memory 
footprint for an Attribute implementation class usually is)? I mean that would 
only affect the static fields here (and in plain lucene I could not find many 
of those) ...
{quote}

The issue is not class unloading inside your own application while it is 
running; the VM will never do that. It will *only* unload classes when their 
ClassLoader is released. This happens e.g. when you redeploy your web 
application in your Jetty or Tomcat container or (and this is the most 
important reason) when you reload Solr cores: If you have a custom analyzer JAR 
file in your plugins directory that uses custom attributes (like the 
lucene-kuromoji.jar Japanese analyzer), you would have a memory leak. Solr 
loads plugins in its own classloader. If you restart a core, it reinitializes 
its plugins and releases the old classloader. If AttributeSource held strong 
references to these classes, they could never be unloaded. The same happens if 
you have a webapp that uses a lucene-core.jar file from outside the webapp 
(e.g. from the Ubuntu repository in /usr/lib) but ships its own analyzers 
inside the webapp. In that case, the classes could not be unloaded on webapp 
shutdown.

The WeakIdentityMap prevents this big resource leak (permgen issue). If you 
wonder: The values in the map are also held through a WeakReference, because a 
key's cleared weak reference and its Map.Entry are only removed when you 
actually call get() on the map. If you unload the webapp, nobody calls get() 
anymore, so all the Map.Entry objects would still refer to the classes and 
would never be removed.
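
To illustrate the weak-key plus weak-value idea, here is a minimal sketch. It 
uses the JDK's java.util.WeakHashMap instead of Lucene's WeakIdentityMap, and 
the class name WeakClassCache and the resolver parameter are made up for this 
example, so treat it as an approximation and not as the actual Lucene code:

```java
import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

/** Illustrative cache from an interface class to its implementation class. */
final class WeakClassCache {
    // Weak keys: an entry does not keep the interface class reachable.
    // Weak values: an entry that has not been expunged yet does not pin the
    // implementation class (and, through it, its class loader) either.
    private final Map<Class<?>, WeakReference<Class<?>>> cache =
            Collections.synchronizedMap(
                    new WeakHashMap<Class<?>, WeakReference<Class<?>>>());

    /** Returns the cached implementation class, resolving it on a miss. */
    Class<?> implFor(Class<?> iface, Function<Class<?>, Class<?>> resolver) {
        WeakReference<Class<?>> ref = cache.get(iface);
        Class<?> impl = (ref == null) ? null : ref.get();
        if (impl == null) {
            // Two threads may resolve the same interface concurrently;
            // that is harmless here because both compute the same class.
            impl = resolver.apply(iface);
            cache.put(iface, new WeakReference<Class<?>>(impl));
        }
        return impl;
    }

    public static void main(String[] args) {
        WeakClassCache cache = new WeakClassCache();
        // CharSequence -> String is only a stand-in pair for the demo.
        Class<?> impl = cache.implFor(CharSequence.class, iface -> String.class);
        System.out.println(impl.getName());
    }
}
```

Note that every lookup in this sketch goes through a synchronized map, so it 
shows the leak protection only; it does nothing about the contention reported 
in this issue.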

One optimization might be possible: As the number of classes in this map is 
very low and the important thing is to release the class reference when it is 
no longer needed, we could add an option to WeakIdentityMap to make reap() a 
no-op. This would keep the stale WeakReferences and Map.Entry objects in the 
map, but the classes could still be freed. The overhead would be minimal (you 
can count the number of entries on your fingers), and the leftover 
WeakReferences in the map would be no problem.
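
A rough sketch of what such an option could look like follows. This is not 
the real WeakIdentityMap; the class SimpleWeakIdentityMap and the flag 
reapEnabled are hypothetical names chosen for this example. With the flag 
off, reads and writes never touch the ReferenceQueue, so there is no shared 
lock on that path, at the cost of stale entries staying in the map:

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical weak-keyed identity map with an optional reap step. */
final class SimpleWeakIdentityMap<K, V> {

    /** Weak key wrapper that compares referents by identity (==). */
    private static final class IdentityWeakKey<K> extends WeakReference<K> {
        private final int hash;

        IdentityWeakKey(K key, ReferenceQueue<? super K> queue) {
            super(key, queue);
            this.hash = System.identityHashCode(key);
        }

        @Override public int hashCode() { return hash; }

        @Override public boolean equals(Object o) {
            if (this == o) return true; // lets a cleared key remove itself
            if (!(o instanceof IdentityWeakKey)) return false;
            Object mine = get();
            return mine != null && mine == ((IdentityWeakKey<?>) o).get();
        }
    }

    private final ConcurrentHashMap<IdentityWeakKey<K>, V> map =
            new ConcurrentHashMap<IdentityWeakKey<K>, V>();
    private final ReferenceQueue<K> queue = new ReferenceQueue<K>();
    private final boolean reapEnabled;

    SimpleWeakIdentityMap(boolean reapEnabled) {
        this.reapEnabled = reapEnabled;
    }

    V get(K key) {
        reap();
        return map.get(new IdentityWeakKey<K>(key, null));
    }

    V put(K key, V value) {
        reap();
        // Without reaping there is no point in enqueueing cleared keys.
        return map.put(new IdentityWeakKey<K>(key, reapEnabled ? queue : null), value);
    }

    int size() { return map.size(); }

    /** Removes entries whose keys were collected; a no-op if disabled. */
    void reap() {
        if (!reapEnabled) return;
        Reference<? extends K> dead;
        while ((dead = queue.poll()) != null) {
            map.remove(dead);
        }
    }

    public static void main(String[] args) {
        SimpleWeakIdentityMap<String, Integer> map =
                new SimpleWeakIdentityMap<String, Integer>(false);
        String key = new String("id");
        map.put(key, 1);
        System.out.println(map.get(key));              // same object: found
        System.out.println(map.get(new String("id"))); // equal but not identical
    }
}
```

For the values to be freed as well when reaping is off, they would still have 
to be wrapped in WeakReferences by the caller, as described above.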

Another approach would be to give DefaultAttributeFactory a lookup table 
(without weak keys) for all Lucene-internal attributes (which are the only ones 
actually used by IndexWriter). I would prefer this approach.
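
As a sketch of this second approach: a read-only table with strong references 
serves the built-in attributes without any locking, and everything else falls 
back to a slower lookup. All names below (BuiltInAttributeLookup, TermAttr, 
CustomAttr and their Impl classes) are hypothetical stand-ins, and resolving 
the implementation by appending "Impl" to the interface name is an assumption 
of this example:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical two-tier lookup: strong table first, reflection second. */
final class BuiltInAttributeLookup {
    // Stand-ins for attribute interfaces and their implementations.
    interface TermAttr {}
    static final class TermAttrImpl implements TermAttr {}
    interface CustomAttr {}
    static final class CustomAttrImpl implements CustomAttr {}

    // Strong references are fine here: these classes come from the same
    // class loader as the table itself, so the table pins no foreign loader.
    // The map is never modified after static initialization, so concurrent
    // reads need no synchronization.
    private static final Map<Class<?>, Class<?>> BUILT_IN;
    static {
        Map<Class<?>, Class<?>> m = new HashMap<Class<?>, Class<?>>();
        m.put(TermAttr.class, TermAttrImpl.class);
        BUILT_IN = Collections.unmodifiableMap(m);
    }

    static Class<?> implFor(Class<?> iface) {
        Class<?> impl = BUILT_IN.get(iface);
        if (impl != null) {
            return impl; // fast path for built-in attributes
        }
        // Slow path for custom attributes. A real implementation would cache
        // this result behind weak references instead of resolving every time.
        try {
            return Class.forName(iface.getName() + "Impl", true,
                    iface.getClassLoader());
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException(
                    "No implementation found for " + iface.getName(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(implFor(TermAttr.class).getSimpleName());
        System.out.println(implFor(CustomAttr.class).getSimpleName());
    }
}
```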

In general, the big issue you see in Lucene 4.x is that StringField does not 
reuse its TokenStream (see LUCENE-4931). This would be easy to fix, but it 
requires that you reuse StringField instances (like the primary key) in your 
Documents.
                
> Lucene's use of WeakHashMap at index time prevents full use of cores on some 
> multi-core machines, due to contention
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-4930
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4930
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 4.2
>         Environment: Dell blade system with 16 cores
>            Reporter: Karl Wright
>         Attachments: thread_dump.txt
>
>
> Our project is not optimally using full processing power under 
> indexing load on Lucene 4.2.0.  The reason is the 
> AttributeSource.addAttribute() method, which goes through a WeakHashMap 
> synchronizer, which is apparently single-threaded for a significant amount of 
> time.  Have a look at the following trace:
> "pool-1-thread-28" prio=10 tid=0x00007f47fc104800 nid=0x672b waiting for monitor entry [0x00007f47d19ed000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at java.lang.ref.ReferenceQueue.poll(ReferenceQueue.java:98)
>         - waiting to lock <0x00000005c5cd9988> (a java.lang.ref.ReferenceQueue$Lock)
>         at org.apache.lucene.util.WeakIdentityMap.reap(WeakIdentityMap.java:189)
>         at org.apache.lucene.util.WeakIdentityMap.get(WeakIdentityMap.java:82)
>         at org.apache.lucene.util.AttributeSource$AttributeFactory$DefaultAttributeFactory.getClassForInterface(AttributeSource.java:74)
>         at org.apache.lucene.util.AttributeSource$AttributeFactory$DefaultAttributeFactory.createAttributeInstance(AttributeSource.java:65)
>         at org.apache.lucene.util.AttributeSource.addAttribute(AttributeSource.java:271)
>         at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:107)
>         at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:254)
>         at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:256)
>         at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:376)
>         at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1473)
>         at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1148)
>         at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1129)
> …
> We’ve had to make significant changes to the way we were indexing in order to 
> not hit this issue as much, such as indexing using TokenStreams which we 
> reuse, when it would have been more convenient to index with just tokens.  
> (The reason is that Lucene internally creates TokenStream objects when you 
> pass a token array to IndexableField, and doesn’t reuse them, and the 
> addAttribute() causes massive contention as a result.)  However, as you can 
> see from the trace above, we’re still running into contention due to other 
> addAttribute() method calls that are buried deep inside Lucene.
> I can see two ways forward.  Either not use WeakHashMap or use it in a more 
> efficient way, or make darned sure no addAttribute() calls are done in the 
> main code indexing execution path.  (I think it would be easy to fix 
> DocInverterPerField in that way, FWIW.  I just don’t know what we’ll run into 
> next.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

