[
https://issues.apache.org/jira/browse/LUCENE-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959439#comment-13959439
]
Paul Smith commented on LUCENE-1761:
------------------------------------
[This comment stems from a discussion on the ElasticSearch mailing list:
http://elasticsearch-users.115913.n3.nabble.com/Removing-unused-fields-more-Lucene-than-ES-but-td4053205.html]
Just thought I would jot down some real-world ‘experience’ on why I would love
to see this addressed. For background, we currently use ElasticSearch 0.19.10
(yes, this is old…), which uses Lucene 3.6.1 under the hood. An application bug
slipped in and started generating dynamic fields within the ES index.
Eventually it reached 25,000 unique fields, and performance was super-sluggish.
We eliminated the data that was using these fields and tried optimising, but of
course that doesn’t help: using Luke we can still see the fields are there even
after a complete optimize (which is what this bug report is about).
Within ElasticSearch, the stack trace commonly seen here looks like this:
{noformat}
"elasticsearch[app5.syd.acx][index][T#6398]" daemon prio=10 tid=0x00007fb59459b000 nid=0x7f9a runnable [0x00007fb156407000]
   java.lang.Thread.State: RUNNABLE
	at java.util.Arrays.copyOfRange(Arrays.java:3209)
	at java.lang.String.<init>(String.java:215)
	at java.lang.StringBuilder.toString(StringBuilder.java:430)
	at org.apache.lucene.index.IndexFileNames.segmentFileName(IndexFileNames.java:227)
	at org.apache.lucene.index.IndexFileNames.fileNameFromGeneration(IndexFileNames.java:189)
	at org.apache.lucene.index.SegmentInfo.getNormFileName(SegmentInfo.java:508)
	at org.apache.lucene.index.SegmentReader.reopenSegment(SegmentReader.java:232)
	- locked <0x000000073c73c5e8> (a org.apache.lucene.index.SegmentReader)
	at org.apache.lucene.index.SegmentReader.clone(SegmentReader.java:207)
	- locked <0x000000073c73c5e8> (a org.apache.lucene.index.SegmentReader)
	at org.apache.lucene.index.IndexWriter$ReaderPool.getReadOnlyClone(IndexWriter.java:656)
	- locked <0x000000070b83cd38> (a org.apache.lucene.index.IndexWriter$ReaderPool)
	at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:142)
	at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:36)
	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:451)
	- locked <0x000000070029ac98> (a org.apache.lucene.index.XIndexWriter)
	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:399)
	at org.apache.lucene.index.DirectoryReader.doOpenFromWriter(DirectoryReader.java:413)
	at org.apache.lucene.index.DirectoryReader.doOpenIfChanged(DirectoryReader.java:432)
	at org.apache.lucene.index.DirectoryReader.doOpenIfChanged(DirectoryReader.java:375)
	at org.apache.lucene.index.IndexReader.openIfChanged(IndexReader.java:508)
	at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:109)
	at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:57)
	at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:137)
	at org.elasticsearch.index.engine.robin.RobinEngine.refresh(RobinEngine.java:767)
	- locked <0x00000007001b8ec0> (a java.lang.Object)
	at org.elasticsearch.index.engine.robin.RobinEngine.refreshVersioningTable(RobinEngine.java:926)
	at org.elasticsearch.index.engine.robin.RobinEngine.delete(RobinEngine.java:727)
	at org.elasticsearch.index.shard.service.InternalIndexShard.deleteByQuery(InternalIndexShard.java:381)
	at org.elasticsearch.action.deletebyquery.TransportShardDeleteByQueryAction.shardOperationOnReplica(TransportShardDeleteByQueryAction.java:106)
	at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:255)
	at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:241)
	at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
{noformat}
The crux of the problem is this loop in SegmentReader.java:
{noformat}
boolean[] fieldNormsChanged = new boolean[core.fieldInfos.size()];
final int fieldCount = core.fieldInfos.size();
for (int i = 0; i < fieldCount; i++) {
  if (!this.si.getNormFileName(i).equals(si.getNormFileName(i))) {
    normsUpToDate = false;
    fieldNormsChanged[i] = true;
  }
}
{noformat}
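The cost shows up at the very top of the stack trace: each call to getNormFileName builds a fresh String via StringBuilder, so a single reopen performs one allocation-plus-comparison per field. Here is a stdlib-only sketch of that per-field pattern — the method names and the file-name format are simplified stand-ins for illustration, not Lucene's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model (NOT the real Lucene code) of what reopenSegment does per
// field: build a norms file name for the segment and compare old vs. new.
public class NormFileNameCost {

    // Hypothetical stand-in for SegmentInfo.getNormFileName: allocates a new
    // String on every call, which is why String/StringBuilder dominate the
    // thread dump above.
    static String normFileName(String segmentName, int fieldNumber, long gen) {
        StringBuilder sb = new StringBuilder();
        sb.append(segmentName).append(".s").append(fieldNumber);
        if (gen > 0) {
            sb.append('.').append(gen);
        }
        return sb.toString(); // fresh char[] copy per field, per reopen
    }

    // One reopen touches every field once: O(fieldCount) allocations.
    static List<String> oneReopen(String segmentName, int fieldCount, long gen) {
        List<String> names = new ArrayList<>(fieldCount);
        for (int i = 0; i < fieldCount; i++) {
            names.add(normFileName(segmentName, i, gen));
        }
        return names;
    }

    public static void main(String[] args) {
        // 25,000 fields -> 25,000 throwaway Strings on every single reopen.
        List<String> names = oneReopen("_1", 25_000, 0);
        System.out.println(names.size()); // 25000
        System.out.println(names.get(0)); // _1.s0
    }
}
```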
Each reopen iterates over every field in the index. The outer ES loop is a
function of the number of deleted documents. So the pathological case is a
write-intensive workload (us) where many deleted documents accumulate in an
index with a large number of fields... ouch.
At first I thought optimizing would help a lot (it does help a bit), but the
inner loop over 25,000 fields in this case is a killer.
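To put rough numbers on that multiplication — the refresh count below is a made-up illustration; only the 25,000-field figure comes from our incident:

```java
// Back-of-envelope model of the pathological case described above: the outer
// loop scales with accumulated deletes/refreshes, the inner loop with the
// field count, so total work is their product.
public class ReopenCost {

    // Total per-field comparisons performed across a burst of refreshes.
    static long totalWork(long refreshes, long fieldCount) {
        return refreshes * fieldCount;
    }

    public static void main(String[] args) {
        long healthy = totalWork(10_000, 50);     // a sane mapping
        long broken  = totalWork(10_000, 25_000); // our runaway mapping
        System.out.println(healthy); // 500000
        System.out.println(broken);  // 250000000
    }
}
```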
The performance degradation is probably more a consequence of how ElasticSearch
drives Lucene, but I think keeping dead metadata inside the Lucene index
structure with no way to clean it up is asking for trouble, and honestly it
shouldn’t be that hard to fix, right? If dead fields were culled as segments
are merged, this would just fix itself naturally, wouldn’t it?
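For what it’s worth, here is a toy, stdlib-only model of that merge-time culling idea. The types and method names are hypothetical, not Lucene’s actual merge API: the point is simply that field metadata survives a merge only if some live (non-deleted) document still uses it:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model of merge-time field culling (illustrative shapes, not Lucene's
// merge API): the merged segment keeps only field names still referenced by
// live documents, instead of every name the old segments had ever seen.
public class MergeCullsFields {

    // A "document" is just a deletion flag plus the field names it uses.
    record Doc(boolean deleted, List<String> fields) {}

    // Merge: a field survives only if some live doc across the input
    // segments still uses it.
    static Set<String> mergedFieldInfos(List<Doc> docsAcrossSegments) {
        Set<String> live = new LinkedHashSet<>();
        for (Doc d : docsAcrossSegments) {
            if (!d.deleted) {
                live.addAll(d.fields);
            }
        }
        return live;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
            new Doc(false, List.of("title", "body")),
            new Doc(true,  List.of("title", "buggy_dynamic_field_0042")), // deleted
            new Doc(false, List.of("body")));
        // The dead dynamic field no longer appears after the merge.
        System.out.println(mergedFieldInfos(docs)); // [title, body]
    }
}
```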
Anyway, I just wanted to highlight a specific, painful experience where the
lack of recovery options really hurt us. So I’d appreciate it if this gets
worked on! :)
> low level Field metadata is never removed from index
> ----------------------------------------------------
>
> Key: LUCENE-1761
> URL: https://issues.apache.org/jira/browse/LUCENE-1761
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1
> Reporter: Hoss Man
> Priority: Minor
> Labels: gsoc2014
> Attachments: LUCENE-1761.patch
>
>
> with heterogeneous docs, or an index whose fields evolve over time, field
> names that are no longer used (ie: all docs that ever referenced them have
> been deleted) still show up when you use IndexReader.getFieldNames.
> It seems logical that segment merging should only preserve metadata about
> fields that actually exist in the new segment, but even after deleting all
> documents from an index and optimizing, the old field names are still present.
--
This message was sent by Atlassian JIRA
(v6.2#6252)