[
https://issues.apache.org/jira/browse/LUCENE-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959439#comment-13959439
]
Paul Smith commented on LUCENE-1761:
------------------------------------
[This comment stems from a discussion on the ElasticSearch mailing list:
http://elasticsearch-users.115913.n3.nabble.com/Removing-unused-fields-more-Lucene-than-ES-but-td4053205.html]
Just thought I would jot down some real-world ‘experience’ on why I would love
to see this addressed. For background, we currently use ElasticSearch 0.19.10
(yes, this is old…), which uses Lucene 3.6.1 under the hood. An application bug
slipped in and started generating dynamic fields within the ES index.
Eventually it reached 25,000 unique fields, and performance was super-sluggish.
We eliminated the data that was using these fields and tried optimising, but of
course that doesn’t help: using Luke we can still see the fields are there even
after a complete optimize (which is what this bug report is about).
Within ElasticSearch, the stack trace commonly seen here looks like this:
{noformat}
"elasticsearch[app5.syd.acx][index][T#6398]" daemon prio=10 tid=0x00007fb59459b000 nid=0x7f9a runnable [0x00007fb156407000]
   java.lang.Thread.State: RUNNABLE
	at java.util.Arrays.copyOfRange(Arrays.java:3209)
	at java.lang.String.<init>(String.java:215)
	at java.lang.StringBuilder.toString(StringBuilder.java:430)
	at org.apache.lucene.index.IndexFileNames.segmentFileName(IndexFileNames.java:227)
	at org.apache.lucene.index.IndexFileNames.fileNameFromGeneration(IndexFileNames.java:189)
	at org.apache.lucene.index.SegmentInfo.getNormFileName(SegmentInfo.java:508)
	at org.apache.lucene.index.SegmentReader.reopenSegment(SegmentReader.java:232)
	- locked <0x000000073c73c5e8> (a org.apache.lucene.index.SegmentReader)
	at org.apache.lucene.index.SegmentReader.clone(SegmentReader.java:207)
	- locked <0x000000073c73c5e8> (a org.apache.lucene.index.SegmentReader)
	at org.apache.lucene.index.IndexWriter$ReaderPool.getReadOnlyClone(IndexWriter.java:656)
	- locked <0x000000070b83cd38> (a org.apache.lucene.index.IndexWriter$ReaderPool)
	at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:142)
	at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:36)
	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:451)
	- locked <0x000000070029ac98> (a org.apache.lucene.index.XIndexWriter)
	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:399)
	at org.apache.lucene.index.DirectoryReader.doOpenFromWriter(DirectoryReader.java:413)
	at org.apache.lucene.index.DirectoryReader.doOpenIfChanged(DirectoryReader.java:432)
	at org.apache.lucene.index.DirectoryReader.doOpenIfChanged(DirectoryReader.java:375)
	at org.apache.lucene.index.IndexReader.openIfChanged(IndexReader.java:508)
	at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:109)
	at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:57)
	at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:137)
	at org.elasticsearch.index.engine.robin.RobinEngine.refresh(RobinEngine.java:767)
	- locked <0x00000007001b8ec0> (a java.lang.Object)
	at org.elasticsearch.index.engine.robin.RobinEngine.refreshVersioningTable(RobinEngine.java:926)
	at org.elasticsearch.index.engine.robin.RobinEngine.delete(RobinEngine.java:727)
	at org.elasticsearch.index.shard.service.InternalIndexShard.deleteByQuery(InternalIndexShard.java:381)
	at org.elasticsearch.action.deletebyquery.TransportShardDeleteByQueryAction.shardOperationOnReplica(TransportShardDeleteByQueryAction.java:106)
	at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:255)
	at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:241)
	at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
{noformat}
The crux of the problem is this loop in SegmentReader.java:
{noformat}
boolean[] fieldNormsChanged = new boolean[core.fieldInfos.size()];
final int fieldCount = core.fieldInfos.size();
for (int i = 0; i < fieldCount; i++) {
  if (!this.si.getNormFileName(i).equals(si.getNormFileName(i))) {
    normsUpToDate = false;
    fieldNormsChanged[i] = true;
  }
}
{noformat}
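The cost shows up at the very top of the stack trace: each call to getNormFileName builds a fresh String via StringBuilder, so a single reopen performs one allocation-plus-comparison per field. Here is a stdlib-only sketch of that per-field pattern — the method names and the file-name format are simplified stand-ins for illustration, not Lucene's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model (NOT the real Lucene code) of what reopenSegment does per
// field: build a norms file name for the segment and compare old vs. new.
public class NormFileNameCost {

    // Hypothetical stand-in for SegmentInfo.getNormFileName: allocates a new
    // String on every call, which is why String/StringBuilder dominate the
    // thread dump above.
    static String normFileName(String segmentName, int fieldNumber, long gen) {
        StringBuilder sb = new StringBuilder();
        sb.append(segmentName).append(".s").append(fieldNumber);
        if (gen > 0) {
            sb.append('.').append(gen);
        }
        return sb.toString(); // fresh char[] copy per field, per reopen
    }

    // One reopen touches every field once: O(fieldCount) allocations.
    static List<String> oneReopen(String segmentName, int fieldCount, long gen) {
        List<String> names = new ArrayList<>(fieldCount);
        for (int i = 0; i < fieldCount; i++) {
            names.add(normFileName(segmentName, i, gen));
        }
        return names;
    }

    public static void main(String[] args) {
        // 25,000 fields -> 25,000 throwaway Strings on every single reopen.
        List<String> names = oneReopen("_1", 25_000, 0);
        System.out.println(names.size()); // 25000
        System.out.println(names.get(0)); // _1.s0
    }
}
```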
Each reopen iterates over every field in the index. The outer ES loop is a
function of the number of deleted documents. So the pathological case is a
write-intensive workload (us) where many deleted documents accumulate in an
index with a large number of fields... ouch.
At first I thought optimizing would help a lot (it does help a bit), but the
inner loop over 25,000 fields in this case is a killer.
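To put rough numbers on that multiplication — the refresh count below is a made-up illustration; only the 25,000-field figure comes from our incident:

```java
// Back-of-envelope model of the pathological case described above: the outer
// loop scales with accumulated deletes/refreshes, the inner loop with the
// field count, so total work is their product.
public class ReopenCost {

    // Total per-field comparisons performed across a burst of refreshes.
    static long totalWork(long refreshes, long fieldCount) {
        return refreshes * fieldCount;
    }

    public static void main(String[] args) {
        long healthy = totalWork(10_000, 50);     // a sane mapping
        long broken  = totalWork(10_000, 25_000); // our runaway mapping
        System.out.println(healthy); // 500000
        System.out.println(broken);  // 250000000
    }
}
```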
The performance degradation is probably more a consequence of how ElasticSearch
drives Lucene, but I think keeping dead metadata inside the Lucene index
structure with no way to clean it up is asking for trouble, and honestly it
shouldn’t be that hard to fix, right? If dead fields were culled as segments
are merged, this would just fix itself naturally, wouldn’t it?
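For what it’s worth, here is a toy, stdlib-only model of that merge-time culling idea. The types and method names are hypothetical, not Lucene’s actual merge API: the point is simply that field metadata survives a merge only if some live (non-deleted) document still uses it:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model of merge-time field culling (illustrative shapes, not Lucene's
// merge API): the merged segment keeps only field names still referenced by
// live documents, instead of every name the old segments had ever seen.
public class MergeCullsFields {

    // A "document" is just a deletion flag plus the field names it uses.
    record Doc(boolean deleted, List<String> fields) {}

    // Merge: a field survives only if some live doc across the input
    // segments still uses it.
    static Set<String> mergedFieldInfos(List<Doc> docsAcrossSegments) {
        Set<String> live = new LinkedHashSet<>();
        for (Doc d : docsAcrossSegments) {
            if (!d.deleted) {
                live.addAll(d.fields);
            }
        }
        return live;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
            new Doc(false, List.of("title", "body")),
            new Doc(true,  List.of("title", "buggy_dynamic_field_0042")), // deleted
            new Doc(false, List.of("body")));
        // The dead dynamic field no longer appears after the merge.
        System.out.println(mergedFieldInfos(docs)); // [title, body]
    }
}
```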
Anyway, I just wanted to highlight a specific, painful experience where the
lack of recovery options really hurt us. So I’d appreciate it if this gets
worked on! :)
> low level Field metadata is never removed from index
> ----------------------------------------------------
>
> Key: LUCENE-1761
> URL: https://issues.apache.org/jira/browse/LUCENE-1761
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1
> Reporter: Hoss Man
> Priority: Minor
> Labels: gsoc2014
> Attachments: LUCENE-1761.patch
>
>
> with heterogeneous docs, or an index whose fields evolve over time, field
> names that are no longer used (ie: all docs that ever referenced them have
> been deleted) still show up when you use IndexReader.getFieldNames.
> It seems logical that segment merging should only preserve metadata about
> fields that actually exist in the new segment, but even after deleting all
> documents from an index and optimizing, the old field names are still present.
--
This message was sent by Atlassian JIRA
(v6.2#6252)