Currently the Elasticsearch /_cluster/health endpoint shows a status of "red". The
error I get [1] indicates some kind of index corruption. I tried these two
commands to fix the unassigned shards:
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"allocate":{"index":"graylog2_6","shard":1,"node":"gl-es01-esgl2","allow_primary":true}}]}'
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"allocate":{"index":"graylog2_6","shard":2,"node":"gl-es01-esgl2","allow_primary":true}}]}'
Once those commands are executed, the two shards flip between UNASSIGNED and
INITIALIZING constantly, and the logs rapidly print errors about the codec
footer mismatch [1].
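For reference, the "red" status comes from the cluster health API; this is how I re-check it (assuming Elasticsearch is listening on localhost:9200, as in the reroute commands above):

```shell
# Overall cluster health; "status" will be green, yellow, or red
curl -s 'localhost:9200/_cluster/health?pretty'

# Per-index health, to see which indexes are red
curl -s 'localhost:9200/_cluster/health?level=indices&pretty'
```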
I only have one Elasticsearch node at this time, and the last snapshot was
taken a few weeks before this power outage. I'd probably lose less data by
removing the corrupt indexes, if that's possible and my only option.
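If deleting really is the only way forward, this is a minimal sketch of what I had in mind (assuming Elasticsearch on localhost:9200; note this is irreversible and removes all four shards of the index, including the two that are still healthy):

```shell
# Permanently delete the corrupt index (all shards, healthy ones included)
curl -XDELETE 'localhost:9200/graylog2_6'

# Confirm the cluster health afterwards
curl -s 'localhost:9200/_cluster/health?pretty'
```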
Thanks,
- Trey
[1]:
[2016-01-13 17:52:11,637][WARN ][cluster.action.shard ] [gl-es01-esgl2] [graylog2_6][2] received shard failed for [graylog2_6][2], node[2c5VN2XqSNe8DG2Ix8BY9A], [P], s[INITIALIZING], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-01-13T23:52:09.754Z]], indexUUID [xBeW-OuARQqCHqjkHSVbVw], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[graylog2_6][2] failed to fetch index version after copying it over]; nested: CorruptIndexException[[graylog2_6][2] Preexisting corrupted index [corrupted_y0yPuZiBSxuZxY10TXMDCQ] caused by: IOException[failed engine (reason: [corrupt file (source: [start])])]
CorruptIndexException[codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: MMapIndexInput(path="/var/lib/elasticsearch-data/esgl2/gl2/nodes/0/indices/graylog2_6/2/index/_ew6_Lucene41_0.tim"))]
java.io.IOException: failed engine (reason: [corrupt file (source: [start])])
at org.elasticsearch.index.engine.Engine.failEngine(Engine.java:492)
at org.elasticsearch.index.engine.Engine.maybeFailEngine(Engine.java:514)
at org.elasticsearch.index.engine.InternalEngine.maybeFailEngine(InternalEngine.java:928)
at org.elasticsearch.index.engine.InternalEngine.createSearcherManager(InternalEngine.java:195)
at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:146)
at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:32)
at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1355)
at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1350)
at org.elasticsearch.index.shard.IndexShard.prepareForTranslogRecovery(IndexShard.java:870)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:233)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.index.CorruptIndexException: codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: MMapIndexInput(path="/var/lib/elasticsearch-data/esgl2/gl2/nodes/0/indices/graylog2_6/2/index/_ew6_Lucene41_0.tim"))
at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:235)
at org.apache.lucene.codecs.CodecUtil.retrieveChecksum(CodecUtil.java:228)
at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:137)
at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:441)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:197)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:254)
at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:239)
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:109)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:421)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:112)
at org.apache.lucene.search.SearcherManager.<init>(SearcherManager.java:89)
at org.elasticsearch.index.engine.InternalEngine.createSearcherManager(InternalEngine.java:186)
... 10 more
]; ]]
[2016-01-13 17:52:12,353][WARN ][cluster.action.shard ] [gl-es01-esgl2] [graylog2_6][2] received shard failed for [graylog2_6][2], node[2c5VN2XqSNe8DG2Ix8BY9A], [P], s[INITIALIZING], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-01-13T23:52:09.754Z]], indexUUID [xBeW-OuARQqCHqjkHSVbVw], reason [master [gl-es01-esgl2][2c5VN2XqSNe8DG2Ix8BY9A][gl-es01.brazos.tamu.edu][inet[/192.168.200.93:9300]] marked shard as initializing, but shard is marked as failed, resend shard failure]
On Friday, January 15, 2016 at 3:19:22 AM UTC-6, Jochen Schalanda wrote:
>
> Hi Trey,
>
> do you see any other error messages in the logs of your Elasticsearch
> node(s)? Graylog won't start writing into the Elasticsearch cluster again
> until the health of index "graylog2_6" (which is probably the current
> deflector target) is GREEN again.
>
>
> Cheers,
> Jochen
>
> On Wednesday, 13 January 2016 19:08:41 UTC+1, Trey Dockendorf wrote:
>>
>> Our data center recently had a catastrophic power loss. Once everything
>> was back up, Graylog refused to start [1], and the issue seems to be
>> corrupt Elasticsearch indexes [2]. I've attempted rerouting the shards,
>> but that has not worked. I fear my only option is to delete the corrupt
>> indexes, but I'm unsure what kind of impact that will have on Graylog.
>>
>> Any advice is greatly appreciated.
>>
>> Thanks,
>> - Trey
>>
>> [1]:
>> 2016-01-13T12:01:24.886-06:00 WARN [BlockingBatchedESOutput] Error while waiting for healthy Elasticsearch cluster. Not flushing.
>> java.util.concurrent.TimeoutException: Elasticsearch cluster didn't get healthy within timeout
>> at org.graylog2.indexer.cluster.Cluster.waitForConnectedAndHealthy(Cluster.java:174)
>> at org.graylog2.indexer.cluster.Cluster.waitForConnectedAndHealthy(Cluster.java:179)
>> at org.graylog2.outputs.BlockingBatchedESOutput.flush(BlockingBatchedESOutput.java:112)
>> at org.graylog2.outputs.BlockingBatchedESOutput.write(BlockingBatchedESOutput.java:105)
>> at org.graylog2.buffers.processors.OutputBufferProcessor$1.run(OutputBufferProcessor.java:189)
>> at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>> 2016-01-13T12:01:51.526-06:00 INFO [IndexRetentionThread] Elasticsearch cluster not available, skipping index retention checks.
>>
>> [2]:
>> # curl -XGET http://localhost:9200/_cat/shards
>> graylog2_6 0 p STARTED 2700869 398.6mb 192.168.200.93 gl-es01-esgl2
>> graylog2_6 3 p STARTED 2694087 397mb 192.168.200.93 gl-es01-esgl2
>> graylog2_6 1 p UNASSIGNED
>> graylog2_6 2 p UNASSIGNED
>> graylog2_4 0 p STARTED 5040142 796.3mb 192.168.200.93 gl-es01-esgl2
>> graylog2_4 3 p STARTED 5003263 789.1mb 192.168.200.93 gl-es01-esgl2
>> graylog2_4 1 p STARTED 5001091 788.6mb 192.168.200.93 gl-es01-esgl2
>> graylog2_4 2 p STARTED 4958409 783.6mb 192.168.200.93 gl-es01-esgl2
>> graylog2_5 0 p STARTED 5019303 743.4mb 192.168.200.93 gl-es01-esgl2
>> graylog2_5 3 p STARTED 5001404 739.3mb 192.168.200.93 gl-es01-esgl2
>> graylog2_5 1 p STARTED 5000525 739.3mb 192.168.200.93 gl-es01-esgl2
>> graylog2_5 2 p STARTED 4979658 737.7mb 192.168.200.93 gl-es01-esgl2
>> graylog2_2 0 p STARTED 5023249 985.4mb 192.168.200.93 gl-es01-esgl2
>> graylog2_2 3 p STARTED 4999096 979.1mb 192.168.200.93 gl-es01-esgl2
>> graylog2_2 1 p STARTED 5001476 980.1mb 192.168.200.93 gl-es01-esgl2
>> graylog2_2 2 p STARTED 4976833 973.5mb 192.168.200.93 gl-es01-esgl2
>> graylog2_3 0 p STARTED 5000546 1gb 192.168.200.93 gl-es01-esgl2
>> graylog2_3 3 p STARTED 4998766 1gb 192.168.200.93 gl-es01-esgl2
>> graylog2_3 1 p STARTED 4999378 1gb 192.168.200.93 gl-es01-esgl2
>> graylog2_3 2 p STARTED 5001326 1gb 192.168.200.93 gl-es01-esgl2
>> graylog2_0 0 p STARTED 3686796 819.8mb 192.168.200.93 gl-es01-esgl2
>> graylog2_0 3 p STARTED 3655173 812.8mb 192.168.200.93 gl-es01-esgl2
>> graylog2_0 1 p STARTED 3655623 813mb 192.168.200.93 gl-es01-esgl2
>> graylog2_0 2 p STARTED 3625428 805.7mb 192.168.200.93 gl-es01-esgl2
>> graylog2_1 0 p STARTED 5053805 1gb 192.168.200.93 gl-es01-esgl2
>> graylog2_1 3 p STARTED 5000588 1gb 192.168.200.93 gl-es01-esgl2
>> graylog2_1 1 p STARTED 5001749 1gb 192.168.200.93 gl-es01-esgl2
>> graylog2_1 2 p STARTED 4943861 1gb 192.168.200.93 gl-es01-esgl2
>>
>
--
You received this message because you are subscribed to the Google Groups
"Graylog Users" group.