Thanks for the suggestion. Unfortunately, CheckIndex did not find any problems with my broken indexes; both CheckIndex commands returned "No problems were detected with this index."
Deleting the index in Graylog worked, the cluster is now green again. Thanks!

- Trey

On Friday, January 15, 2016 at 5:11:45 PM UTC-6, Jochen Schalanda wrote:

> Hi Trey,
>
> it looks like the underlying Lucene data files for the index "graylog2_6" have been corrupted (e.g. because disk space has been exhausted or the file system garbled them due to the power outage). You can try to recover them with the rather low-level Lucene utility CheckIndex (see https://joejulian.name/blog/how-to-check-and-fix-indexes-for-elasticsearchlogstash/).
>
> If that fails, try to remove the corrupt index either through the Graylog web interface (System -> Indices -> Delete) or via the Elasticsearch Delete Index API (https://www.elastic.co/guide/en/elasticsearch/reference/1.7/indices-delete-index.html).
>
> As a last resort, you can try deleting the index files from disk, bypassing Elasticsearch, but you should delete them on every node in the cluster while Elasticsearch is not running.
>
> Cheers,
> Jochen
>
> On Friday, 15 January 2016 22:41:25 UTC+1, Trey Dockendorf wrote:
>
>> Currently the Elasticsearch /_cluster/health shows status of "red". The error I get [1] shows some kind of corruption. I tried these two commands to fix the unassigned shards:
>>
>> curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"allocate":{"index":"graylog2_6","shard":1,"node":"gl-es01-esgl2","allow_primary":true}}]}'
>> curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"allocate":{"index":"graylog2_6","shard":2,"node":"gl-es01-esgl2","allow_primary":true}}]}'
>>
>> Once those commands are executed, the two shards flip between UNASSIGNED and INITIALIZING constantly, and the logs rapidly print errors regarding the footer mismatch [1].
>>
>> I only have one Elasticsearch node at this time, and the last snapshot was a few weeks before this power outage.
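For reference, the recovery steps Jochen describes might look roughly like the sketch below. The Lucene jar location and shard directory are assumptions for illustration, not paths taken from this thread, and CheckIndex must be run while Elasticsearch is stopped.

```shell
# Sketch only -- adjust LUCENE_JAR and SHARD_DIR for your installation.
LUCENE_JAR=/usr/share/elasticsearch/lib/lucene-core-4.10.4.jar
SHARD_DIR=/var/lib/elasticsearch-data/esgl2/gl2/nodes/0/indices/graylog2_6/2/index

# Read-only inspection of the shard's segment files:
#   java -cp "$LUCENE_JAR" org.apache.lucene.index.CheckIndex "$SHARD_DIR"

# Destructive repair: -fix truncates unrecoverable segments, so any
# documents in them are lost for good.
#   java -cp "$LUCENE_JAR" org.apache.lucene.index.CheckIndex "$SHARD_DIR" -fix

# If repair fails, drop the whole index via the Delete Index API:
#   curl -XDELETE 'http://localhost:9200/graylog2_6'
```

Deleting the index through the Graylog web interface (System -> Indices -> Delete) accomplishes the same thing while keeping Graylog's own index bookkeeping consistent, which is what ultimately worked here.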
>> I'd probably lose less data by removing the corrupt indexes, if that's possible and my only option.
>>
>> Thanks,
>> - Trey
>>
>> [1]:
>>
>> [2016-01-13 17:52:11,637][WARN ][cluster.action.shard ] [gl-es01-esgl2] [graylog2_6][2] received shard failed for [graylog2_6][2], node[2c5VN2XqSNe8DG2Ix8BY9A], [P], s[INITIALIZING], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-01-13T23:52:09.754Z]], indexUUID [xBeW-OuARQqCHqjkHSVbVw], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[graylog2_6][2] failed to fetch index version after copying it over]; nested: CorruptIndexException[[graylog2_6][2] Preexisting corrupted index [corrupted_y0yPuZiBSxuZxY10TXMDCQ] caused by: IOException[failed engine (reason: [corrupt file (source: [start])])]
>> CorruptIndexException[codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: MMapIndexInput(path="/var/lib/elasticsearch-data/esgl2/gl2/nodes/0/indices/graylog2_6/2/index/_ew6_Lucene41_0.tim"))]
>> java.io.IOException: failed engine (reason: [corrupt file (source: [start])])
>>     at org.elasticsearch.index.engine.Engine.failEngine(Engine.java:492)
>>     at org.elasticsearch.index.engine.Engine.maybeFailEngine(Engine.java:514)
>>     at org.elasticsearch.index.engine.InternalEngine.maybeFailEngine(InternalEngine.java:928)
>>     at org.elasticsearch.index.engine.InternalEngine.createSearcherManager(InternalEngine.java:195)
>>     at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:146)
>>     at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:32)
>>     at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1355)
>>     at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1350)
>>     at org.elasticsearch.index.shard.IndexShard.prepareForTranslogRecovery(IndexShard.java:870)
>>     at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:233)
>>     at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.apache.lucene.index.CorruptIndexException: codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: MMapIndexInput(path="/var/lib/elasticsearch-data/esgl2/gl2/nodes/0/indices/graylog2_6/2/index/_ew6_Lucene41_0.tim"))
>>     at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:235)
>>     at org.apache.lucene.codecs.CodecUtil.retrieveChecksum(CodecUtil.java:228)
>>     at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:137)
>>     at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:441)
>>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:197)
>>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:254)
>>     at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
>>     at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
>>     at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
>>     at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:239)
>>     at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:109)
>>     at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:421)
>>     at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:112)
>>     at org.apache.lucene.search.SearcherManager.<init>(SearcherManager.java:89)
>>     at org.elasticsearch.index.engine.InternalEngine.createSearcherManager(InternalEngine.java:186)
>>     ... 10 more
>> ]; ]]
>> [2016-01-13 17:52:12,353][WARN ][cluster.action.shard ] [gl-es01-esgl2] [graylog2_6][2] received shard failed for [graylog2_6][2], node[2c5VN2XqSNe8DG2Ix8BY9A], [P], s[INITIALIZING], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-01-13T23:52:09.754Z]], indexUUID [xBeW-OuARQqCHqjkHSVbVw], reason [master [gl-es01-esgl2][2c5VN2XqSNe8DG2Ix8BY9A][gl-es01.brazos.tamu.edu][inet[/192.168.200.93:9300]] marked shard as initializing, but shard is marked as failed, resend shard failure]
>>
>> On Friday, January 15, 2016 at 3:19:22 AM UTC-6, Jochen Schalanda wrote:
>>
>>> Hi Trey,
>>>
>>> do you see any other error messages in the logs of your Elasticsearch node(s)? Graylog won't start writing into the Elasticsearch cluster again until the health of index "graylog2_6" (which is probably the current deflector target) is GREEN again.
>>>
>>> Cheers,
>>> Jochen
>>>
>>> On Wednesday, 13 January 2016 19:08:41 UTC+1, Trey Dockendorf wrote:
>>>
>>>> Our data center recently had a catastrophic power loss. Once everything was back up, Graylog refuses to start [1] and the issue seems to be corrupt Elasticsearch indexes [2]. I've attempted rerouting the indexes but that has not worked. I fear my only option is to delete the corrupt indexes, but I'm unsure what kind of impact that will have on Graylog.
>>>>
>>>> Any advice is greatly appreciated.
>>>>
>>>> Thanks,
>>>> - Trey
>>>>
>>>> [1]:
>>>> 2016-01-13T12:01:24.886-06:00 WARN [BlockingBatchedESOutput] Error while waiting for healthy Elasticsearch cluster. Not flushing.
>>>> java.util.concurrent.TimeoutException: Elasticsearch cluster didn't get healthy within timeout
>>>>     at org.graylog2.indexer.cluster.Cluster.waitForConnectedAndHealthy(Cluster.java:174)
>>>>     at org.graylog2.indexer.cluster.Cluster.waitForConnectedAndHealthy(Cluster.java:179)
>>>>     at org.graylog2.outputs.BlockingBatchedESOutput.flush(BlockingBatchedESOutput.java:112)
>>>>     at org.graylog2.outputs.BlockingBatchedESOutput.write(BlockingBatchedESOutput.java:105)
>>>>     at org.graylog2.buffers.processors.OutputBufferProcessor$1.run(OutputBufferProcessor.java:189)
>>>>     at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>     at java.lang.Thread.run(Thread.java:745)
>>>> 2016-01-13T12:01:51.526-06:00 INFO [IndexRetentionThread] Elasticsearch cluster not available, skipping index retention checks.
>>>>
>>>> [2]:
>>>> # curl -XGET http://localhost:9200/_cat/shards
>>>> graylog2_6 0 p STARTED 2700869 398.6mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_6 3 p STARTED 2694087 397mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_6 1 p UNASSIGNED
>>>> graylog2_6 2 p UNASSIGNED
>>>> graylog2_4 0 p STARTED 5040142 796.3mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_4 3 p STARTED 5003263 789.1mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_4 1 p STARTED 5001091 788.6mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_4 2 p STARTED 4958409 783.6mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_5 0 p STARTED 5019303 743.4mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_5 3 p STARTED 5001404 739.3mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_5 1 p STARTED 5000525 739.3mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_5 2 p STARTED 4979658 737.7mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_2 0 p STARTED 5023249 985.4mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_2 3 p STARTED 4999096 979.1mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_2 1 p STARTED 5001476 980.1mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_2 2 p STARTED 4976833 973.5mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_3 0 p STARTED 5000546 1gb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_3 3 p STARTED 4998766 1gb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_3 1 p STARTED 4999378 1gb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_3 2 p STARTED 5001326 1gb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_0 0 p STARTED 3686796 819.8mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_0 3 p STARTED 3655173 812.8mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_0 1 p STARTED 3655623 813mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_0 2 p STARTED 3625428 805.7mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_1 0 p STARTED 5053805 1gb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_1 3 p STARTED 5000588 1gb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_1 1 p STARTED 5001749 1gb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_1 2 p STARTED 4943861 1gb 192.168.200.93 gl-es01-esgl2
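A quick way to pick out only the broken shards from `_cat/shards` output like the listing above: a STARTED row has eight columns, while an UNASSIGNED row stops after the state column, so filtering on column 4 is enough. Against a live cluster this would be `curl -s 'localhost:9200/_cat/shards' | awk '$4 == "UNASSIGNED" {print $1, $2}'`; the sketch below uses a few sample rows from the listing so it runs stand-alone.

```shell
# Filter _cat/shards rows down to (index, shard) pairs that are UNASSIGNED.
awk '$4 == "UNASSIGNED" {print $1, $2}' <<'EOF'
graylog2_6 0 p STARTED 2700869 398.6mb 192.168.200.93 gl-es01-esgl2
graylog2_6 3 p STARTED 2694087 397mb 192.168.200.93 gl-es01-esgl2
graylog2_6 1 p UNASSIGNED
graylog2_6 2 p UNASSIGNED
EOF
# prints:
# graylog2_6 1
# graylog2_6 2
```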
--
You received this message because you are subscribed to the Google Groups "Graylog Users" group. To view this discussion on the web visit https://groups.google.com/d/msgid/graylog2/0e358e7e-7a39-401f-bc5e-dc826677919f%40googlegroups.com.
