Thanks for the suggestion. Unfortunately, CheckIndex did not find any problems with my broken indexes; both CheckIndex commands returned "No problems were detected with this index."
Deleting the index in Graylog worked, the cluster is now green again. Thanks!

- Trey

On Friday, January 15, 2016 at 5:11:45 PM UTC-6, Jochen Schalanda wrote:

> Hi Trey,
>
> it looks like the underlying Lucene data files for the index "graylog2_6" have been corrupted (e.g. because disk space has been exhausted or the file system garbled them due to the power outage). You can try to recover them with the rather low-level Lucene utility CheckIndex (see https://joejulian.name/blog/how-to-check-and-fix-indexes-for-elasticsearchlogstash/).
>
> If that fails, try to remove the corrupt index either through the Graylog web interface (System -> Indices -> Delete) or via the Elasticsearch Delete Index API (https://www.elastic.co/guide/en/elasticsearch/reference/1.7/indices-delete-index.html).
>
> As a last resort, you can try deleting the index files from disk, bypassing Elasticsearch, but you should delete them on every node in the cluster while Elasticsearch is not running.
>
> Cheers,
> Jochen
>
> On Friday, 15 January 2016 22:41:25 UTC+1, Trey Dockendorf wrote:
>
>> Currently the Elasticsearch /_cluster/health shows status of "red". The error I get [1] shows some kind of corruption. I tried these two commands to fix the unassigned shards:
>>
>> curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"allocate":{"index":"graylog2_6","shard":1,"node":"gl-es01-esgl2","allow_primary":true}}]}'
>> curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"allocate":{"index":"graylog2_6","shard":2,"node":"gl-es01-esgl2","allow_primary":true}}]}'
>>
>> Once those commands are executed, the two shards flip between UNASSIGNED and INITIALIZING constantly, and the logs rapidly print errors regarding the footer mismatch [1].
>>
>> I only have one Elasticsearch node at this time, and the last snapshot was a few weeks before this power outage.
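For reference, the recovery steps Jochen describes might look roughly like the sketch below. The Lucene jar location and shard directory are assumptions for illustration, not paths taken from this thread, and CheckIndex must be run while Elasticsearch is stopped.

```shell
# Sketch only -- adjust LUCENE_JAR and SHARD_DIR for your installation.
LUCENE_JAR=/usr/share/elasticsearch/lib/lucene-core-4.10.4.jar
SHARD_DIR=/var/lib/elasticsearch-data/esgl2/gl2/nodes/0/indices/graylog2_6/2/index

# Read-only inspection of the shard's segment files:
#   java -cp "$LUCENE_JAR" org.apache.lucene.index.CheckIndex "$SHARD_DIR"

# Destructive repair: -fix truncates unrecoverable segments, so any
# documents in them are lost for good.
#   java -cp "$LUCENE_JAR" org.apache.lucene.index.CheckIndex "$SHARD_DIR" -fix

# If repair fails, drop the whole index via the Delete Index API:
#   curl -XDELETE 'http://localhost:9200/graylog2_6'
```

Deleting the index through the Graylog web interface (System -> Indices -> Delete) accomplishes the same thing while keeping Graylog's own index bookkeeping consistent, which is what ultimately worked here.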
>> I'd probably lose less data by removing the corrupt indexes, if that's possible and my only option.
>>
>> Thanks,
>> - Trey
>>
>> [1]:
>>
>> [2016-01-13 17:52:11,637][WARN ][cluster.action.shard ] [gl-es01-esgl2] [graylog2_6][2] received shard failed for [graylog2_6][2], node[2c5VN2XqSNe8DG2Ix8BY9A], [P], s[INITIALIZING], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-01-13T23:52:09.754Z]], indexUUID [xBeW-OuARQqCHqjkHSVbVw], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[graylog2_6][2] failed to fetch index version after copying it over]; nested: CorruptIndexException[[graylog2_6][2] Preexisting corrupted index [corrupted_y0yPuZiBSxuZxY10TXMDCQ] caused by: IOException[failed engine (reason: [corrupt file (source: [start])])]
>> CorruptIndexException[codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: MMapIndexInput(path="/var/lib/elasticsearch-data/esgl2/gl2/nodes/0/indices/graylog2_6/2/index/_ew6_Lucene41_0.tim"))]
>> java.io.IOException: failed engine (reason: [corrupt file (source: [start])])
>>     at org.elasticsearch.index.engine.Engine.failEngine(Engine.java:492)
>>     at org.elasticsearch.index.engine.Engine.maybeFailEngine(Engine.java:514)
>>     at org.elasticsearch.index.engine.InternalEngine.maybeFailEngine(InternalEngine.java:928)
>>     at org.elasticsearch.index.engine.InternalEngine.createSearcherManager(InternalEngine.java:195)
>>     at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:146)
>>     at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:32)
>>     at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1355)
>>     at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1350)
>>     at org.elasticsearch.index.shard.IndexShard.prepareForTranslogRecovery(IndexShard.java:870)
>>     at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:233)
>>     at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.apache.lucene.index.CorruptIndexException: codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: MMapIndexInput(path="/var/lib/elasticsearch-data/esgl2/gl2/nodes/0/indices/graylog2_6/2/index/_ew6_Lucene41_0.tim"))
>>     at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:235)
>>     at org.apache.lucene.codecs.CodecUtil.retrieveChecksum(CodecUtil.java:228)
>>     at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:137)
>>     at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:441)
>>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:197)
>>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:254)
>>     at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
>>     at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
>>     at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
>>     at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:239)
>>     at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:109)
>>     at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:421)
>>     at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:112)
>>     at org.apache.lucene.search.SearcherManager.<init>(SearcherManager.java:89)
>>     at org.elasticsearch.index.engine.InternalEngine.createSearcherManager(InternalEngine.java:186)
>>     ... 10 more
>> ]; ]]
>> [2016-01-13 17:52:12,353][WARN ][cluster.action.shard ] [gl-es01-esgl2] [graylog2_6][2] received shard failed for [graylog2_6][2], node[2c5VN2XqSNe8DG2Ix8BY9A], [P], s[INITIALIZING], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-01-13T23:52:09.754Z]], indexUUID [xBeW-OuARQqCHqjkHSVbVw], reason [master [gl-es01-esgl2][2c5VN2XqSNe8DG2Ix8BY9A][gl-es01.brazos.tamu.edu][inet[/192.168.200.93:9300]] marked shard as initializing, but shard is marked as failed, resend shard failure]
>>
>> On Friday, January 15, 2016 at 3:19:22 AM UTC-6, Jochen Schalanda wrote:
>>
>>> Hi Trey,
>>>
>>> do you see any other error messages in the logs of your Elasticsearch node(s)? Graylog won't start writing into the Elasticsearch cluster again until the health of index "graylog2_6" (which is probably the current deflector target) is GREEN again.
>>>
>>> Cheers,
>>> Jochen
>>>
>>> On Wednesday, 13 January 2016 19:08:41 UTC+1, Trey Dockendorf wrote:
>>>
>>>> Our data center recently had a catastrophic power loss. Once everything was back up, Graylog refuses to start [1] and the issue seems to be corrupt Elasticsearch indexes [2]. I've attempted rerouting the indexes but that has not worked. I fear my only option is to delete the corrupt indexes, but I'm unsure what kind of impact that will have on Graylog.
>>>>
>>>> Any advice is greatly appreciated.
>>>>
>>>> Thanks,
>>>> - Trey
>>>>
>>>> [1]:
>>>> 2016-01-13T12:01:24.886-06:00 WARN [BlockingBatchedESOutput] Error while waiting for healthy Elasticsearch cluster. Not flushing.
>>>> java.util.concurrent.TimeoutException: Elasticsearch cluster didn't get healthy within timeout
>>>>     at org.graylog2.indexer.cluster.Cluster.waitForConnectedAndHealthy(Cluster.java:174)
>>>>     at org.graylog2.indexer.cluster.Cluster.waitForConnectedAndHealthy(Cluster.java:179)
>>>>     at org.graylog2.outputs.BlockingBatchedESOutput.flush(BlockingBatchedESOutput.java:112)
>>>>     at org.graylog2.outputs.BlockingBatchedESOutput.write(BlockingBatchedESOutput.java:105)
>>>>     at org.graylog2.buffers.processors.OutputBufferProcessor$1.run(OutputBufferProcessor.java:189)
>>>>     at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>     at java.lang.Thread.run(Thread.java:745)
>>>> 2016-01-13T12:01:51.526-06:00 INFO [IndexRetentionThread] Elasticsearch cluster not available, skipping index retention checks.
>>>>
>>>> [2]:
>>>> # curl -XGET http://localhost:9200/_cat/shards
>>>> graylog2_6 0 p STARTED 2700869 398.6mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_6 3 p STARTED 2694087 397mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_6 1 p UNASSIGNED
>>>> graylog2_6 2 p UNASSIGNED
>>>> graylog2_4 0 p STARTED 5040142 796.3mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_4 3 p STARTED 5003263 789.1mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_4 1 p STARTED 5001091 788.6mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_4 2 p STARTED 4958409 783.6mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_5 0 p STARTED 5019303 743.4mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_5 3 p STARTED 5001404 739.3mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_5 1 p STARTED 5000525 739.3mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_5 2 p STARTED 4979658 737.7mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_2 0 p STARTED 5023249 985.4mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_2 3 p STARTED 4999096 979.1mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_2 1 p STARTED 5001476 980.1mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_2 2 p STARTED 4976833 973.5mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_3 0 p STARTED 5000546 1gb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_3 3 p STARTED 4998766 1gb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_3 1 p STARTED 4999378 1gb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_3 2 p STARTED 5001326 1gb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_0 0 p STARTED 3686796 819.8mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_0 3 p STARTED 3655173 812.8mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_0 1 p STARTED 3655623 813mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_0 2 p STARTED 3625428 805.7mb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_1 0 p STARTED 5053805 1gb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_1 3 p STARTED 5000588 1gb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_1 1 p STARTED 5001749 1gb 192.168.200.93 gl-es01-esgl2
>>>> graylog2_1 2 p STARTED 4943861 1gb 192.168.200.93 gl-es01-esgl2
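A quick way to pick out only the broken shards from `_cat/shards` output like the listing above: a STARTED row has eight columns, while an UNASSIGNED row stops after the state column, so filtering on column 4 is enough. Against a live cluster this would be `curl -s 'localhost:9200/_cat/shards' | awk '$4 == "UNASSIGNED" {print $1, $2}'`; the sketch below uses a few sample rows from the listing so it runs stand-alone.

```shell
# Filter _cat/shards rows down to (index, shard) pairs that are UNASSIGNED.
awk '$4 == "UNASSIGNED" {print $1, $2}' <<'EOF'
graylog2_6 0 p STARTED 2700869 398.6mb 192.168.200.93 gl-es01-esgl2
graylog2_6 3 p STARTED 2694087 397mb 192.168.200.93 gl-es01-esgl2
graylog2_6 1 p UNASSIGNED
graylog2_6 2 p UNASSIGNED
EOF
# prints:
# graylog2_6 1
# graylog2_6 2
```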
--
You received this message because you are subscribed to the Google Groups "Graylog Users" group. To view this discussion on the web visit https://groups.google.com/d/msgid/graylog2/0e358e7e-7a39-401f-bc5e-dc826677919f%40googlegroups.com.
