Hello everyone,

We lost data after restarting our Elasticsearch cluster. Restarting is part of deploying our software stack.

We have a 20-node cluster running 0.90.2, and we have Splunk configured to index the ES logs.

Looking at the Splunk logs, we found the following *error a day before the deployment* (restart):

    [cluster.action.shard     ] [Rictor] sending failed shard for [c0a71ddaa70b463a9a179c36c7fc26e3][2], node[nJvnclczRNaLbETunjlcWw], [R], s[STARTED], reason [Failed to perform [bulk/shard] on replica, message [RemoteTransportException; nested: ResponseHandlerFailureTransportException; nested: NullPointerException; ]]

    [cluster.action.shard     ] [Kiss] received shard failed for [c0a71ddaa70b463a9a179c36c7fc26e3][2], node[nJvnclczRNaLbETunjlcWw], [R], s[STARTED], reason [Failed to perform [bulk/shard] on replica, message [RemoteTransportException; nested: ResponseHandlerFailureTransportException; nested: NullPointerException; ]]



Further, *a day after the deploy*, we saw the same error on another node:

    [cluster.action.shard     ] [Contrary] received shard failed for [a58f9413315048ecb0abea48f5f6aae7][1], node[3UbHwVCkQvO3XroIl-awPw], [R], s[STARTED], reason [Failed to perform [bulk/shard] on replica, message [RemoteTransportException; nested: ResponseHandlerFailureTransportException; nested: NullPointerException; ]]

           
*Immediately afterwards, the following error appears.* It is repeated on a couple of other nodes as well:
 
    failed to start shard

    [cluster.action.shard     ] [Copperhead] sending failed shard for [a58f9413315048ecb0abea48f5f6aae7][0], node[EuRzr3MLQiSS6lzTZJbiKw], [R], s[INITIALIZING], reason [Failed to start shard, message [RecoveryFailedException[[a58f9413315048ecb0abea48f5f6aae7][0]: Recovery failed from [Frank Castle][dlv2mPypQaOxLPQhHQ67Fw][inet[/10.2.136.81:9300]] into [Copperhead][EuRzr3MLQiSS6lzTZJbiKw][inet[/10.3.207.55:9300]]]; nested: RemoteTransportException[[Frank Castle][inet[/10.2.136.81:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[a58f9413315048ecb0abea48f5f6aae7][0] Phase[2] Execution failed]; nested: RemoteTransportException[[Copperhead][inet[/10.3.207.55:9300]][index/shard/recovery/translogOps]]; nested: InvalidAliasNameException[[a58f9413315048ecb0abea48f5f6aae7] *Invalid alias name [fbf1e55418a2327d308e7632911f9bb8bfed58059dd7f1e4abd3467c5f8519c3], Unknown alias name was passed to alias Filter]; ]]*
    
             
*During this time, we could not access previously indexed documents.*

I looked up the alias error, and it appears to be related to https://github.com/elasticsearch/elasticsearch/issues/1198 ("Delete By Query wrongly persisted to translog", #1198). That issue should have been fixed in ES 0.18.0, and we are running 0.90.2, so why is ES still hitting it?

What do we need to do to set this right and recover the lost data? Please help.

Thanks.

             


