An invalid epoch means the server received a message from an older (no
longer valid) logical time epoch in the cluster. Less cryptically, it
generally means there has been an election, and the new master (possibly
the same instance as the previous master) received a message that was
addressed to the old master.

That is, this is a protection mechanism that ensures messages meant for a
logically different master are not acted on by the current master. We
should probably not log these exceptions, since they are expected in
normal operation.
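
To make the check concrete, here is a minimal sketch of an epoch guard. It is
not the actual Neo4j code: EpochGuardSketch, StaleEpochException and
startNewEpoch are hypothetical names chosen for illustration, and only the
idea (compare the epoch carried by the request with the master's current
epoch, reject on mismatch) comes from the explanation above. The epoch values
in main are the two from your log.

```java
// Illustrative sketch only: these names are hypothetical and this is not the
// implementation behind MasterImpl.assertCorrectEpoch in the stack trace.
import java.util.concurrent.atomic.AtomicLong;

final class EpochGuardSketch {

    /** Thrown when a request carries an epoch other than the current one. */
    static final class StaleEpochException extends RuntimeException {
        StaleEpochException(long received, long current) {
            super("Invalid epoch " + received + ", correct epoch is " + current);
        }
    }

    private final AtomicLong currentEpoch;

    EpochGuardSketch(long initialEpoch) {
        this.currentEpoch = new AtomicLong(initialEpoch);
    }

    /** Called after an election; a new term gets a new epoch even if the same instance wins. */
    void startNewEpoch(long newEpoch) {
        currentEpoch.set(newEpoch);
    }

    /** Rejects any request that was addressed to a previous master term. */
    void assertCorrectEpoch(long requestEpoch) {
        long current = currentEpoch.get();
        if (requestEpoch != current) {
            throw new StaleEpochException(requestEpoch, current);
        }
    }

    public static void main(String[] args) {
        EpochGuardSketch master = new EpochGuardSketch(282870793924543L);
        master.assertCorrectEpoch(282870793924543L); // same epoch: accepted

        master.startNewEpoch(282870850589260L);      // an election happened
        try {
            master.assertCorrectEpoch(282870793924543L); // message for the old master
        } catch (StaleEpochException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is that a rejected message is a sign of a preceding
election, not a fault in itself.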

The co-occurrence of these exceptions and the long pauses you observed
indicates that something causes the master to stall long enough to trigger
a re-election. I would advise looking for GC monitor messages in Neo4j's
messages.log to see if there are large GC pauses.
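
If you want an independent check from inside the JBoss JVM, one common way to
notice stalls is a thread that sleeps for a fixed interval and measures how
late it wakes up. The sketch below assumes that approach; it is not Neo4j's
GC monitor, and PauseDetectorSketch, overshootMillis and watch are
hypothetical names. The 50 ms interval and 200 ms threshold are arbitrary
example values.

```java
// Illustrative sketch only: a sleep-overshoot pause detector, not Neo4j code.
import java.util.concurrent.TimeUnit;

final class PauseDetectorSketch {

    /** How much longer than requested the sleep took, in milliseconds (never negative). */
    static long overshootMillis(long requestedSleepMs, long elapsedNanos) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
        return Math.max(0L, elapsedMs - requestedSleepMs);
    }

    /** Sleeps repeatedly and reports every wake-up that is late by thresholdMs or more. */
    static void watch(long sleepMs, long thresholdMs, int iterations) throws InterruptedException {
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            Thread.sleep(sleepMs);
            long blockedMs = overshootMillis(sleepMs, System.nanoTime() - start);
            if (blockedMs >= thresholdMs) {
                System.out.println("Threads were blocked for roughly " + blockedMs + " ms");
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Example values: check every 50 ms, report stalls of 200 ms or more.
        watch(50, 200, 100);
    }
}
```

A late wake-up can have causes other than GC (for example swapping or CPU
starvation), so treat a report as a hint to go and read the GC logs rather
than as proof.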

Since you are running Neo4j in-process with another Java application, it
becomes very important to ensure that the application makes very little GC
noise; otherwise it may disturb Neo4j in this way.

Would you consider running two standalone Neo4j servers and talking to
Neo4j from JBoss via the Neo4j JDBC driver?

On Mar 27, 2014 1:20 AM, "Virat Gohil" <[email protected]> wrote:

> Hi,
>
> We are using Neo4j version 1.9.6 enterprise in a clustered formation, the
> following are some of the stats for the DB.
>
> ~30 Million nodes
> ~54 Million relationships.
>
> The cluster is formed by having one instance of neo4j embedded in a Jboss
> server and another instance running in stand alone mode exposing the REST
> interface. We also have an arbiter running to avoid the constant
> re-election issues.
>
> Every now and then (ranges from 5 hours to a day), we observe that the
> writes to the embedded server are either failing or takes a very long time
> to finish (>5 seconds).
>
> Whenever, we observe this behavior, the following is what gets printed in
> messages.log of the embedded server:
>
> at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
> 2014-03-27 00:11:15.625+0000 ERROR [o.n.k.h.c.m.MasterServer]: Could not
> finish off dead channel
> org.neo4j.kernel.ha.com.master.InvalidEpochException: Invalid epoch
> 282870793924543, correct epoch is 282870850589260
>         at
> org.neo4j.kernel.ha.com.master.MasterImpl.assertCorrectEpoch(MasterImpl.java:218)
> ~[neo4j-ha-1.9.6.jar:1.9.6]
>         at
> org.neo4j.kernel.ha.com.master.MasterImpl.finishTransaction(MasterImpl.java:419)
> ~[neo4j-ha-1.9.6.jar:1.9.6]
>         at
> org.neo4j.kernel.ha.com.master.MasterServer.finishOffChannel(MasterServer.java:69)
> ~[neo4j-ha-1.9.6.jar:1.9.6]
>         at org.neo4j.com.Server.tryToFinishOffChannel(Server.java:408)
> ~[neo4j-com-1.9.6.jar:1.9.6]
>         at org.neo4j.com.Server$4.run(Server.java:586)
> [neo4j-com-1.9.6.jar:1.9.6]
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> [na:1.7.0_51]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> [na:1.7.0_51]
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> [na:1.7.0_51]
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> [na:1.7.0_51]
>
> The problem goes away if we restart the standalone neo4j instance (not
> where the logs are being printed).
>
> We would appreciate if someone points out what could possibly cause the
> issue and how to avoid it.
>
> We ensured that both the servers have correct times and synchronized the
> times using ntp.
>
> Thanks,
>
> Virat
>
>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>

