[ https://issues.apache.org/jira/browse/GEODE-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15685048#comment-15685048 ]
Karen Smoler Miller commented on GEODE-2125:
--------------------------------------------
Along with this task, I plan to add a note to the documentation topic about
Starting Up and Shutting Down the System. The note will say something to the
effect of:
Do not use {{kill -9}} on system members. Doing so can put the system into a
state that you won't be able to get out of. This is especially true for small
systems (for example, one locator and one server).
If you feel you must use {{kill -9}} on a system member, use it on all members.
Kill the whole thing, not just a piece of it.
Instead, developers should use the appropriate gfsh command to stop a server.
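For example, a graceful stop from gfsh might look like the following. This is only a sketch: the locator address localhost[10334] is the one used in the report below, and the member name server1 is hypothetical. To stop every member at once, including locators, {{shutdown --include-locators=true}} can be used instead of stopping servers one by one.
{noformat}
gfsh>connect --locator=localhost[10334]
gfsh>stop server --name=server1
{noformat}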
> GFSH should provide information about Locators that go into reconnect mode
> --------------------------------------------------------------------------
>
> Key: GEODE-2125
> URL: https://issues.apache.org/jira/browse/GEODE-2125
> Project: Geode
> Issue Type: Improvement
> Components: management
> Affects Versions: 1.0.0-incubating
> Reporter: Kirk Lund
> Assignee: Kirk Lund
> Attachments: locator_failure-logs.txt, thread_dump.txt
>
>
> If the Locator is started from GFSH and the cluster's only server is killed,
> network partition detection will initiate forceDisconnect in the Locator and
> leave it in reconnect mode. To the User it will appear that the Locator
> crashed and GFSH lost connection:
> {noformat}
> gfsh>
> No longer connected to 192.168.1.72[1099].
> {noformat}
> During the time in which the Locator is in reconnect mode, the User cannot
> connect via GFSH, nor can they issue status or stop commands against it:
> {noformat}
> $ cd locator1
> $ cat vf.gf.locator.pid
> 33959
> $ ps 33959
> PID TT STAT TIME COMMAND
> 33959 s001 S 0:19.97 /Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Co
> {noformat}
> In GFSH:
> {noformat}
> gfsh>connect --locator=localhost[10334]
> Connecting to Locator at [host=localhost, port=10334] ..
> Connection refused
> gfsh>status locator --pid=33959
> null
> gfsh>status locator --dir=locator1
> null
> gfsh>stop locator --dir=locator1
> Locator in /Users/klund/dev/geode/locator1 on null is currently not responding.
> gfsh>stop locator --pid=33959
> Locator in /Users/klund/dev/geode on null is currently not responding.
> {noformat}
> If a Locator has GFSH connected, then it should notify GFSH that it is going
> to forceDisconnect and go into reconnect mode. GFSH can then notify the User
> so the User is not surprised.
> In addition, GFSH status and stop commands should be modified to be able to
> talk to a Locator in reconnect mode. GFSH start could also be modified to
> report that the Locator is running in reconnect mode instead of reporting a
> hung process in the Locator's directory.
> Attachments:
> * The Locator log file is attached as locator_failure-logs.txt
> * The Locator thread dump (via jstack) AFTER it has shut down due to
> forceDisconnect is attached as thread_dump.txt
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)