[DISCUSS] Add health check API endpoint to Gremlin Server

Taylor Riggan Wed, 12 Jul 2023 09:28:21 -0700

A number of users have asked for a way to check the health of a Gremlin
Server before sending it queries:

https://stackoverflow.com/questions/46505790/gremlin-server-health-check-endpoint-for-aws-elb
https://stackoverflow.com/questions/59396980/gremlin-query-to-check-connection-health

Various TinkerPop implementers also have architectures that require using
multiple Gremlin Servers for high availability. While some of the Gremlin
Language Variants include the ability to define a Gremlin Server cluster
using multiple endpoints, not all do. Some architectures require the use
of a load balancer in front of a Gremlin Server cluster. In this
configuration, a load balancer needs to poll each of the backend Gremlin
Server instances to determine their health and whether or not it should
route requests to a given server.

Today, Gremlin Server has no dedicated way to report its health. The only
options are to treat the "no gremlin script supplied" error message as a
sign of life, or to send a simple Gremlin query such as g.inject(0).
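
For context, the query-based workaround can be sketched roughly as follows.
This is only an illustration: it assumes the server runs a channelizer that
accepts HTTP (for example HttpChannelizer) on localhost:8182, and the helper
name build_probe_request is made up for this example. The request is built
but not sent.

```python
import json
import urllib.request


def build_probe_request(host="localhost", port=8182):
    """Build (but do not send) an HTTP POST that runs a trivial traversal.

    Assumption: the server accepts Gremlin scripts over HTTP as a JSON body
    with a "gremlin" key. Host and port are placeholders.
    """
    body = json.dumps({"gremlin": "g.inject(0)"}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/gremlin",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_probe_request()
# A poller would send this with urllib.request.urlopen(req, timeout=...)
# and treat a successful response as "healthy".
```

Note that a probe like this has to go through query evaluation just to learn
that the server is up.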

I'm proposing that we add a /status API to Gremlin Server that reports the
health of a Gremlin Server instance. The /status API could also return
additional telemetry for that server, such as its uptime, TinkerPop
version, and configuration information.

Here is a proposed response structure, unless others have better ideas of
what might be included:

{
  "status": "healthy",
  "startTime": "Wed Jul 12 13:50:20 UTC 2023",
  "gremlin-server-version": "3.6.0",
  "settings": {
    "channelizer": "WebSocketChannelizer",
    "host": "localhost",
    "port": 8182
  }
}
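
To make the intended use concrete, here is a rough sketch of how a poller
might interpret such a response. Nothing here is an existing API: the
function name is_healthy and the rule "HTTP 200 plus status equal to
healthy" are assumptions made for illustration.

```python
import json

# The proposed response body, used here as sample input.
SAMPLE = '''{
  "status": "healthy",
  "startTime": "Wed Jul 12 13:50:20 UTC 2023",
  "gremlin-server-version": "3.6.0",
  "settings": {
    "channelizer": "WebSocketChannelizer",
    "host": "localhost",
    "port": 8182
  }
}'''


def is_healthy(http_status, body):
    """Return True only for a 200 response whose JSON body reports "healthy"."""
    if http_status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Body was not valid JSON, so the server is not answering as expected.
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"
```

A load balancer that only looks at the HTTP status code would ignore the
body entirely, so whether an unhealthy server should answer with a non-200
code is a design question worth settling as part of this discussion.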

This could be extended with additional parameters from any TinkerPop
implementation. As an example, projects like JanusGraph should be able to
add a JanusGraph version and/or additional configuration details of the
underlying storage layer being used with JanusGraph. A reference
implementation could ship with the project, providing a status response
for TinkerGraph that returns the various enabled features of the
underlying graph object. Given the length of such a response, we could
potentially parameterize the call to return a summary by default, with
additional details available via something like
`/status/?details=true`.
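
One way the summary/details split could work is sketched below. The
function names and the "details" key in the response are hypothetical, as is
the sample detail content; only the details=true query parameter comes from
the proposal.

```python
from urllib.parse import urlsplit, parse_qs


def wants_details(request_target):
    """True if the query string carries details=true (case-insensitive)."""
    query = parse_qs(urlsplit(request_target).query)
    return query.get("details", ["false"])[-1].lower() == "true"


def build_status(request_target, summary, details):
    """Return the summary, with provider details attached only on request."""
    response = dict(summary)
    if wants_details(request_target):
        response["details"] = details
    return response


summary = {"status": "healthy", "gremlin-server-version": "3.6.0"}
details = {"graph": "TinkerGraph"}  # placeholder for provider-specific data

short = build_status("/status", summary, details)
full = build_status("/status/?details=true", summary, details)
```

Keeping the default response small means a frequent health poll stays cheap,
while tooling that wants the full picture can opt in.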

Another feature that this could potentially expose is the new(ish) Service
Registry. Users could leverage the /status API to fetch a list of services
available to be used with the call() step.

There are a lot of options this could enable. Looking for consensus from
the community on whether we should implement such an API in this manner.

One additional item we will need to consider: today, any HTTP request is
accepted regardless of the server route being used. This means that
connections can be established to the root route ('/'), the default Gremlin
route ('/gremlin') or any arbitrary route ('/hokey-pokey'). This behavior
would need to be altered to allow for defined routes. We may be able to
support accepting query/connection requests on the root route to avoid any
breaking changes for applications that may have failed to use /gremlin.
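
As a rough sketch of what defined routes might look like, assuming the root
route is kept as an alias for /gremlin for backward compatibility. The route
table and the resolve_route name are illustrative only and do not reflect
how Gremlin Server is implemented.

```python
from urllib.parse import urlsplit

# Hypothetical route table: the root route stays an alias for /gremlin so
# that existing clients that never used /gremlin keep working.
ROUTES = {
    "/": "gremlin",
    "/gremlin": "gremlin",
    "/status": "status",
}


def resolve_route(request_target):
    """Map a request target to a handler name, or None for an unknown route."""
    path = urlsplit(request_target).path
    # Tolerate a trailing slash ("/status/") without touching the root route.
    if len(path) > 1:
        path = path.rstrip("/")
    return ROUTES.get(path or "/")
```

Under this scheme an arbitrary route such as '/hokey-pokey' would no longer
be accepted, which is the one behavior change existing applications could
notice.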

Looking forward to feedback.

Thank you,

Taylor Riggan
