Basu,

   Disclaimer: I don’t work for MarkLogic, although once upon a time I did.
I’m now an independent consultant and have encountered this sort of thing
many times, but I don’t speak for MarkLogic.  If this is impacting your
operations and you have a support contract, you should contact MarkLogic
support if you can’t resolve the problem yourself in a timely manner.

   Since it looks like there’s evidence of lock contention, you should dig
deeper into that.  On the performance history graphs, you can change the
displayed resolution.  The graph you included shows sustained lock activity
for two three-hour periods, from 6:00-9:00 and from 11:00-14:00.  Your first
task is to find out what was happening then.

   Next, drill into those times.  Your graph has a granularity of one
hour.  Set the timespan to cover only 6:00-9:00, with a granularity of one
minute or so.  That will help you determine how spiky the load really is.
That is, is the lock storm continuous, or is it oscillating?

   So a little context on locking, and locking in clusters.  There are
two kinds of requests in MarkLogic: queries and update requests.  Queries
are requests that can’t make updates (read-only); update requests might
make updates (whether or not they actually do).  The type of a request is
determined before it starts running.  If there is any code path from the
main module to a builtin function that makes updates then the request runs
as an update.  Otherwise it’s a query.
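
   For example, here’s a trivial sketch (the URI and element are made up):
this request runs as an update even on runs where the else branch is never
taken, simply because a code path to xdmp:document-insert exists:

    xquery version "1.0-ml";

    (: Runs as an update request: static analysis sees the
       xdmp:document-insert call, whether or not it executes. :)
    if (fn:doc-available("/config/settings.xml"))
    then fn:doc("/config/settings.xml")
    else xdmp:document-insert("/config/settings.xml", <settings/>)

Delete the else branch and the same request would run as a lock-free query.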

   Queries always run lock-free.  Because MarkLogic is an MVCC database, a
query sees a stable snapshot of the database as of the time it started, so
it doesn’t need to worry about its view changing.  Pure queries therefore
can’t be the source of lock contention.

   Updates need to acquire locks on any documents they touch, to ensure
that things don’t change while they’re running.  Reading a document
requires a read lock; updating a document requires a write lock (which
will block until existing read locks are released, and will block new read
locks while it is held).
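
   To make that concrete, here’s the classic read-then-write pattern (a
sketch, with a made-up counter document):

    xquery version "1.0-ml";

    (: This runs as an update.  fn:doc() takes a read lock on the
       document; xdmp:node-replace then needs a write lock on it.  Two
       copies of this racing on the same URI is the textbook deadlock. :)
    let $counter := fn:doc("/counters/hits.xml")/counter
    return
      xdmp:node-replace($counter, <counter>{xs:int($counter) + 1}</counter>)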

   That’s a fairly complex protocol that keeps multiple requests
from stepping on each other.  When there are multiple nodes in the cluster,
all this locking information must be sent to every other node as
well.  Then there’s deadlock detection.  When a deadlock is detected (and
deadlocks occur much more easily in a cluster), one of the contending
requests is killed at random (which releases its locks), rolled back and
restarted.  If this happens frequently, it can result in a lot of repeated
work.  There is
also some time that must be spent waiting to determine that a deadlock has
in fact occurred.

   OK, with all that out of the way, there are a few common scenarios that
can produce lock contention.

   The most common is probably poor update hygiene.  This is when you have
a request that looks for things that need to be updated, potentially
filtering through thousands (or millions) of documents.  If this is all in
one update request, you will be read-locking all those candidate documents
before making your update(s).  The right way to do this is to find the
update candidates in a query request, then use xdmp:invoke/eval to run a
short-lived update request which will grab the needed locks, update, and
then immediately release them.
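
   As a sketch (the module path and element name are hypothetical, and
cts:uris assumes the URI lexicon is enabled), the shape of it looks
something like this:

    xquery version "1.0-ml";

    (: The outer request stays a read-only query; each matching URI is
       handed to a separate, short-lived update transaction. :)
    for $uri in cts:uris((), (),
        cts:element-value-query(xs:QName("needs-update"), "true"))
    return
      xdmp:invoke(
        "/update-one.xqy",
        (xs:QName("uri"), $uri),
        <options xmlns="xdmp:eval">
          <isolation>different-transaction</isolation>
        </options>
      )

where /update-one.xqy is a tiny module that declares an external $uri
variable, grabs just the locks it needs (say, with a single
xdmp:node-replace or xdmp:document-add-collections call), and releases
them as soon as it commits.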

  Note: this kind of scenario may not manifest easily on your single-node
development system, but can fail horrendously on a cluster.

   Another case I’ve seen, which I did to myself recently, is having one or
a few status documents that need to be updated repeatedly.  Even a
relatively slow update rate can cause requests to bump into each other,
leading to the deadlock-breaking action described above.  Updating
requires a write lock, which blocks other requests trying to read or
update the document.  Once a small pileup starts, it just gets worse and
worse.  Depending on the request load, this can persist for hours (which
may apply in your case).
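
   One mitigation I’ve used for this hot-document pattern (my own habit,
not the only way) is to take the write lock up front with
xdmp:lock-for-update, so competing requests queue behind it instead of
read-locking and then deadlocking on the upgrade.  A sketch, with a made-up
status document:

    xquery version "1.0-ml";

    (: Taking the write lock first serializes the updaters on this URI. :)
    let $uri := "/status/job-42.xml"
    let $lock := xdmp:lock-for-update($uri)
    let $status := fn:doc($uri)/job-status
    return xdmp:node-replace($status, <job-status>running</job-status>)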

   Loading stuff with Content Pump should not cause locking problems.
There will not be any contention for the same documents as they’re being
created.  Are you running any additional code against loaded documents that
might be updating some other, common document?

   Another common scenario is the accidental “return the world” query.  For
example, say you have a service endpoint that typically accepts a list of
IDs.  Those IDs are used as a search predicate to find the (usually one or
a few) matching documents.  But a misbehaving service client fails to
provide the list of IDs, and your search returns millions of documents that
must be iterated over, in the context of an update request.  Yeah, that
unexpectedly happened to me...
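
   The cheap defence is to fail fast when the ID list is empty, rather than
letting the search fan out.  A sketch (the $ids parameter and the "id"
element are hypothetical, and cts:uris again assumes the URI lexicon):

    xquery version "1.0-ml";

    declare variable $ids as xs:string* external;

    if (fn:empty($ids))
    then fn:error((), "MISSING-IDS",
           "Refusing to run an update with an empty ID list")
    else
      (: Only reached when the client supplied at least one ID :)
      for $uri in cts:uris((), (),
          cts:element-value-query(xs:QName("id"), $ids))
      return xdmp:document-add-collections($uri, "flagged")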

   Lastly, if you can catch the system in this condition, and it’s
responsive, check the monitoring dashboard on
http://hostname:8002/dashboard/ to see if there are a few very long-running
queries, or zillions of copies of the same one clogging up the system.
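
   If the dashboard itself is too sluggish, a rougher alternative I
sometimes fall back on is pulling the request list in Query Console with
xdmp:server-status and eyeballing the oldest entries.  The app server name
below is made up, and I’m quoting the status element names from memory, so
check them against what your version actually returns:

    xquery version "1.0-ml";

    for $req in xdmp:server-status(xdmp:host(), xdmp:server("my-app-server"))
                  //*:request-status
    order by xs:dateTime($req/*:start-time) ascending
    return $req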

   Sorry for the long-winded response; hopefully this is helpful.

   Cheers.

----
Ron Hitchens [email protected], +44 7879 358212


On August 1, 2017 at 7:26:37 PM, Basavaraj Kalloli (
[email protected]) wrote:

Hi Ron,

I looked at the history, and one thing it clearly shows is that there are
quite a few lock holds, as you can see from the attachment. I am not really
sure when and how these get added. How do I know which process or XQuery is
actually getting these locks? I definitely don't see anything in the logs
(right now it's at Notice level). I remember reading in the MarkLogic guides
that maybe I should try Debug, but that would definitely log quite a lot in
production, which I want to avoid.

One thing that I have seen is that we do have a process which uses
MarkLogic's Content Pump for ingesting content - maybe that's holding
locks? Can XQueries running with

<options xmlns="xdmp:eval">
    <isolation>different-transaction</isolation>
    <prevent-deadlocks>true</prevent-deadlocks>
</options>

be holding quite a lot of locks? There are also quite a lot of XQueries
that update documents. I am not sure whether all of those updates could
block the reads, or whether the ones which run under a different
transaction as above are the cause of the problems. The one thing I have
always struggled with is how to dig out the XQuery (or XQueries) that are
really holding these locks. If you could share some more things I can try,
it would be really appreciated.

Thanks,
Basu

On Tue, Aug 1, 2017 at 3:02 PM, Ron Hitchens <[email protected]> wrote:

>
>    Take a look at the cluster performance history: hostname:8002/history
>
>    Mysterious slowdowns like this, especially in a cluster, are usually
> some form of resource contention.  Severe contention can cause other
> requests, even trivial ones, to run slowly or get stalled.  In a cluster,
> lock contention can manifest like this.  When there is more than one node
> in a cluster, locks need to be communicated to all nodes and conflicts
> resolved.  It’s actually quite easy to induce lock storms with some kinds
> of requests.
>
>    Another thing about clusters is that they can run along nicely until
> some limit is reached and then, because of the cluster communication
> overhead, performance can fall off a cliff.  Perhaps you’re bumping up
> against the size limits of the cluster.
>
>    Look at the performance history and see if there are spikes in lock
> contention, disk I/O, etc.  It may give you a clue to what’s going on.
> Ideally, the various graph lines should remain fairly flat, without radical
> spikes.
>
> ----
> Ron Hitchens [email protected], +44 7879 358212
>
> On August 1, 2017 at 1:49:55 PM, Basavaraj Kalloli (
> [email protected]) wrote:
>
> Hi All,
>
> We faced a strange issue today where the HTTP server on MarkLogic was
> responding really slowly. We use a health check to determine whether the
> server is up and running.
>
> We have an HAProxy in front of MarkLogic which load-balances requests to
> MarkLogic. The check HAProxy uses to verify that the server is healthy is:
>
> curl http://x.x.x.x:8300
>
> But the server was responding really slowly to this, so the load balancer
> assumed that the server was down. At the same time we could log in to the
> Admin UI and Query Console and run queries. We are not sure what caused
> this huge latency. The number of threads for the server is set to 32, we
> couldn't find anything in the logs, and the MarkLogic dashboard is not
> really helpful either.
>
> Can someone please share a few ideas we could try to investigate this a
> bit further? We have been facing similar issues, and the MarkLogic
> Dashboard is pretty useless; any debugging help would be appreciated.
>
> Thanks,
> Basu
_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general