Basu,

Disclaimer: I don’t work for MarkLogic, although once upon a time I did. I’m now an independent consultant and have encountered this sort of thing many times, but I don’t speak for MarkLogic. If this is impacting your operations and you have a support contract, you should contact MarkLogic support if you can’t resolve the problem yourself in a timely manner.
Since it looks like there’s evidence of lock contention, you should dig deeper into that. On the performance history graphs, you can change the displayed resolution. The graph you included shows sustained lock activity for two three-hour periods, from 6:00-9:00 and 11:00-14:00. Your first task is to find out what was happening then. Next, drill into those times. Your graph has a granularity of one hour. Set the timespan to cover only 6:00-9:00 and a granularity of one minute or so. That will help you determine how spiky the load really is, i.e. whether the lock storm is continuous or oscillating.

So, a little context on locking, and on locking in clusters. There are two kinds of requests in MarkLogic: queries and update requests. Queries are requests that can’t make updates (read-only); update requests might make updates (whether they actually do or not). The type of a request is determined before it starts running: if there is any code path from the main module to a builtin function that makes updates, then the request runs as an update. Otherwise it’s a query.

Queries always run lock-free. Because of the nature of MVCC databases, queries need not worry about their view of the database changing, so pure queries can’t be the source of lock contention. Updates need to acquire locks for any documents they touch, to ensure that things don’t change while they’re running. Reading a document requires a read lock; updating a document requires a write lock (which will block if read locks are in place, and will block new read locks while held). That’s a fairly complex protocol that keeps multiple requests from stepping on each other. When there are multiple nodes in the cluster, all this locking information must be sent to every other node as well.

Then there’s deadlock detection. When a deadlock is detected (which happens much more easily in a cluster), one of the contending requests is killed at random (which releases its locks), rolled back and restarted. If this happens frequently, it can result in a lot of repeated work. There is also some time that must be spent waiting to determine that a deadlock has in fact occurred.

OK, with all that out of the way, there are a few common scenarios that can produce lock contention.

The most common is probably poor update hygiene. This is when you have a request that looks for things that need to be updated, potentially filtering through thousands (or millions) of documents. If this is all in one update request, you will be read-locking all those candidate documents before making your update(s). The right way to do this is to find the update candidates in a query request, then use xdmp:invoke/eval to run a short-lived update request which will grab the needed locks, update, and then immediately release them. Note: this kind of scenario may not manifest easily on your single-node development system, but can fail horrendously on a cluster.
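As a concrete illustration, here is a minimal sketch of that query-then-eval pattern, using the same xdmp:eval options you mention below. The element names and the "pending"/"processed" values are hypothetical placeholders, and cts:uris assumes the URI lexicon is enabled. The point is that the outer request contains no update calls (so it runs as a lock-free query), while each xdmp:eval runs as its own short update transaction that write-locks one document and releases the lock as soon as it commits:

    xquery version "1.0-ml";

    (: Outer request: read-only, lock-free. Finds the candidate URIs. :)
    for $uri in cts:uris((), (),
        cts:element-value-query(xs:QName("status"), "pending"))
    return
      (: Inner request: a separate, short-lived update transaction per
         document, so the write lock is held only for an instant. :)
      xdmp:eval(
        'declare variable $uri external;
         xdmp:node-replace(fn:doc($uri)/record/status,
                           <status>processed</status>)',
        (xs:QName("uri"), $uri),
        <options xmlns="xdmp:eval">
          <isolation>different-transaction</isolation>
          <prevent-deadlocks>true</prevent-deadlocks>
        </options>
      )

Note that each inner transaction commits on its own, so the batch as a whole is not atomic; for this kind of housekeeping that trade-off is usually exactly what you want, because it keeps lock lifetimes short.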
Another case I’ve seen, which I did to myself recently, is having one or a few status documents that need to be updated repeatedly. Even a relatively slow update rate can cause requests to bump into each other, leading to the deadlock-breaking action described above. Updating will require a write lock, which will block other requests trying either to read or update the document. Once a small pileup starts, it just gets worse and worse. Depending on the request load, this can persist for hours (which may apply in your case).

Loading stuff with Content Pump should not cause locking problems; there will not be any contention for the same documents as they’re being created. Are you running any additional code against loaded documents that might be updating some other, common document?

Another common scenario is the accidental “return the world” query. For example, say you have a service endpoint that typically accepts a list of IDs. Those IDs are used as a search predicate to find the (usually one or a few) matching documents. But a misbehaving service client fails to provide the list of IDs and your search returns millions of documents that must be iterated over, in the context of an update request. Yeah, that unexpectedly happened to me...

Lastly, if you can catch the system in this condition, and it’s responsive, check the monitoring dashboard on http://hostname:8002/dashboard/ to see if there are a few very long-running queries, or zillions of the same ones clogging up the system.
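If the dashboard itself is slow to respond, you can also pull much the same information from Query Console. A rough sketch (the app server name is just a placeholder for whichever server is misbehaving, e.g. the one on port 8300):

    xquery version "1.0-ml";

    (: Returns the full status element for the given app server on this
       host, including the requests currently running on it, so you can
       spot long-running requests or a pileup of identical ones. :)
    xdmp:server-status(xdmp:host(), xdmp:server("my-app-server"))

xdmp:host-status(xdmp:host()) gives a similar host-wide view. As far as I know, neither gives you a direct map of which request holds which lock, but a request that has been sitting in an update transaction for a long time is the first place I’d look.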
Sorry for the long-winded response, hopefully this is helpful. Cheers.

----
Ron Hitchens   [email protected], +44 7879 358212

On August 1, 2017 at 7:26:37 PM, Basavaraj Kalloli ([email protected]) wrote:

Hi Ron,

I looked at the history and one thing clearly shows is that there are quite a few lock holds as you can see from the attachment. I am not really sure when and how these get added. How do I know which process or xquery is actually getting these locks? I definitely dont see anything in the logs (right now its at notice level) I remember reading on the Marklogic guides that maybe I should try debug but that will definitely will log quite a lot in production which I want to avoid.

One thing that I have seen is that we do have a process which uses Marklogic's content pump which is used for ingesting content - maybe thats holding locks?

Can xqueries running with <options xmlns="xdmp:eval"> <isolation>different-transaction</isolation> <prevent-deadlocks>true</prevent-deadlocks> </options>) be holding quite a lot of locks. There are also quite a lot of XQueries that update documents, I am not sure if all of those updates could block the reads or the ones which run under a different transaction as above are the cause of the problems?

The one thing that I have always struggled is how to dig out the xquery/s that are really holding these locks. If you could share some more things I can try will be really appreciated.

Thanks,
Basu

On Tue, Aug 1, 2017 at 3:02 PM, Ron Hitchens <[email protected]> wrote:
>
> Take a look at the cluster performance history: hostname:8002/history
>
> Mysterious slowdowns like this, especially in a cluster, are usually
> some form of resource contention. Severe contention can cause other
> requests, even trivial ones, to run slowly or get stalled. In a cluster,
> lock contention can manifest like this. When there is more than one node
> in a cluster, locks need to be communicated to all nodes and conflicts
> resolved. It’s actually quite easy to induce lock storms with some kinds
> of requests.
>
> Another thing about clusters is that they can run along nicely until
> some limit is reached and then, because of the cluster communication
> overhead, performance can fall off a cliff. Perhaps you’re bumping up
> against the size limits of the cluster.
>
> Look at the performance history and see if there are spikes in lock
> contention, disk I/O, etc. It may give you a clue to what’s going on.
> Ideally, the various graph lines should remain fairly flat, without radical
> spikes.
>
> ----
> Ron Hitchens   [email protected], +44 7879 358212
>
> On August 1, 2017 at 1:49:55 PM, Basavaraj Kalloli (
> [email protected]) wrote:
>
> Hi All,
>
> We faced a strange issue today when the http server on Marklogic was
> responding really slowly. We use the check to determine whether the server
> is up and running.
>
> We have a HA proxy infront of marklogic which load balances requests to
> Marklogic. The check in HA proxy to verify if the server is healthy is:
>
> curl http://x.x.x.x:8300
>
> But the server was responding really slowly on this, because of this the
> Load Balancer assumed that the server was down. But at this time we could
> login to Admin and QConsole and do queries. We are not sure what caused
> this huge latency. The number of threads for the server is set to 32, we
> couldn't find anything in the logs and the Marklogic dashboard is not
> really helpful either.
>
> Can someone please share a few ideas we could try to investigate this a
> bit further? We have been facing similar issues and the Marklogic Dashboard
> is pretty useless, any debugging help would be appreciated.
>
> Thanks,
> Basu

_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
