I've made some progress on this. I've added a timeout to the config as described here: http://www.markhneedham.com/blog/2013/10/17/neo4j-setting-query-timeout/ . This seems to keep the DB responsive even after some queries have timed out. (Previously I'd added the 'max-execution-time' header, but that didn't seem to help.) In theory, none of my queries should be ultra-long-running in the first place (i.e. nothing touches the whole DB), but at least having the timeout seems to mitigate the problem.
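[Editor's note: for readers following along, the execution-guard settings from that blog post look roughly like the fragment below. This is a sketch for Neo4j 2.x; the guard was an unsupported feature at the time and the exact property names may differ between versions.]

```properties
# conf/neo4j.properties -- enable the (unsupported) execution guard
execution_guard_enabled=true

# conf/neo4j-server.properties -- abort REST queries running longer than 20s
org.neo4j.server.webserver.limit.executiontime=20000
```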
Unfortunately I don't think I could easily replicate what our code is doing in a simple script; it's pretty complex in terms of what goes on in each transaction (lots of reads and writes).

Thanks,
Ryan

On Wednesday, November 12, 2014 11:45:47 AM UTC+11, Michael Hunger wrote:
> Those locks _should_ resolve themselves. And it is not OK that they don't.
>
> Could you write a script that reproduces the scenario? That would be very helpful for us to fix any underlying issue.
>
> On Tue, Nov 11, 2014 at 11:49 PM, Ryan Sattler <[email protected]> wrote:
>> If we delete the data only, the indexes are still defined (e.g. they show up with :schema), so data coming from other sources will still be indexed correctly. If we delete graph.db, the indexes are gone and have to be re-specified somehow, or future reads will have bad performance. We have an endpoint to do this, but the overall complexity of stop-server, delete-files, restart-server, rebuild-indexes is relatively high. What we really want is just a "truncate database" query that simply deletes all data without acquiring locks, while keeping index definitions. This would be very useful for testing, compared to cobbling together some relatively complex delete script.
>>
>> If your own internal testing of Neo does a lot of datafile deletes, I suspect you might be overlooking a crippling performance issue that can occur if one doesn't do this; I will make a separate post about that.
>>
>> At any rate, thanks for the advice, and we will have a think to see if we can do larger transactions. Despite all this I'm still concerned about the behaviour of Neo in these cases: it just doesn't seem right that these locks get stuck and don't resolve themselves in a reasonable amount of time, even with transaction timeouts of a few seconds?
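[Editor's note: Neo4j 2.x has no built-in "truncate", but the effect described above (delete all data, keep schema indexes and constraints) can be sketched as a batched Cypher delete, run repeatedly until it returns 0. The batch size here is illustrative and should be tuned to the transaction memory available.]

```cypher
// Delete a batch of nodes together with their relationships;
// schema (indexes/constraints) is untouched because only data is deleted.
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
WITH n, r LIMIT 50000
DELETE r, n
RETURN count(*);
```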
>>
>> --
>> Ryan
>>
>> On Tuesday, November 11, 2014 5:38:45 PM UTC+11, Michael Hunger wrote:
>>> I think the queries timing out are just an indication of all processing threads being blocked on locks, and adding delete queries on top of that doesn't make it better.
>>>
>>> But if you delete the data anyway, the indexes will be emptied as well?
>>>
>>> Addressing the root cause of not getting locks is the solution here:
>>> - larger-grained, grouped transactions
>>> - queuing and grouping/aggregating of requests (you can easily have 30k-50k ops in one request with your memory setup)
>>>
>>> Michael
>>>
>>> On Tue, Nov 11, 2014 at 3:10 AM, Ryan Sattler <[email protected]> wrote:
>>>> Yes, it's faster to do that, but that also deletes indexes, so it's more convenient to just delete the contents. At any rate, it seems like there is a problem if, once a few queries time out, it becomes seemingly permanently impossible to do a delete request: who knows what other queries will be broken or slow too? I.e., my concern is not so much with the delete query specifically as that Neo is getting into a bad state and not recovering, which could be bad if it happened in production.
>>>>
>>>> Thanks for the suggestion on deleting relationships first. We'll also keep the number of threads at or below the number of cores. We've reduced deadlocks by sorting some of the nodes being written, but we also need to handle retries for constraint violations (due to two transactions writing the same unique node at the same time).
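[Editor's note: the constraint-violation collisions described above can often be reduced by using MERGE on the uniquely constrained property instead of CREATE, so that concurrent writers converge on the same node rather than racing to create it. The label and property names below are hypothetical; under a unique constraint, MERGE in Neo4j 2.x still takes locks and can deadlock, so retries may still be needed, just less often.]

```cypher
// Converge on an existing node instead of failing with a constraint violation
MERGE (u:User {email: '[email protected]'})
ON CREATE SET u.created = timestamp()
RETURN u;
```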
>>>>
>>>> We are actually already doing a certain amount of grouping, for correctness reasons rather than performance, but I don't think we can do what's suggested in that blog post: for correctness we need to write whole trees of many types of nodes, with possible relationships-to/reuse-of existing nodes, as a single transaction, and it's hard to do two different trees in one transaction.
>>>>
>>>> On Tuesday, November 11, 2014 11:41:59 AM UTC+11, Michael Hunger wrote:
>>>>> Hi Ryan,
>>>>>
>>>>> So are you deleting the whole DB?
>>>>>
>>>>> Wouldn't it be faster to shut down the server, clean out the graph.db directory, and restart? Also, the IDs of the deleted elements are not reused until the next restart.
>>>>>
>>>>> Something I could imagine would help for your delete operation is to separate it: delete the relationships first, and then the nodes.
>>>>>
>>>>> start r=rel(*) with r limit 100000 delete r return count(*);
>>>>>
>>>>> match (n) with n limit 100000 delete n return count(*);
>>>>>
>>>>> In general you might want to think about queuing your requests, tagging them by "important node", and trying to group them by those tags. Max wrote a bit about that: http://maxdemarzi.com/2013/09/05/scaling-writes/
>>>>>
>>>>> Reducing the number of threads to the number of cores you have should also help, so they aren't all used up while not progressing. Some of your writes might fail in deadlocks and would have to be retried.
>>>>>
>>>>> HTH,
>>>>> Michael
>>>>>
>>>>> On Tue, Nov 11, 2014 at 1:27 AM, Ryan Sattler <[email protected]> wrote:
>>>>>> Here is the thread dump for a situation where isolated delete queries have started timing out: http://pastebin.com/u3Ecsp8i
>>>>>>
>>>>>> We do share some nodes quite heavily (i.e. they'll be involved in many transactions), so it could be a locking issue.
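[Editor's note: the `start r=rel(*)` form in Michael's first query above is legacy 1.x Cypher. In Neo4j 2.x the same rel-then-node batched delete can be sketched as below; run each statement repeatedly until it returns 0.]

```cypher
// Pass 1: delete relationships in batches (directed match, so each rel matches once)
MATCH ()-[r]->() WITH r LIMIT 100000 DELETE r RETURN count(*);

// Pass 2: once no relationships remain, delete the nodes in batches
MATCH (n) WITH n LIMIT 100000 DELETE n RETURN count(*);
```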
>>>>>> I don't think we can really do all of those in one transaction, especially since when we're writing we don't know what already exists in the DB.
>>>>>>
>>>>>> On Tuesday, November 11, 2014 10:36:43 AM UTC+11, Michael Hunger wrote:
>>>>>>> I would have bet on GC, but perhaps it was locking: threads waiting for each other to either finish writing the commit logs (which is synchronized), or waiting for other threads to release locks on nodes that are updated concurrently. Is there any means for you to group the nodes that are updated somehow?
>>>>>>>
>>>>>>> The thread dump (with jstack <pid>) would show blocked threads; it would be _very_ valuable.
>>>>>>>
>>>>>>> 200 writes per tx sounds OK. If there was one write per tx I'd be more worried. In 2.2 we'll add internal tx write batching.
>>>>>>>
>>>>>>> Cheers, Michael
>>>>>>>
>>>>>>> On Mon, Nov 10, 2014 at 11:24 PM, Ryan Sattler <[email protected]> wrote:
>>>>>>>> I will try that the next time that happens, but one time when I was able to cancel it normally (running as ./neo4j console), it took a while to stop and then had the following errors:
>>>>>>>>
>>>>>>>> 16:36:01.151 [Thread-192] WARN o.e.j.util.thread.QueuedThreadPool - qtp1198728399{STOPPING,8<=62<=80,i=0,q=0} Couldn't stop Thread[qtp1198728399-7727,5,main]
>>>>>>>> 16:36:01.151 [Thread-192] WARN o.e.j.util.thread.QueuedThreadPool - qtp1198728399{STOPPING,8<=62<=80,i=0,q=0} Couldn't stop Thread[qtp1198728399-9861,5,main]
>>>>>>>> 16:36:01.151 [Thread-192] WARN o.e.j.util.thread.QueuedThreadPool - qtp1198728399{STOPPING,8<=62<=80,i=0,q=0} Couldn't stop Thread[qtp1198728399-10008,5,main]
>>>>>>>> [dozens of near-identical "Couldn't stop Thread[qtp1198728399-NNNNN,5,main]" warnings elided]
>>>>>>>> 16:36:01.165 [Thread-192] WARN o.e.j.util.thread.QueuedThreadPool - qtp1198728399{STOPPING,8<=62<=80,i=0,q=0} Couldn't stop Thread[qtp1198728399-10885,5,main]
>>>>>>>>
>>>>>>>> It also gave the "Detected incorrectly shut down database, performing recovery.." error on restart, even though it was shut down normally (by Ctrl-C on the console process).
>>>>>>>>
>>>>>>>> I also realized that I forgot to include context for my question, so here we go:
>>>>>>>>
>>>>>>>> * Neo4j 2.1.5
>>>>>>>> * Cypher writes via REST
>>>>>>>> * Up to 200 writes per transaction
>>>>>>>> * A few hundred thousand total nodes
>>>>>>>> * 2GB RAM allocated with the CMS collector
>>>>>>>> * Doesn't seem to be having garbage-collection problems, according to New Relic
>>>>>>>>
>>>>>>>> On Tuesday, November 11, 2014 3:27:43 AM UTC+11, Mark Needham wrote:
>>>>>>>>> What's the state of the system if you take a thread dump when it's in unresponsive mode?
>>>>>>>>>
>>>>>>>>> On 10 November 2014 05:13, Ryan Sattler <[email protected]> wrote:
>>>>>>>>>> (I'm posting this question here since it seems a bit too non-specific for Stack Overflow.)
>>>>>>>>>>
>>>>>>>>>> I've been working on a Neo4j system that needs to ingest a lot of data. We've been testing the data load with many writing threads. Generally this works fine.
>>>>>>>>>> For example, writing with 5 threads completes normally. Writing with 10 threads, however, works fine (in fact, at ~2x speed) for a while, but then suddenly every transaction starts timing out, even though CPU usage was generally less than 50%.
>>>>>>>>>>
>>>>>>>>>> Also, it doesn't fully recover for a long time, if ever. Even minutes after the load has been completely turned off, Neo still remains unresponsive to certain kinds of queries. For example, I have a script that runs "MATCH (a) WITH a LIMIT 10000 OPTIONAL MATCH (a)-[r]-() DELETE a,r RETURN COUNT(*)" repeatedly, which normally works fine when run by itself, but in this case it keeps timing out even when there are no other queries. I have to kill and restart the process. Needless to say, this behaviour is less than ideal. I tried adding a "max-execution-time" header of 5 seconds, but this didn't seem to help much, if at all.
>>>>>>>>>>
>>>>>>>>>> In other tests I've seen weird spikes of bad performance 30 minutes to a few hours after one of these meltdowns, in a kind of "echo", even when the load has been greatly reduced in the meantime (and when that load would normally sustain good performance indefinitely).
>>>>>>>>>>
>>>>>>>>>> Any idea what's going on?
>>>>>>>>>>
>>>>>>>>>> TL;DR: Neo fails abruptly under heavy load and seems slow to recover even when the load is removed.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ryan Sattler
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> You received this message because you are subscribed to the Google Groups "Neo4j" group.
>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
