I have to amend my previous statement a bit: on the smaller front-end machine (SGI UV, 24 GB, 16 cores *2 for hyperthreading) the server runs fine at 320 threads but ***hangs at 32***. I've attached jstack dumps for the server on the big machine at 32 threads (jstack_out3) and in the hung state on the small machine at 32 threads (jstack_small_32).
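For anyone reproducing these dumps: only jstack itself is attested in this thread, so the pid lookup and file names below are assumptions, a minimal sketch rather than the exact commands used.

    # Capture a thread dump of the (possibly hung) Neo4j server JVM.
    # The pgrep pattern and the output name are assumptions for illustration.
    NEO4J_PID=$(pgrep -f 'org.neo4j.server' | head -n 1)
    jstack "$NEO4J_PID" > jstack_small_32   # dump every thread's stack
    gzip jstack_small_32                    # matches the attached .gz files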
On Thursday, March 27, 2014 10:39:43 AM UTC-4, Joel Welling wrote:
>
> Hi folks-
> First, thanks for your attention!
> Yes, the thread dumps were taken in the hung condition. There was
> definitely a request pending, at least as far as the 'curl' command was
> concerned.
> I've tried with 32 threads and seen the same result, but I'll repeat the
> experiment and provide the thread dump. I'll note that there is a 'front
> end' SGI UV with fewer cores which can see the same file system, and neo4j
> runs just fine on that machine from the same config files for which it
> hangs on the big machine.
> I wouldn't say we are optimizing for any use case at the moment, just
> trying to get things to run. But our use case would be low concurrency,
> large data volume. The goal is to provide graph analytics support for NSF
> researchers with large graph datasets, so the datasets would be large but
> would be accessed by only one group.
> And I'll see about setting up a temporary account to let you folks try
> stuff. It will have to be under a person's name, and we'll have to have a
> physical mail address and email address for that person. If possible,
> could you pick someone and email that info to me at welling at psc.edu?
>
> Thanks very much for your attention!
> -Joel
>
> On Wednesday, March 26, 2014 8:26:54 PM UTC-4, Jacob Hansson wrote:
>>
>> Joel: Thanks for the detailed info on this, super helpful, and I'm sorry
>> you're running into issues on this system (which, holy crap, is a huge
>> machine).
>>
>> Looking through the stack traces, the entire system seems idle, waiting
>> for requests from the network. Is this thread dump taken while the
>> system is hung? If yes, try lowering the number of threads even more, to
>> see if it has something to do with Jetty choking on the network
>> selection somehow (and I mean low as in give it 32 threads, to clearly
>> rule out choking in the network layer). If this thread dump is from when
>> the system is idle, could you send a thread dump from when it is hung
>> after issuing a request like you described?
>>
>> Thanks a lot!
>> Jake
>>
>> On Wed, Mar 26, 2014 at 10:57 PM, Joel Welling <[email protected]> wrote:
>>
>>> Hi Michael-
>>>
>>> > PS: Your machine is really impressive, I want to have one too :)
>>> > 2014-03-25 21:02:11.800+0000 INFO [o.n.k.i.DiagnosticsManager]: Total Physical memory: 15.62 TB
>>> > 2014-03-25 21:02:11.801+0000 INFO [o.n.k.i.DiagnosticsManager]: Free Physical memory: 12.71 TB
>>>
>>> Thank you! It's this machine:
>>> http://www.psc.edu/index.php/computing-resources/blacklight . For quite
>>> a while it was the world's largest shared-memory environment. If you
>>> want to try some timings, it could probably be arranged.
>>>
>>> Anyway, I made the mods to neo4j.properties which you suggested, and I'm
>>> afraid it still hangs in the same place. (I'll look into the disk
>>> scheduler issue, but I can't make a global change like that on short
>>> notice.) The new jstack and messages.log are attached.
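Jake's 32-thread experiment above amounts to a one-line change plus a restart. The sed edit and restart command below are a hedged sketch assuming the stock 2.0.x tarball layout, not commands taken from the thread.

    # Cap the Jetty pool at 32 threads to rule out network-layer choking.
    sed -i 's/^org.neo4j.server.webserver.maxthreads=.*/org.neo4j.server.webserver.maxthreads=32/' \
      conf/neo4j-server.properties
    bin/neo4j restart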
>>>
>>> On Tuesday, March 25, 2014 7:39:06 PM UTC-4, Michael Hunger wrote:
>>>
>>>> Joel,
>>>>
>>>> I looked at your logs. It seems there is a problem with the automatic
>>>> calculation of the MMIO sizes for the Neo4j store files.
>>>>
>>>> Could you uncomment the first lines in conf/neo4j.properties related
>>>> to memory mapping? Just the default values should be good to get
>>>> going.
>>>>
>>>> Otherwise it is 14 bytes per node, 33 bytes per relationship and 38
>>>> bytes per 4 properties per node or rel.
>>>>
>>>> It currently tries to map several terabytes of memory :) which is
>>>> definitely not ok!
>>>>
>>>> 2014-03-25 21:18:34.849+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.propertystore.db.strings] brickCount=0 brickSize=1516231424b mappedMem=1516231458816b (storeSize=128b)
>>>> 2014-03-25 21:18:34.960+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.propertystore.db.arrays] brickCount=0 brickSize=1718395776b mappedMem=1718395863040b (storeSize=128b)
>>>> 2014-03-25 21:18:35.117+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.propertystore.db] brickCount=0 brickSize=1783801801b mappedMem=1783801839616b (storeSize=41b)
>>>> 2014-03-25 21:18:35.192+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.relationshipstore.db] brickCount=0 brickSize=2147483646b mappedMem=2186031398912b (storeSize=33b)
>>>> 2014-03-25 21:18:35.350+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.nodestore.db.labels] brickCount=0 brickSize=0b mappedMem=0b (storeSize=68b)
>>>> 2014-03-25 21:18:35.525+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.nodestore.db] brickCount=0 brickSize=495500390b mappedMem=495500394496b (storeSize=14b)
>>>>
>>>> You should probably also switch your disk scheduler to deadline or
>>>> noop instead of the currently configured cfq.
>>>>
>>>> Please ping me if that helped.
>>>>
>>>> Cheers
>>>>
>>>> Michael
>>>>
>>>> PS: Your machine is really impressive, I want to have one too :)
>>>> 2014-03-25 21:02:11.800+0000 INFO [o.n.k.i.DiagnosticsManager]: Total Physical memory: 15.62 TB
>>>> 2014-03-25 21:02:11.801+0000 INFO [o.n.k.i.DiagnosticsManager]: Free Physical memory: 12.71 TB
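Concretely, Michael's two suggestions look roughly like this. The mapped_memory property names exist (commented out) in the stock 2.0.x conf/neo4j.properties; the block device name sda is an assumption, and both changes are sketches rather than commands confirmed by the thread.

    # 1. Uncomment the default memory-mapping settings so the server stops
    #    auto-sizing MMIO from the machine's terabytes of physical RAM.
    sed -i \
      -e 's/^#\(neostore\.nodestore\.db\.mapped_memory=\)/\1/' \
      -e 's/^#\(neostore\.relationshipstore\.db\.mapped_memory=\)/\1/' \
      -e 's/^#\(neostore\.propertystore\.db\.mapped_memory=\)/\1/' \
      -e 's/^#\(neostore\.propertystore\.db\.strings\.mapped_memory=\)/\1/' \
      -e 's/^#\(neostore\.propertystore\.db\.arrays\.mapped_memory=\)/\1/' \
      conf/neo4j.properties

    # 2. Check and switch the I/O scheduler (per block device, as root;
    #    the change does not persist across reboots).
    cat /sys/block/sda/queue/scheduler              # e.g. "noop deadline [cfq]"
    echo deadline > /sys/block/sda/queue/scheduler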
>>>>
>>>> On Tue, Mar 25, 2014 at 10:28 PM, Joel Welling <[email protected]> wrote:
>>>>
>>>>> Thank you very much for your extremely quick reply! The curl session
>>>>> with the X-Stream:true flag is below; as you can see, it still hangs.
>>>>> The graph database is actually empty. The actual response of the
>>>>> server to the curl message is at the end of the non-hung curl
>>>>> transcript above.
>>>>>
>>>>> The configuration for the server is exactly as in the community
>>>>> download, except for the following:
>>>>> In conf/neo4j-server.properties:
>>>>> org.neo4j.server.http.log.enabled=true
>>>>> org.neo4j.server.http.log.config=conf/neo4j-http-logging.xml
>>>>> org.neo4j.server.webserver.port=9494
>>>>> org.neo4j.server.webserver.https.port=9493
>>>>> org.neo4j.server.webserver.maxthreads=320
>>>>> In conf/neo4j-wrapper.conf:
>>>>> wrapper.java.additional=-XX:ParallelGCThreads=32
>>>>> wrapper.java.additional=-XX:ConcGCThreads=32
>>>>>
>>>>> I've attached the jstack thread dump and data/graph.db/messages.log
>>>>> files to this message. The hung curl session looks like:
>>>>> > curl --trace-ascii - -X POST -H X-Stream:true -H "Content-Type: application/json" -d '{"query":"start a= node(*) return a"}' http://localhost:9494/db/data/cypher
>>>>> == Info: About to connect() to localhost port 9494 (#0)
>>>>> == Info:   Trying 127.0.0.1... == Info: connected
>>>>> == Info: Connected to localhost (127.0.0.1) port 9494 (#0)
>>>>> => Send header, 237 bytes (0xed)
>>>>> 0000: POST /db/data/cypher HTTP/1.1
>>>>> 001f: User-Agent: curl/7.19.0 (x86_64-suse-linux-gnu) libcurl/7.19.0 O
>>>>> 005f: penSSL/0.9.8h zlib/1.2.7 libidn/1.10
>>>>> 0085: Host: localhost:9494
>>>>> 009b: Accept: */*
>>>>> 00a8: X-Stream:true
>>>>> 00b7: Content-Type: application/json
>>>>> 00d7: Content-Length: 37
>>>>> 00eb:
>>>>> => Send data, 37 bytes (0x25)
>>>>> 0000: {"query":"start a= node(*) return a"}
>>>>> ...and at this point it hangs...
>>>>>
>>>>> On Tuesday, March 25, 2014 3:19:36 PM UTC-4, Michael Hunger wrote:
>>>>>
>>>>>> Joel,
>>>>>>
>>>>>> can you add the X-Stream:true header?
>>>>>>
>>>>>> How many nodes do you have in your graph? If you return them all, it
>>>>>> is quite a large amount of data that's returned. Without the
>>>>>> streaming header, the server builds up the response in memory, and
>>>>>> that most probably causes GC pauses or it just blows up with an OOM.
>>>>>>
>>>>>> What is your memory config for your Neo4j server? Both in terms of
>>>>>> heap and mmio config?
>>>>>>
>>>>>> Any chance to share your data/graph.db/messages.log for some
>>>>>> diagnostics?
>>>>>>
>>>>>> A thread dump in the case when it hangs would also be super helpful,
>>>>>> either with jstack <pid> or kill -3 <pid> (in the second case the
>>>>>> stacks will end up in data/log/console.log).
>>>>>>
>>>>>> Thanks so much,
>>>>>>
>>>>>> Michael
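The kill -3 route Michael mentions works as below. Only the signal and the data/log/console.log destination come from his message; the pid-file path is an assumption about the stock 2.0.x tarball layout.

    # SIGQUIT makes the JVM print all thread stacks without killing it;
    # under the bundled service wrapper they land in data/log/console.log.
    kill -3 "$(cat data/neo4j-service.pid)"   # pid-file path is an assumption
    tail -n 100 data/log/console.log          # the dump is appended at the end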
>>>>>>
>>>>>> On Tue, Mar 25, 2014 at 8:04 PM, Joel Welling <[email protected]> wrote:
>>>>>>
>>>>>>> Hi folks;
>>>>>>> I am running neo4j on an SGI UV machine. It has a great many cores,
>>>>>>> but only a small subset (limited by the cpuset) are available to my
>>>>>>> neo4j server. If I run neo4j community-2.0.1 with a configuration
>>>>>>> which is out-of-the-box except for setting -XX:ParallelGCThreads=32
>>>>>>> and -XX:ConcGCThreads=32 in my neo4j-wrapper.conf, too many threads
>>>>>>> are allocated for the cores I actually have.
>>>>>>> I can prevent this by setting org.neo4j.server.webserver.maxthreads
>>>>>>> to some value, but the REST interface then hangs. For example, here
>>>>>>> is a curl command which works if maxthreads is not set but hangs if
>>>>>>> it is set, even with a relatively large value like 320 threads:
>>>>>>>
>>>>>>> > curl --trace-ascii - -X POST -H "Content-Type: application/json" -d '{"query":"start a= node(*) return a"}' http://localhost:9494/db/data/cypher
>>>>>>> == Info: About to connect() to localhost port 9494 (#0)
>>>>>>> == Info:   Trying 127.0.0.1... == Info: connected
>>>>>>> == Info: Connected to localhost (127.0.0.1) port 9494 (#0)
>>>>>>> => Send header, 213 bytes (0xd5)
>>>>>>> 0000: POST /db/data/cypher HTTP/1.1
>>>>>>> 001f: User-Agent: curl/7.21.3 (x86_64-unknown-linux-gnu) libcurl/7.21.
>>>>>>> 005f: 3 OpenSSL/0.9.8h zlib/1.2.7
>>>>>>> 007c: Host: localhost:9494
>>>>>>> 0092: Accept: */*
>>>>>>> 009f: Content-Type: application/json
>>>>>>> 00bf: Content-Length: 37
>>>>>>> 00d3:
>>>>>>> => Send data, 37 bytes (0x25)
>>>>>>> 0000: {"query":"start a= node(*) return a"}
>>>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< HANGS AT THIS POINT
>>>>>>> <= Recv header, 17 bytes (0x11)
>>>>>>> 0000: HTTP/1.1 200 OK
>>>>>>> <= Recv header, 47 bytes (0x2f)
>>>>>>> 0000: Content-Type: application/json; charset=UTF-8
>>>>>>> <= Recv header, 32 bytes (0x20)
>>>>>>> 0000: Access-Control-Allow-Origin: *
>>>>>>> <= Recv header, 20 bytes (0x14)
>>>>>>> 0000: Content-Length: 41
>>>>>>> <= Recv header, 32 bytes (0x20)
>>>>>>> 0000: Server: Jetty(9.0.5.v20130815)
>>>>>>> <= Recv header, 2 bytes (0x2)
>>>>>>> 0000:
>>>>>>> <= Recv data, 41 bytes (0x29)
>>>>>>> 0000: {. "columns" : [ "a" ],. "data" : [ ].}
>>>>>>> {
>>>>>>>   "columns" : [ "a" ],
>>>>>>>   "data" : [ ]
>>>>>>> }== Info: Connection #0 to host localhost left intact
>>>>>>> == Info: Closing connection #0
>>>>>>>
>>>>>>> If I were on a 32-core machine rather than a 2000-core machine,
>>>>>>> maxthreads=320 would be the default. Thus I'm guessing that something
>>>>>>> is competing for threads within that 320-thread pool, or else the
>>>>>>> server is internally calculating a ratio of threads per core and
>>>>>>> that ratio is yielding zero on my machine. Is there any way to work
>>>>>>> around this?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Joel Welling
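Joel's threads-per-core guess at the end is checkable from inside the cpuset. None of these commands appear in the thread; they are an assumed diagnostic sketch (the pgrep pattern in particular).

    # See how many CPUs the server JVM may actually use; this is what
    # Runtime.availableProcessors() and any threads-per-core heuristic see.
    nproc                                              # CPUs visible to this shell
    NEO4J_PID=$(pgrep -f 'org.neo4j.server' | head -n 1)
    grep Cpus_allowed_list "/proc/$NEO4J_PID/status"   # cpuset as the JVM sees it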
Attachments: jstack_out3.gz, jstack_small_32.gz
