Here's an interesting tidbit: on the small (16-core, x2 with hyperthreading) 
machine, the server hangs at 32 or 33 threads but does not hang at 64.

On Thursday, March 27, 2014 11:14:49 AM UTC-4, Joel Welling wrote:
>
> I have to amend my previous statement a bit: on the smaller front-end 
> machine (SGI UV, 24 GB, 16 cores x2 with hyperthreading) the server runs 
> fine at 320 threads but ***hangs at 32***. I've attached jstack dumps for 
> the server on the big machine at 32 threads (jstack_out3) and in the hung 
> state on the small machine at 32 threads (jstack_small_32).
>
> On Thursday, March 27, 2014 10:39:43 AM UTC-4, Joel Welling wrote:
>>
>> Hi folks-
>>   First, thanks for your attention!
>>   Yes, the thread dumps were taken in the hung condition. There was 
>> definitely a request pending, at least as far as the 'curl' command was 
>> concerned.
>>   I've tried with 32 threads and seen the same result, but I'll repeat 
>> the experiment and provide the thread dump.  I'll note that there is a 
>> 'front-end' SGI UV with fewer cores which can see the same file system, and 
>> neo4j runs just fine on that machine using the same config files with which 
>> it hangs on the big machine.
>>   I wouldn't say we are optimizing for any use case at the moment, just 
>> trying to get things to run.  But our use case would be low concurrency and 
>> large data volume.  The goal is to provide graph analytics support for NSF 
>> researchers with large graph datasets, so the datasets would be large but 
>> each would be accessed by only one group.
>>   And I'll see about setting up a temporary account to let you folks try 
>> stuff. It will have to be under a person's name, and we'll have to have a 
>> physical mail address and email address for that person.  If possible, 
>> could you pick someone and email that info to me at welling at psc.edu ? 
>>
>> Thanks very much for your attention!
>> -Joel
>>
>> On Wednesday, March 26, 2014 8:26:54 PM UTC-4, Jacob Hansson wrote:
>>>
>>> Joel: Thanks for the detailed info on this, super helpful, and I'm sorry 
>>> you're running into issues on this system (which, holy crap, that is a huge 
>>> machine). 
>>>
>>> Looking through the stack traces, the entire system seems idle, waiting 
>>> for requests from the network. Was this thread dump taken while the system 
>>> was hung? If so, try lowering the number of threads even more, to see 
>>> whether it has something to do with Jetty choking on the network selection 
>>> somehow (and I mean low as in give it 32 threads, to clearly rule out 
>>> choking in the network layer). If this thread dump is from when the system 
>>> is idle, could you send a thread dump from when it is hung after issuing a 
>>> request like you described?
>>>
>>> Thanks a lot!
>>> Jake
>>>
>>>
>>> On Wed, Mar 26, 2014 at 10:57 PM, Joel Welling <[email protected]> wrote:
>>>
>>>> Hi Michael-
>>>>
>>>>
>>>> >  PS: Your machine is really impressive, I want to have one too :)
>>>> >  2014-03-25 21:02:11.800+0000 INFO  [o.n.k.i.DiagnosticsManager]: 
>>>> Total Physical memory: 15.62 TB
>>>> >  2014-03-25 21:02:11.801+0000 INFO  [o.n.k.i.DiagnosticsManager]: 
>>>> Free Physical memory: 12.71 TB
>>>>
>>>> Thank you!  It's this machine: 
>>>> http://www.psc.edu/index.php/computing-resources/blacklight .  For 
>>>> quite a while it was the world's largest shared-memory environment.  If 
>>>> you 
>>>> want to try some timings, it could probably be arranged.
>>>>
>>>> Anyway, I made the modifications to neo4j.properties that you suggested, 
>>>> and I'm afraid it still hangs in the same place.  (I'll look into the 
>>>> disk scheduler issue, but I can't make a global change like that on short 
>>>> notice.)  The new jstack and messages.log are attached.
>>>>
>>>> On Tuesday, March 25, 2014 7:39:06 PM UTC-4, Michael Hunger wrote:
>>>>
>>>>> Joel,
>>>>>
>>>>> I looked at your logs. It seems there is a problem with the automatic 
>>>>> calculation of the memory-mapped I/O (MMIO) sizes for the Neo4j store 
>>>>> files.
>>>>>
>>>>> Could you uncomment the first lines in conf/neo4j.properties related to 
>>>>> memory mapping? Just the default values should be good enough to get 
>>>>> going.
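>>>>>
>>>>> For reference, the commented-out memory-mapping block near the top of 
>>>>> conf/neo4j.properties looks roughly like this in the 2.0.x community 
>>>>> download (these are the shipped defaults as I recall them; exact values 
>>>>> may differ slightly between versions):
>>>>>
>>>>> neostore.nodestore.db.mapped_memory=25M
>>>>> neostore.relationshipstore.db.mapped_memory=50M
>>>>> neostore.propertystore.db.mapped_memory=90M
>>>>> neostore.propertystore.db.strings.mapped_memory=130M
>>>>> neostore.propertystore.db.arrays.mapped_memory=130M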
>>>>>
>>>>> Otherwise (if you want to size the mappings yourself), figure 14 bytes 
>>>>> per node, 33 bytes per relationship, and 38 bytes per 4 properties per 
>>>>> node or relationship.
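>>>>> (For example, with those figures a graph of 100 million nodes and 500 
>>>>> million relationships would need roughly 100M * 14 B ≈ 1.4 GB mapped for 
>>>>> the node store and 500M * 33 B ≈ 16.5 GB for the relationship store.)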
>>>>>
>>>>> It currently tries to map several terabytes of memory :) which is 
>>>>> definitely not OK!
>>>>>
>>>>> 2014-03-25 21:18:34.849+0000 INFO  [o.n.k.i.n.s.StoreFactory]: 
>>>>> [data/graph.db/neostore.propertystore.db.strings] brickCount=0 
>>>>> brickSize=1516231424b mappedMem=1516231458816b (storeSize=128b)
>>>>>  2014-03-25 21:18:34.960+0000 INFO  [o.n.k.i.n.s.StoreFactory]: 
>>>>> [data/graph.db/neostore.propertystore.db.arrays] brickCount=0 
>>>>> brickSize=1718395776b mappedMem=1718395863040b (storeSize=128b)
>>>>> 2014-03-25 21:18:35.117+0000 INFO  [o.n.k.i.n.s.StoreFactory]: 
>>>>> [data/graph.db/neostore.propertystore.db] brickCount=0 
>>>>> brickSize=1783801801b mappedMem=1783801839616b (storeSize=41b)
>>>>> 2014-03-25 21:18:35.192+0000 INFO  [o.n.k.i.n.s.StoreFactory]: 
>>>>> [data/graph.db/neostore.relationshipstore.db] brickCount=0 
>>>>> brickSize=2147483646b mappedMem=2186031398912b (storeSize=33b)
>>>>> 2014-03-25 21:18:35.350+0000 INFO  [o.n.k.i.n.s.StoreFactory]: 
>>>>> [data/graph.db/neostore.nodestore.db.labels] brickCount=0 
>>>>> brickSize=0b mappedMem=0b (storeSize=68b)
>>>>> 2014-03-25 21:18:35.525+0000 INFO  [o.n.k.i.n.s.StoreFactory]: 
>>>>> [data/graph.db/neostore.nodestore.db] brickCount=0 
>>>>> brickSize=495500390b mappedMem=495500394496b (storeSize=14b)
>>>>>
>>>>> You should probably also switch your disk scheduler to deadline or 
>>>>> noop instead of the currently configured cfq.
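>>>>>
>>>>> For instance, to check and change the scheduler for a given block device 
>>>>> (sda below is just a placeholder for whichever device holds the store; 
>>>>> the change needs root and only lasts until reboot):
>>>>>
>>>>> cat /sys/block/sda/queue/scheduler
>>>>> echo deadline | sudo tee /sys/block/sda/queue/scheduler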
>>>>>
>>>>> Please ping me if that helped.
>>>>>
>>>>> Cheers
>>>>>
>>>>> Michael
>>>>>
>>>>> PS: Your machine is really impressive, I want to have one too :)
>>>>> 2014-03-25 21:02:11.800+0000 INFO  [o.n.k.i.DiagnosticsManager]: Total 
>>>>> Physical memory: 15.62 TB
>>>>> 2014-03-25 21:02:11.801+0000 INFO  [o.n.k.i.DiagnosticsManager]: Free 
>>>>> Physical memory: 12.71 TB
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 25, 2014 at 10:28 PM, Joel Welling <[email protected]> wrote:
>>>>>
>>>>>> Thank you very much for your extremely quick reply! The curl session 
>>>>>> with the X-Stream:true header is below; as you can see, it still hangs. 
>>>>>> The graph database is actually empty.  The actual response of the 
>>>>>> server to the curl message is at the end of the non-hung curl 
>>>>>> transcript above.
>>>>>>
>>>>>> The configuration for the server is exactly as in the community 
>>>>>> download, except for the following:
>>>>>> In conf/neo4j-server.properties:
>>>>>>  org.neo4j.server.http.log.enabled=true
>>>>>>  org.neo4j.server.http.log.config=conf/neo4j-http-logging.xml
>>>>>>  org.neo4j.server.webserver.port=9494
>>>>>>  org.neo4j.server.webserver.https.port=9493
>>>>>>  org.neo4j.server.webserver.maxthreads=320
>>>>>> In conf/neo4j-wrapper.conf:
>>>>>>  wrapper.java.additional=-XX:ParallelGCThreads=32
>>>>>>  wrapper.java.additional=-XX:ConcGCThreads=32
>>>>>>
>>>>>> I've attached the jstack thread dump and data/graph.db/messages.log 
>>>>>> files to this message.  The hung curl session looks like:
>>>>>>  > curl --trace-ascii - -X POST -H X-Stream:true -H "Content-Type: 
>>>>>> application/json" -d '{"query":"start a= node(*) return a"}' 
>>>>>> http://localhost:9494/db/data/cypher
>>>>>> == Info: About to connect() to localhost port 9494 (#0)
>>>>>> == Info:   Trying 127.0.0.1... == Info: connected
>>>>>> == Info: Connected to localhost (127.0.0.1) port 9494 (#0)
>>>>>> => Send header, 237 bytes (0xed)
>>>>>> 0000: POST /db/data/cypher HTTP/1.1
>>>>>> 001f: User-Agent: curl/7.19.0 (x86_64-suse-linux-gnu) libcurl/7.19.0 O
>>>>>> 005f: penSSL/0.9.8h zlib/1.2.7 libidn/1.10
>>>>>> 0085: Host: localhost:9494
>>>>>> 009b: Accept: */*
>>>>>> 00a8: X-Stream:true
>>>>>> 00b7: Content-Type: application/json
>>>>>> 00d7: Content-Length: 37
>>>>>> 00eb: 
>>>>>> => Send data, 37 bytes (0x25)
>>>>>> 0000: {"query":"start a= node(*) return a"}
>>>>>> ...and at this point it hangs...
>>>>>>
>>>>>>
>>>>>> On Tuesday, March 25, 2014 3:19:36 PM UTC-4, Michael Hunger wrote:
>>>>>>
>>>>>>> Joel,
>>>>>>>
>>>>>>> can you add the X-Stream:true header?
>>>>>>>
>>>>>>> How many nodes do you have in your graph? If you return them all, 
>>>>>>> quite a large amount of data comes back. Without the streaming header, 
>>>>>>> the server builds up the whole response in memory, which most probably 
>>>>>>> causes GC pauses or just blows up with an OOM.
>>>>>>>
>>>>>>> What is the memory config for your Neo4j server, both in terms of 
>>>>>>> heap and MMIO?
>>>>>>>
>>>>>>> Any chance to share your data/graph.db/messages.log for some 
>>>>>>> diagnostics?
>>>>>>>
>>>>>>> A thread dump taken while it hangs would also be super helpful, 
>>>>>>> either with jstack <pid> or kill -3 <pid> (in the second case it will 
>>>>>>> end up in data/log/console.log).
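>>>>>>>
>>>>>>> Something like the following works, assuming the JDK tools are on your 
>>>>>>> PATH (using jps to find the server's JVM pid is just one option):
>>>>>>>
>>>>>>> jps -l                          # note the pid of the Neo4j server JVM
>>>>>>> jstack <pid> > neo4j-threads.txt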
>>>>>>>
>>>>>>> Thanks so much,
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 25, 2014 at 8:04 PM, Joel Welling 
>>>>>>> <[email protected]>wrote:
>>>>>>>
>>>>>>>> Hi folks,
>>>>>>>>   I am running neo4j on an SGI UV machine.  It has a great many 
>>>>>>>> cores, but only a small subset (limited by the cpuset) is available 
>>>>>>>> to my neo4j server.  If I run neo4j community-2.0.1 with a 
>>>>>>>> configuration which is out-of-the-box except for setting 
>>>>>>>> -XX:ParallelGCThreads=32 and -XX:ConcGCThreads=32 in my 
>>>>>>>> neo4j-wrapper.conf, too many threads are allocated for the cores I 
>>>>>>>> actually have.
>>>>>>>>   I can prevent this by setting org.neo4j.server.webserver.maxthreads 
>>>>>>>> to some value, but the REST interface then hangs.  For example, here 
>>>>>>>> is a curl command which works if maxthreads is not set but hangs if 
>>>>>>>> it is set, even with a relatively large value like 320 threads:
>>>>>>>>
>>>>>>>>
>>>>>>>> > curl --trace-ascii - -X POST -H "Content-Type: application/json" 
>>>>>>>> -d '{"query":"start a= node(*) return a"}' 
>>>>>>>> http://localhost:9494/db/data/cypher
>>>>>>>> == Info: About to connect() to localhost port 9494 (#0)
>>>>>>>> == Info:   Trying 127.0.0.1... == Info: connected
>>>>>>>> == Info: Connected to localhost (127.0.0.1) port 9494 (#0)
>>>>>>>> => Send header, 213 bytes (0xd5)
>>>>>>>> 0000: POST /db/data/cypher HTTP/1.1
>>>>>>>> 001f: User-Agent: curl/7.21.3 (x86_64-unknown-linux-gnu) 
>>>>>>>> libcurl/7.21.
>>>>>>>> 005f: 3 OpenSSL/0.9.8h zlib/1.2.7
>>>>>>>> 007c: Host: localhost:9494
>>>>>>>> 0092: Accept: */*
>>>>>>>> 009f: Content-Type: application/json
>>>>>>>> 00bf: Content-Length: 37
>>>>>>>> 00d3: 
>>>>>>>> => Send data, 37 bytes (0x25)
>>>>>>>> 0000: {"query":"start a= node(*) return a"}   
>>>>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< HANGS AT THIS POINT 
>>>>>>>> <= Recv header, 17 bytes (0x11)
>>>>>>>> 0000: HTTP/1.1 200 OK
>>>>>>>> <= Recv header, 47 bytes (0x2f)
>>>>>>>> 0000: Content-Type: application/json; charset=UTF-8
>>>>>>>> <= Recv header, 32 bytes (0x20)
>>>>>>>> 0000: Access-Control-Allow-Origin: *
>>>>>>>> <= Recv header, 20 bytes (0x14)
>>>>>>>> 0000: Content-Length: 41
>>>>>>>> <= Recv header, 32 bytes (0x20)
>>>>>>>> 0000: Server: Jetty(9.0.5.v20130815)
>>>>>>>> <= Recv header, 2 bytes (0x2)
>>>>>>>> 0000: 
>>>>>>>> <= Recv data, 41 bytes (0x29)
>>>>>>>> 0000: {.  "columns" : [ "a" ],.  "data" : [ ].}
>>>>>>>> {
>>>>>>>>   "columns" : [ "a" ],
>>>>>>>>   "data" : [ ]
>>>>>>>> }== Info: Connection #0 to host localhost left intact
>>>>>>>> == Info: Closing connection #0
>>>>>>>>
>>>>>>>> If I were on a 32-core machine rather than a 2000-core machine, 
>>>>>>>> maxthreads=320 would be the default.  Thus I'm guessing that 
>>>>>>>> something is competing for threads within that 320-thread pool, or 
>>>>>>>> else the server is internally calculating a threads-per-core ratio 
>>>>>>>> that comes out as zero on my machine (320 threads spread over ~2000 
>>>>>>>> cores rounds down to 0 per core in integer arithmetic). Is there any 
>>>>>>>> way to work around this?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Joel Welling
>>>>>>>>
>>>>>>>
>>>>>
>>>
>>>
