Here's an interesting tidbit: on the small (16*2 core) machine, the server hangs at 32 or 33 threads but does not hang at 64.
On Thursday, March 27, 2014 11:14:49 AM UTC-4, Joel Welling wrote:

I have to amend my previous statement a bit- on the smaller front end machine (SGI UV, 24GB, 16 cores*2 for hyperthreading) the server runs fine at 320 threads but ***hangs at 32***. I've attached jstack dumps for the server on the big machine at 32 threads (jstack_out3) and in the hung state on the small machine at 32 threads (jstack_small_32).

On Thursday, March 27, 2014 10:39:43 AM UTC-4, Joel Welling wrote:

Hi folks-
First, thanks for your attention!

Yes, the thread dumps were taken in the hung condition. There was definitely a request pending, at least as far as the 'curl' command was concerned.

I've tried with 32 threads and seen the same result, but I'll repeat the experiment and provide the thread dump. I'll note that there is a 'front end' SGI UV with fewer cores which can see the same file system, and neo4j runs just fine on that machine from the same config files for which it hangs on the big machine.

I wouldn't say we are optimizing for any use case at the moment, just trying to get things to run. But our use case would be low concurrency, large data volume. The goal is to provide graph analytics support for NSF researchers with large graph datasets, so the datasets would be large but would be accessed by only one group.

And I'll see about setting up a temporary account to let you folks try stuff. It will have to be under a person's name, and we'll have to have a physical mail address and email address for that person. If possible, could you pick someone and email that info to me at welling at psc.edu ?

Thanks very much for your attention!
-Joel

On Wednesday, March 26, 2014 8:26:54 PM UTC-4, Jacob Hansson wrote:

Joel: Thanks for the detailed info on this, super helpful, and I'm sorry you're running into issues on this system (which, holy crap, that is a huge machine).

Looking through the stack traces, the entire system seems idle, waiting for requests from the network. Is this thread dump taken while the system is hung? If yes, try lowering the number of threads even more, to see if it has something to do with jetty choking on the network selection somehow (and I mean low as in give it 32 threads, to clearly rule out choking in the network layer). If this thread dump is from when the system is idle, could you send a thread dump from when it is hung after issuing a request like you described?

Thanks a lot!
Jake

On Wed, Mar 26, 2014 at 10:57 PM, Joel Welling <[email protected]> wrote:

Hi Michael-

> PS: Your machine is really impressive, I want to have one too :)
> 2014-03-25 21:02:11.800+0000 INFO [o.n.k.i.DiagnosticsManager]: Total Physical memory: 15.62 TB
> 2014-03-25 21:02:11.801+0000 INFO [o.n.k.i.DiagnosticsManager]: Free Physical memory: 12.71 TB

Thank you! It's this machine: http://www.psc.edu/index.php/computing-resources/blacklight . For quite a while it was the world's largest shared-memory environment. If you want to try some timings, it could probably be arranged.

Anyway, I made the mods to neo4j.properties which you suggested and I'm afraid it still hangs in the same place. (I'll look into the disk scheduler issue, but I can't make a global change like that on short notice). The new jstack and messages.log are attached.
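For anyone reproducing the experiment described above, here is a minimal sketch of one round of it, assuming a stock neo4j-community-2.0.1 layout (in that layout the maxthreads property lives in conf/neo4j-server.properties; the jps class name is only a guess, so simply pick the server's JVM):

# conf/neo4j-server.properties: cap the Jetty pool at 32 threads, as Jacob suggested
org.neo4j.server.webserver.maxthreads=32

# restart the server, then issue the request from the thread in another shell
bin/neo4j stop && bin/neo4j start
curl -X POST -H "Content-Type: application/json" \
     -d '{"query":"start a= node(*) return a"}' http://localhost:9494/db/data/cypher

# while the request hangs, grab a thread dump of the server JVM
jps -l                          # find the neo4j server pid (e.g. org.neo4j.server.Bootstrapper)
jstack <pid> > jstack_small_32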
On Tuesday, March 25, 2014 7:39:06 PM UTC-4, Michael Hunger wrote:

Joel,

I looked at your logs. It seems there is a problem with the automatic calculation of the MMIO for the neo4j store files.

Could you uncomment the first lines in the conf/neo4j.properties related to memory mapping? Just the default values should be good to get going.

Otherwise it is 14 bytes per node, 33 bytes per relationship and 38 bytes per 4 properties per node or rel.

It currently tries to map several terabytes of memory :) which is definitely not ok!

2014-03-25 21:18:34.849+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.propertystore.db.strings] brickCount=0 brickSize=1516231424b mappedMem=1516231458816b (storeSize=128b)
2014-03-25 21:18:34.960+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.propertystore.db.arrays] brickCount=0 brickSize=1718395776b mappedMem=1718395863040b (storeSize=128b)
2014-03-25 21:18:35.117+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.propertystore.db] brickCount=0 brickSize=1783801801b mappedMem=1783801839616b (storeSize=41b)
2014-03-25 21:18:35.192+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.relationshipstore.db] brickCount=0 brickSize=2147483646b mappedMem=2186031398912b (storeSize=33b)
2014-03-25 21:18:35.350+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.nodestore.db.labels] brickCount=0 brickSize=0b mappedMem=0b (storeSize=68b)
2014-03-25 21:18:35.525+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.nodestore.db] brickCount=0 brickSize=495500390b mappedMem=495500394496b (storeSize=14b)

You should probably also switch your disk scheduler to deadline or noop instead of the currently configured cfq.

Please ping me if that helped.

Cheers,
Michael

PS: Your machine is really impressive, I want to have one too :)
2014-03-25 21:02:11.800+0000 INFO [o.n.k.i.DiagnosticsManager]: Total Physical memory: 15.62 TB
2014-03-25 21:02:11.801+0000 INFO [o.n.k.i.DiagnosticsManager]: Free Physical memory: 12.71 TB

On Tue, Mar 25, 2014 at 10:28 PM, Joel Welling <[email protected]> wrote:

Thank you very much for your extremely quick reply! The curl session with the X-Stream:true flag is below; as you can see it still hangs. The graph database is actually empty. The actual response of the server to the curl message is at the end of the non-hung curl transcript above.

The configuration for the server is exactly as in the community download, except for the following:

In neo4j.properties:
org.neo4j.server.http.log.enabled=true
org.neo4j.server.http.log.config=conf/neo4j-http-logging.xml
org.neo4j.server.webserver.port=9494
org.neo4j.server.webserver.https.port=9493
org.neo4j.server.webserver.maxthreads=320

In neo4j-wrapper.conf:
wrapper.java.additional=-XX:ParallelGCThreads=32
wrapper.java.additional=-XX:ConcGCThreads=32

I've attached the jstack thread dump and data/graph.db/messages.log files to this message.
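For reference, the memory-mapping lines Michael is asking Joel to uncomment look roughly like this in a stock 2.0.x conf/neo4j.properties; the values shown are the shipped defaults as best I recall and may differ in your copy, and sdX in the scheduler commands is a placeholder for whatever device holds the store:

# conf/neo4j.properties: explicit mapped-memory sizes instead of the automatic calculation
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=50M
neostore.propertystore.db.mapped_memory=90M
neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M

# switch the I/O scheduler for one device to deadline (run as root)
echo deadline > /sys/block/sdX/queue/scheduler
cat /sys/block/sdX/queue/scheduler    # the active scheduler is shown in brackets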
The hung curl session looks like:

> curl --trace-ascii - -X POST -H X-Stream:true -H "Content-Type: application/json" -d '{"query":"start a= node(*) return a"}' http://localhost:9494/db/data/cypher
== Info: About to connect() to localhost port 9494 (#0)
== Info:   Trying 127.0.0.1...
== Info: connected
== Info: Connected to localhost (127.0.0.1) port 9494 (#0)
=> Send header, 237 bytes (0xed)
0000: POST /db/data/cypher HTTP/1.1
001f: User-Agent: curl/7.19.0 (x86_64-suse-linux-gnu) libcurl/7.19.0 O
005f: penSSL/0.9.8h zlib/1.2.7 libidn/1.10
0085: Host: localhost:9494
009b: Accept: */*
00a8: X-Stream:true
00b7: Content-Type: application/json
00d7: Content-Length: 37
00eb:
=> Send data, 37 bytes (0x25)
0000: {"query":"start a= node(*) return a"}
...and at this point it hangs...

On Tuesday, March 25, 2014 3:19:36 PM UTC-4, Michael Hunger wrote:

Joel,

can you add the X-Stream:true header?

How many nodes do you have in your graph? If you return them all, it is quite an amount of data that's returned. Without the streaming header, the Server builds up the response in memory and that most probably causes GC pauses or it just blows up with an OOM.

What is your memory config for your Neo4j Server? Both in terms of heap and mmio config?

Any chance to share your data/graph.db/messages.log for some diagnostics?

A thread dump in the case when it hangs would also be super helpful, either with jstack <pid> or kill -3 <pid> (in the second case they'll end up in data/log/console.log).

Thanks so much,

Michael
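As an aside on the streaming point: X-Stream:true makes the server write the JSON response incrementally rather than buffering it all, but bounding the query also keeps the result small regardless. A hedged variant of the request used in this thread, with LIMIT added purely for illustration:

> curl -X POST -H X-Stream:true -H "Content-Type: application/json" \
      -d '{"query":"start a=node(*) return a limit 25"}' \
      http://localhost:9494/db/data/cypher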
On Tue, Mar 25, 2014 at 8:04 PM, Joel Welling <[email protected]> wrote:

Hi folks;

I am running neo4j on an SGI UV machine. It has a great many cores, but only a small subset (limited by the cpuset) are available to my neo4j server. If I run neo4j community-2.0.1 with a configuration which is out-of-the-box except for setting -XX:ParallelGCThreads=32 and -XX:ConcGCThreads=32 in my neo4j-wrapper.conf, too many threads are allocated for the cores I actually have.

I can prevent this by setting server.webserver.maxthreads to some value, but the REST interface then hangs. For example, here is a curl command which works if maxthreads is not set but hangs if it is set, even with a relatively large value like 320 threads:

> curl --trace-ascii - -X POST -H "Content-Type: application/json" -d '{"query":"start a= node(*) return a"}' http://localhost:9494/db/data/cypher
== Info: About to connect() to localhost port 9494 (#0)
== Info:   Trying 127.0.0.1...
== Info: connected
== Info: Connected to localhost (127.0.0.1) port 9494 (#0)
=> Send header, 213 bytes (0xd5)
0000: POST /db/data/cypher HTTP/1.1
001f: User-Agent: curl/7.21.3 (x86_64-unknown-linux-gnu) libcurl/7.21.
005f: 3 OpenSSL/0.9.8h zlib/1.2.7
007c: Host: localhost:9494
0092: Accept: */*
009f: Content-Type: application/json
00bf: Content-Length: 37
00d3:
=> Send data, 37 bytes (0x25)
0000: {"query":"start a= node(*) return a"}
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< HANGS AT THIS POINT
<= Recv header, 17 bytes (0x11)
0000: HTTP/1.1 200 OK
<= Recv header, 47 bytes (0x2f)
0000: Content-Type: application/json; charset=UTF-8
<= Recv header, 32 bytes (0x20)
0000: Access-Control-Allow-Origin: *
<= Recv header, 20 bytes (0x14)
0000: Content-Length: 41
<= Recv header, 32 bytes (0x20)
0000: Server: Jetty(9.0.5.v20130815)
<= Recv header, 2 bytes (0x2)
0000:
<= Recv data, 41 bytes (0x29)
0000: {.  "columns" : [ "a" ],.  "data" : [ ].}
{
  "columns" : [ "a" ],
  "data" : [ ]
}== Info: Connection #0 to host localhost left intact
== Info: Closing connection #0

If I were on a 32-core machine rather than a 2000-core machine, maxthreads=320 would be the default. Thus I'm guessing that something is competing for threads within that 320-thread pool, or else the server is internally calculating a ratio of threads-per-core and that ratio is yielding zero on my machine. Is there any way to work around this?

Thanks,
-Joel Welling
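A rough way to probe the threads-per-core guess above with standard tools: compare the CPUs the cpuset actually grants the server process against the number of Jetty pool threads visible in the hung dump. The 'qtp' thread-name prefix is an assumption about how Jetty 9 names its QueuedThreadPool threads, and <neo4j-pid> is a placeholder for the server's pid:

# CPUs this cpuset exposes (run from the same cpuset as the server), and the server's own mask
nproc
grep -i cpus_allowed_list /proc/<neo4j-pid>/status

# how many Jetty pool threads actually exist in the hung dump
grep -c qtp jstack_small_32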
