I have to amend my previous statement a bit: on the smaller front-end machine (SGI UV, 24 GB, 16 cores *2 for hyperthreading) the server runs fine at 320 threads but ***hangs at 32***. I've attached jstack dumps for the server on the big machine at 32 threads (jstack_out3) and in the hung state on the small machine at 32 threads (jstack_small_32).
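For anyone reproducing these dumps: only jstack itself is attested in this thread, so the pid lookup and file names below are assumptions, a minimal sketch rather than the exact commands used.

    # Capture a thread dump of the (possibly hung) Neo4j server JVM.
    # The pgrep pattern and the output name are assumptions for illustration.
    NEO4J_PID=$(pgrep -f 'org.neo4j.server' | head -n 1)
    jstack "$NEO4J_PID" > jstack_small_32   # dump every thread's stack
    gzip jstack_small_32                    # matches the attached .gz files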
On Thursday, March 27, 2014 10:39:43 AM UTC-4, Joel Welling wrote:
>
> Hi folks-
> First, thanks for your attention!
> Yes, the thread dumps were taken in the hung condition. There was
> definitely a request pending, at least as far as the 'curl' command was
> concerned.
> I've tried with 32 threads and seen the same result, but I'll repeat the
> experiment and provide the thread dump. I'll note that there is a 'front
> end' SGI UV with fewer cores which can see the same file system, and neo4j
> runs just fine on that machine from the same config files for which it
> hangs on the big machine.
> I wouldn't say we are optimizing for any use case at the moment, just
> trying to get things to run. But our use case would be low concurrency,
> large data volume. The goal is to provide graph analytics support for NSF
> researchers with large graph datasets, so the datasets would be large but
> would be accessed by only one group.
> And I'll see about setting up a temporary account to let you folks try
> stuff. It will have to be under a person's name, and we'll have to have a
> physical mail address and email address for that person. If possible,
> could you pick someone and email that info to me at welling at psc.edu?
>
> Thanks very much for your attention!
> -Joel
>
> On Wednesday, March 26, 2014 8:26:54 PM UTC-4, Jacob Hansson wrote:
>>
>> Joel: Thanks for the detailed info on this, super helpful, and I'm sorry
>> you're running into issues on this system (which, holy crap, is a huge
>> machine).
>>
>> Looking through the stack traces, the entire system seems idle, waiting
>> for requests from the network. Is this thread dump taken while the
>> system is hung? If yes, try lowering the number of threads even more, to
>> see if it has something to do with Jetty choking on the network
>> selection somehow (and I mean low as in give it 32 threads, to clearly
>> rule out choking in the network layer). If this thread dump is from when
>> the system is idle, could you send a thread dump from when it is hung
>> after issuing a request like you described?
>>
>> Thanks a lot!
>> Jake
>>
>> On Wed, Mar 26, 2014 at 10:57 PM, Joel Welling <[email protected]> wrote:
>>
>>> Hi Michael-
>>>
>>> > PS: Your machine is really impressive, I want to have one too :)
>>> > 2014-03-25 21:02:11.800+0000 INFO [o.n.k.i.DiagnosticsManager]: Total Physical memory: 15.62 TB
>>> > 2014-03-25 21:02:11.801+0000 INFO [o.n.k.i.DiagnosticsManager]: Free Physical memory: 12.71 TB
>>>
>>> Thank you! It's this machine:
>>> http://www.psc.edu/index.php/computing-resources/blacklight . For quite
>>> a while it was the world's largest shared-memory environment. If you
>>> want to try some timings, it could probably be arranged.
>>>
>>> Anyway, I made the mods to neo4j.properties which you suggested, and I'm
>>> afraid it still hangs in the same place. (I'll look into the disk
>>> scheduler issue, but I can't make a global change like that on short
>>> notice.) The new jstack and messages.log are attached.
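Jake's 32-thread experiment above amounts to a one-line change plus a restart. The sed edit and restart command below are a hedged sketch assuming the stock 2.0.x tarball layout, not commands taken from the thread.

    # Cap the Jetty pool at 32 threads to rule out network-layer choking.
    sed -i 's/^org.neo4j.server.webserver.maxthreads=.*/org.neo4j.server.webserver.maxthreads=32/' \
      conf/neo4j-server.properties
    bin/neo4j restart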
>>>
>>> On Tuesday, March 25, 2014 7:39:06 PM UTC-4, Michael Hunger wrote:
>>>
>>>> Joel,
>>>>
>>>> I looked at your logs. It seems there is a problem with the automatic
>>>> calculation of the MMIO sizes for the Neo4j store files.
>>>>
>>>> Could you uncomment the first lines in conf/neo4j.properties related
>>>> to memory mapping? Just the default values should be good to get
>>>> going.
>>>>
>>>> Otherwise it is 14 bytes per node, 33 bytes per relationship and 38
>>>> bytes per 4 properties per node or rel.
>>>>
>>>> It currently tries to map several terabytes of memory :) which is
>>>> definitely not ok!
>>>>
>>>> 2014-03-25 21:18:34.849+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.propertystore.db.strings] brickCount=0 brickSize=1516231424b mappedMem=1516231458816b (storeSize=128b)
>>>> 2014-03-25 21:18:34.960+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.propertystore.db.arrays] brickCount=0 brickSize=1718395776b mappedMem=1718395863040b (storeSize=128b)
>>>> 2014-03-25 21:18:35.117+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.propertystore.db] brickCount=0 brickSize=1783801801b mappedMem=1783801839616b (storeSize=41b)
>>>> 2014-03-25 21:18:35.192+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.relationshipstore.db] brickCount=0 brickSize=2147483646b mappedMem=2186031398912b (storeSize=33b)
>>>> 2014-03-25 21:18:35.350+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.nodestore.db.labels] brickCount=0 brickSize=0b mappedMem=0b (storeSize=68b)
>>>> 2014-03-25 21:18:35.525+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.nodestore.db] brickCount=0 brickSize=495500390b mappedMem=495500394496b (storeSize=14b)
>>>>
>>>> You should probably also switch your disk scheduler to deadline or
>>>> noop instead of the currently configured cfq.
>>>>
>>>> Please ping me if that helped.
>>>>
>>>> Cheers
>>>>
>>>> Michael
>>>>
>>>> PS: Your machine is really impressive, I want to have one too :)
>>>> 2014-03-25 21:02:11.800+0000 INFO [o.n.k.i.DiagnosticsManager]: Total Physical memory: 15.62 TB
>>>> 2014-03-25 21:02:11.801+0000 INFO [o.n.k.i.DiagnosticsManager]: Free Physical memory: 12.71 TB
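Concretely, Michael's two suggestions look roughly like this. The mapped_memory property names exist (commented out) in the stock 2.0.x conf/neo4j.properties; the block device name sda is an assumption, and both changes are sketches rather than commands confirmed by the thread.

    # 1. Uncomment the default memory-mapping settings so the server stops
    #    auto-sizing MMIO from the machine's terabytes of physical RAM.
    sed -i \
      -e 's/^#\(neostore\.nodestore\.db\.mapped_memory=\)/\1/' \
      -e 's/^#\(neostore\.relationshipstore\.db\.mapped_memory=\)/\1/' \
      -e 's/^#\(neostore\.propertystore\.db\.mapped_memory=\)/\1/' \
      -e 's/^#\(neostore\.propertystore\.db\.strings\.mapped_memory=\)/\1/' \
      -e 's/^#\(neostore\.propertystore\.db\.arrays\.mapped_memory=\)/\1/' \
      conf/neo4j.properties

    # 2. Check and switch the I/O scheduler (per block device, as root;
    #    the change does not persist across reboots).
    cat /sys/block/sda/queue/scheduler              # e.g. "noop deadline [cfq]"
    echo deadline > /sys/block/sda/queue/scheduler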
>>>>
>>>> On Tue, Mar 25, 2014 at 10:28 PM, Joel Welling <[email protected]> wrote:
>>>>
>>>>> Thank you very much for your extremely quick reply! The curl session
>>>>> with the X-Stream:true flag is below; as you can see, it still hangs.
>>>>> The graph database is actually empty. The actual response of the
>>>>> server to the curl message is at the end of the non-hung curl
>>>>> transcript above.
>>>>>
>>>>> The configuration for the server is exactly as in the community
>>>>> download, except for the following:
>>>>> In conf/neo4j-server.properties:
>>>>> org.neo4j.server.http.log.enabled=true
>>>>> org.neo4j.server.http.log.config=conf/neo4j-http-logging.xml
>>>>> org.neo4j.server.webserver.port=9494
>>>>> org.neo4j.server.webserver.https.port=9493
>>>>> org.neo4j.server.webserver.maxthreads=320
>>>>> In conf/neo4j-wrapper.conf:
>>>>> wrapper.java.additional=-XX:ParallelGCThreads=32
>>>>> wrapper.java.additional=-XX:ConcGCThreads=32
>>>>>
>>>>> I've attached the jstack thread dump and data/graph.db/messages.log
>>>>> files to this message. The hung curl session looks like:
>>>>> > curl --trace-ascii - -X POST -H X-Stream:true -H "Content-Type: application/json" -d '{"query":"start a= node(*) return a"}' http://localhost:9494/db/data/cypher
>>>>> == Info: About to connect() to localhost port 9494 (#0)
>>>>> == Info:   Trying 127.0.0.1... == Info: connected
>>>>> == Info: Connected to localhost (127.0.0.1) port 9494 (#0)
>>>>> => Send header, 237 bytes (0xed)
>>>>> 0000: POST /db/data/cypher HTTP/1.1
>>>>> 001f: User-Agent: curl/7.19.0 (x86_64-suse-linux-gnu) libcurl/7.19.0 O
>>>>> 005f: penSSL/0.9.8h zlib/1.2.7 libidn/1.10
>>>>> 0085: Host: localhost:9494
>>>>> 009b: Accept: */*
>>>>> 00a8: X-Stream:true
>>>>> 00b7: Content-Type: application/json
>>>>> 00d7: Content-Length: 37
>>>>> 00eb:
>>>>> => Send data, 37 bytes (0x25)
>>>>> 0000: {"query":"start a= node(*) return a"}
>>>>> ...and at this point it hangs...
>>>>>
>>>>> On Tuesday, March 25, 2014 3:19:36 PM UTC-4, Michael Hunger wrote:
>>>>>
>>>>>> Joel,
>>>>>>
>>>>>> can you add the X-Stream:true header?
>>>>>>
>>>>>> How many nodes do you have in your graph? If you return them all, it
>>>>>> is quite a large amount of data that's returned. Without the
>>>>>> streaming header, the server builds up the response in memory, and
>>>>>> that most probably causes GC pauses or it just blows up with an OOM.
>>>>>>
>>>>>> What is your memory config for your Neo4j server? Both in terms of
>>>>>> heap and mmio config?
>>>>>>
>>>>>> Any chance to share your data/graph.db/messages.log for some
>>>>>> diagnostics?
>>>>>>
>>>>>> A thread dump in the case when it hangs would also be super helpful,
>>>>>> either with jstack <pid> or kill -3 <pid> (in the second case the
>>>>>> stacks will end up in data/log/console.log).
>>>>>>
>>>>>> Thanks so much,
>>>>>>
>>>>>> Michael
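The kill -3 route Michael mentions works as below. Only the signal and the data/log/console.log destination come from his message; the pid-file path is an assumption about the stock 2.0.x tarball layout.

    # SIGQUIT makes the JVM print all thread stacks without killing it;
    # under the bundled service wrapper they land in data/log/console.log.
    kill -3 "$(cat data/neo4j-service.pid)"   # pid-file path is an assumption
    tail -n 100 data/log/console.log          # the dump is appended at the end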
>>>>>>
>>>>>> On Tue, Mar 25, 2014 at 8:04 PM, Joel Welling <[email protected]> wrote:
>>>>>>
>>>>>>> Hi folks;
>>>>>>> I am running neo4j on an SGI UV machine. It has a great many cores,
>>>>>>> but only a small subset (limited by the cpuset) are available to my
>>>>>>> neo4j server. If I run neo4j community-2.0.1 with a configuration
>>>>>>> which is out-of-the-box except for setting -XX:ParallelGCThreads=32
>>>>>>> and -XX:ConcGCThreads=32 in my neo4j-wrapper.conf, too many threads
>>>>>>> are allocated for the cores I actually have.
>>>>>>> I can prevent this by setting org.neo4j.server.webserver.maxthreads
>>>>>>> to some value, but the REST interface then hangs. For example, here
>>>>>>> is a curl command which works if maxthreads is not set but hangs if
>>>>>>> it is set, even with a relatively large value like 320 threads:
>>>>>>>
>>>>>>> > curl --trace-ascii - -X POST -H "Content-Type: application/json" -d '{"query":"start a= node(*) return a"}' http://localhost:9494/db/data/cypher
>>>>>>> == Info: About to connect() to localhost port 9494 (#0)
>>>>>>> == Info:   Trying 127.0.0.1... == Info: connected
>>>>>>> == Info: Connected to localhost (127.0.0.1) port 9494 (#0)
>>>>>>> => Send header, 213 bytes (0xd5)
>>>>>>> 0000: POST /db/data/cypher HTTP/1.1
>>>>>>> 001f: User-Agent: curl/7.21.3 (x86_64-unknown-linux-gnu) libcurl/7.21.
>>>>>>> 005f: 3 OpenSSL/0.9.8h zlib/1.2.7
>>>>>>> 007c: Host: localhost:9494
>>>>>>> 0092: Accept: */*
>>>>>>> 009f: Content-Type: application/json
>>>>>>> 00bf: Content-Length: 37
>>>>>>> 00d3:
>>>>>>> => Send data, 37 bytes (0x25)
>>>>>>> 0000: {"query":"start a= node(*) return a"}
>>>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< HANGS AT THIS POINT
>>>>>>> <= Recv header, 17 bytes (0x11)
>>>>>>> 0000: HTTP/1.1 200 OK
>>>>>>> <= Recv header, 47 bytes (0x2f)
>>>>>>> 0000: Content-Type: application/json; charset=UTF-8
>>>>>>> <= Recv header, 32 bytes (0x20)
>>>>>>> 0000: Access-Control-Allow-Origin: *
>>>>>>> <= Recv header, 20 bytes (0x14)
>>>>>>> 0000: Content-Length: 41
>>>>>>> <= Recv header, 32 bytes (0x20)
>>>>>>> 0000: Server: Jetty(9.0.5.v20130815)
>>>>>>> <= Recv header, 2 bytes (0x2)
>>>>>>> 0000:
>>>>>>> <= Recv data, 41 bytes (0x29)
>>>>>>> 0000: {. "columns" : [ "a" ],. "data" : [ ].}
>>>>>>> {
>>>>>>>   "columns" : [ "a" ],
>>>>>>>   "data" : [ ]
>>>>>>> }== Info: Connection #0 to host localhost left intact
>>>>>>> == Info: Closing connection #0
>>>>>>>
>>>>>>> If I were on a 32-core machine rather than a 2000-core machine,
>>>>>>> maxthreads=320 would be the default. Thus I'm guessing that something
>>>>>>> is competing for threads within that 320-thread pool, or else the
>>>>>>> server is internally calculating a ratio of threads per core and
>>>>>>> that ratio is yielding zero on my machine. Is there any way to work
>>>>>>> around this?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Joel Welling
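Joel's threads-per-core guess at the end is checkable from inside the cpuset. None of these commands appear in the thread; they are an assumed diagnostic sketch (the pgrep pattern in particular).

    # See how many CPUs the server JVM may actually use; this is what
    # Runtime.availableProcessors() and any threads-per-core heuristic see.
    nproc                                              # CPUs visible to this shell
    NEO4J_PID=$(pgrep -f 'org.neo4j.server' | head -n 1)
    grep Cpus_allowed_list "/proc/$NEO4J_PID/status"   # cpuset as the JVM sees it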
Attachments: jstack_out3.gz, jstack_small_32.gz
