One more thing: could you briefly describe what use case you are optimizing for (e.g. high concurrency, a large dataset, or some such)?
Oh, and yes, if there is a possibility to get some brief time running a few benchmarks, I sure would be happy to try a few things out on a system of this scale.

Jake

On Thu, Mar 27, 2014 at 1:26 AM, Jacob Hansson <[email protected]> wrote:

> Joel: Thanks for the detailed info on this, super helpful, and I'm sorry
> you're running into issues on this system (which, holy crap, that is a huge
> machine).
>
> Looking through the stack traces, the entire system seems idle, waiting
> for requests from the network. Is this thread dump taken while the system
> is hung? If yes, try lowering the number of threads even more, to see if
> it has something to do with jetty choking on the network selection somehow
> (and I mean low as in give it 32 threads, to clearly rule out choking in
> the network layer). If this thread dump is from when the system is idle,
> could you send a thread dump from when it is hung after issuing a request
> like you described?
>
> Thanks a lot!
> Jake
>
> On Wed, Mar 26, 2014 at 10:57 PM, Joel Welling <[email protected]> wrote:
>
>> Hi Michael-
>>
>> > PS: Your machine is really impressive, I want to have one too :)
>> > 2014-03-25 21:02:11.800+0000 INFO [o.n.k.i.DiagnosticsManager]: Total Physical memory: 15.62 TB
>> > 2014-03-25 21:02:11.801+0000 INFO [o.n.k.i.DiagnosticsManager]: Free Physical memory: 12.71 TB
>>
>> Thank you! It's this machine:
>> http://www.psc.edu/index.php/computing-resources/blacklight . For quite
>> a while it was the world's largest shared-memory environment. If you want
>> to try some timings, it could probably be arranged.
>>
>> Anyway, I made the mods to neo4j.properties which you suggested, and I'm
>> afraid it still hangs in the same place. (I'll look into the disk
>> scheduler issue, but I can't make a global change like that on short
>> notice.) The new jstack and messages.log are attached.
>>
>> On Tuesday, March 25, 2014 7:39:06 PM UTC-4, Michael Hunger wrote:
>>
>>> Joel,
>>>
>>> I looked at your logs.
>>> It seems there is a problem with the automatic calculation of the MMIO
>>> for the Neo4j store files.
>>>
>>> Could you uncomment the first lines in conf/neo4j.properties related to
>>> memory mapping? Just the default values should be good to get going.
>>>
>>> Otherwise it is 14 bytes per node, 33 bytes per relationship, and 38
>>> bytes per 4 properties per node or rel.
>>>
>>> It currently tries to map several terabytes of memory :) which is
>>> definitely not ok!
>>>
>>> 2014-03-25 21:18:34.849+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.propertystore.db.strings] brickCount=0 brickSize=1516231424b mappedMem=1516231458816b (storeSize=128b)
>>> 2014-03-25 21:18:34.960+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.propertystore.db.arrays] brickCount=0 brickSize=1718395776b mappedMem=1718395863040b (storeSize=128b)
>>> 2014-03-25 21:18:35.117+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.propertystore.db] brickCount=0 brickSize=1783801801b mappedMem=1783801839616b (storeSize=41b)
>>> 2014-03-25 21:18:35.192+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.relationshipstore.db] brickCount=0 brickSize=2147483646b mappedMem=2186031398912b (storeSize=33b)
>>> 2014-03-25 21:18:35.350+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.nodestore.db.labels] brickCount=0 brickSize=0b mappedMem=0b (storeSize=68b)
>>> 2014-03-25 21:18:35.525+0000 INFO [o.n.k.i.n.s.StoreFactory]: [data/graph.db/neostore.nodestore.db] brickCount=0 brickSize=495500390b mappedMem=495500394496b (storeSize=14b)
>>>
>>> You should probably also switch your disk scheduler to deadline or noop
>>> instead of the currently configured cfq.
>>>
>>> Please ping me if that helped.
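[Michael's per-record sizes can be turned into a rough mmap budget. The sketch below uses hypothetical store counts; the numbers are placeholders for illustration, not Joel's actual data (his store is essentially empty, per the storeSize values in the log, which is why terabyte-scale mappings are clearly wrong):]

```shell
# Rough memory-mapping estimate from Michael's figures:
# 14 bytes/node, 33 bytes/relationship, 38 bytes per block of 4 properties.
nodes=100000000          # hypothetical: 100 M nodes
rels=500000000           # hypothetical: 500 M relationships
props=1000000000         # hypothetical: 1 G properties

node_bytes=$(( nodes * 14 ))
rel_bytes=$(( rels * 33 ))
prop_bytes=$(( props / 4 * 38 ))   # 38 bytes per 4 properties

total=$(( node_bytes + rel_bytes + prop_bytes ))
echo "approximate mmap requirement: $(( total / 1024 / 1024 )) MiB"
```

[A graph of that hypothetical size would need roughly 26 GiB of mapped memory in total, nowhere near the several terabytes the automatic calculation attempted above.]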
>>>
>>> Cheers
>>>
>>> Michael
>>>
>>> PS: Your machine is really impressive, I want to have one too :)
>>> 2014-03-25 21:02:11.800+0000 INFO [o.n.k.i.DiagnosticsManager]: Total Physical memory: 15.62 TB
>>> 2014-03-25 21:02:11.801+0000 INFO [o.n.k.i.DiagnosticsManager]: Free Physical memory: 12.71 TB
>>>
>>> On Tue, Mar 25, 2014 at 10:28 PM, Joel Welling <[email protected]> wrote:
>>>
>>>> Thank you very much for your extremely quick reply! The curl session
>>>> with the X-Stream:true flag is below; as you can see, it still hangs.
>>>> The graph database is actually empty. The actual response of the server
>>>> to the curl message is at the end of the non-hung curl transcript above.
>>>>
>>>> The configuration for the server is exactly as in the community
>>>> download, except for the following:
>>>>
>>>> In neo4j.properties:
>>>> org.neo4j.server.http.log.enabled=true
>>>> org.neo4j.server.http.log.config=conf/neo4j-http-logging.xml
>>>> org.neo4j.server.webserver.port=9494
>>>> org.neo4j.server.webserver.https.port=9493
>>>> org.neo4j.server.webserver.maxthreads=320
>>>>
>>>> In neo4j-wrapper.conf:
>>>> wrapper.java.additional=-XX:ParallelGCThreads=32
>>>> wrapper.java.additional=-XX:ConcGCThreads=32
>>>>
>>>> I've attached the jstack thread dump and data/graph.db/messages.log
>>>> files to this message. The hung curl session looks like:
>>>>
>>>> > curl --trace-ascii - -X POST -H X-Stream:true -H "Content-Type: application/json" -d '{"query":"start a= node(*) return a"}' http://localhost:9494/db/data/cypher
>>>> == Info: About to connect() to localhost port 9494 (#0)
>>>> == Info: Trying 127.0.0.1...
>>>> == Info: connected
>>>> == Info: Connected to localhost (127.0.0.1) port 9494 (#0)
>>>> => Send header, 237 bytes (0xed)
>>>> 0000: POST /db/data/cypher HTTP/1.1
>>>> 001f: User-Agent: curl/7.19.0 (x86_64-suse-linux-gnu) libcurl/7.19.0 O
>>>> 005f: penSSL/0.9.8h zlib/1.2.7 libidn/1.10
>>>> 0085: Host: localhost:9494
>>>> 009b: Accept: */*
>>>> 00a8: X-Stream:true
>>>> 00b7: Content-Type: application/json
>>>> 00d7: Content-Length: 37
>>>> 00eb:
>>>> => Send data, 37 bytes (0x25)
>>>> 0000: {"query":"start a= node(*) return a"}
>>>>
>>>> ...and at this point it hangs...
>>>>
>>>> On Tuesday, March 25, 2014 3:19:36 PM UTC-4, Michael Hunger wrote:
>>>>
>>>>> Joel,
>>>>>
>>>>> can you add the X-Stream:true header?
>>>>>
>>>>> How many nodes do you have in your graph? If you return them all, it
>>>>> is quite an amount of data that's returned. Without the streaming
>>>>> header, the server builds up the response in memory, and that most
>>>>> probably causes GC pauses or it just blows up with an OOM.
>>>>>
>>>>> What is your memory config for your Neo4j server? Both in terms of
>>>>> heap and mmio config?
>>>>>
>>>>> Any chance to share your data/graph.db/messages.log for some
>>>>> diagnostics?
>>>>>
>>>>> A thread dump from when it hangs would also be super helpful, either
>>>>> with jstack <pid> or kill -3 <pid> (in the second case it will end up
>>>>> in data/log/console.log).
>>>>>
>>>>> Thanks so much,
>>>>>
>>>>> Michael
>>>>>
>>>>> On Tue, Mar 25, 2014 at 8:04 PM, Joel Welling <[email protected]> wrote:
>>>>>
>>>>>> Hi folks;
>>>>>> I am running neo4j on an SGI UV machine. It has a great many cores,
>>>>>> but only a small subset (limited by the cpuset) are available to my
>>>>>> neo4j server.
>>>>>> If I run neo4j community-2.0.1 with a configuration which is
>>>>>> out-of-the-box except for setting -XX:ParallelGCThreads=32 and
>>>>>> -XX:ConcGCThreads=32 in my neo4j-wrapper.conf, too many threads are
>>>>>> allocated for the cores I actually have.
>>>>>> I can prevent this by setting server.webserver.maxthreads to some
>>>>>> value, but the REST interface then hangs. For example, here is a curl
>>>>>> command which works if maxthreads is not set but hangs if it is set,
>>>>>> even with a relatively large value like 320 threads:
>>>>>>
>>>>>> > curl --trace-ascii - -X POST -H "Content-Type: application/json" -d '{"query":"start a= node(*) return a"}' http://localhost:9494/db/data/cypher
>>>>>> == Info: About to connect() to localhost port 9494 (#0)
>>>>>> == Info: Trying 127.0.0.1...
>>>>>> == Info: connected
>>>>>> == Info: Connected to localhost (127.0.0.1) port 9494 (#0)
>>>>>> => Send header, 213 bytes (0xd5)
>>>>>> 0000: POST /db/data/cypher HTTP/1.1
>>>>>> 001f: User-Agent: curl/7.21.3 (x86_64-unknown-linux-gnu) libcurl/7.21.
>>>>>> 005f: 3 OpenSSL/0.9.8h zlib/1.2.7
>>>>>> 007c: Host: localhost:9494
>>>>>> 0092: Accept: */*
>>>>>> 009f: Content-Type: application/json
>>>>>> 00bf: Content-Length: 37
>>>>>> 00d3:
>>>>>> => Send data, 37 bytes (0x25)
>>>>>> 0000: {"query":"start a= node(*) return a"}
>>>>>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< HANGS AT THIS POINT
>>>>>> <= Recv header, 17 bytes (0x11)
>>>>>> 0000: HTTP/1.1 200 OK
>>>>>> <= Recv header, 47 bytes (0x2f)
>>>>>> 0000: Content-Type: application/json; charset=UTF-8
>>>>>> <= Recv header, 32 bytes (0x20)
>>>>>> 0000: Access-Control-Allow-Origin: *
>>>>>> <= Recv header, 20 bytes (0x14)
>>>>>> 0000: Content-Length: 41
>>>>>> <= Recv header, 32 bytes (0x20)
>>>>>> 0000: Server: Jetty(9.0.5.v20130815)
>>>>>> <= Recv header, 2 bytes (0x2)
>>>>>> 0000:
>>>>>> <= Recv data, 41 bytes (0x29)
>>>>>> 0000: {. "columns" : [ "a" ],.  "data" : [ ].}
>>>>>> {
>>>>>>   "columns" : [ "a" ],
>>>>>>   "data" : [ ]
>>>>>> }
>>>>>> == Info: Connection #0 to host localhost left intact
>>>>>> == Info: Closing connection #0
>>>>>>
>>>>>> If I were on a 32-core machine rather than a 2000-core machine,
>>>>>> maxthreads=320 would be the default. Thus I'm guessing that something
>>>>>> is competing for threads within that 320-thread pool, or else the
>>>>>> server is internally calculating a ratio of threads-per-core and that
>>>>>> ratio is yielding zero on my machine. Is there any way to work around
>>>>>> this?
>>>>>>
>>>>>> Thanks,
>>>>>> -Joel Welling
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "Neo4j" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> For more options, visit https://groups.google.com/d/optout.
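[Joel's closing guess implies the default webserver thread pool is sized at roughly ten threads per core; that ratio is an inference from his "32-core machine would default to maxthreads=320" remark in this thread, not a documented fact. A minimal sketch of the inferred rule:]

```shell
# Inferred sizing rule: default maxthreads ~= cores * 10
# (inference from Joel's 32 cores -> maxthreads=320 observation;
# the exact heuristic Neo4j/Jetty uses is not confirmed here).
cores=32
echo "implied default maxthreads: $(( cores * 10 ))"
```

[On a 2000-core machine restricted by a cpuset, a heuristic like this would massively oversize the pool, which is consistent with Joel needing to set maxthreads explicitly in the first place.]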
