Regarding 1.9 vs. 2.0: if your project is just starting, then I would recommend 2.0 (2.1, with further performance enhancements, is the next version).
If you're already in production, I would probably wait for 2.1 if you don't want to upgrade too many pieces of your application now. Feel free to share more of your code with us (also privately) for a code review.

Michael

On 18 Dec 2013, at 03:33, Michael Hunger <[email protected]> wrote:

> How large are your store files on disk?
>
> I would increase the memory-mapped relationship-store file size.
>
> Why do you configure your perm size to 16GB? I would leave it as it is, or at most 1-2G.
>
> I don't think you need a 45G heap; that will actually limit the memory available for memory mapping.
>
> I'd go with a 16GB heap (Xms and Xmx) and use the other 25G for mmio, _depending_ on your store-file sizes -- e.g. 2G for nodes, 10G for rels, and 5G each for properties, strings, and arrays.
>
> Also, the GCR cache of the enterprise version is more memory-efficient and GC-friendly.
>
> The transaction model only changed slightly, with no relevance for write operations. The internals of the kernel changed a lot, though.
>
> Yes, Neo4j keeps the transactions in memory, but as 10k nodes should be done within a second or two, they shouldn't stay around for long.
>
> Please have a look at the thread dumps under load; I assume many of your threads are blocked by each other. Try to group updates by subgraph.
>
> It would make sense to get someone with experience in Neo4j to look at your setup/config. Where are you located?
>
> Perhaps you can also explain more about the data model.
>
> Michael
>
> On 18 Dec 2013, at 03:17, Guan Guan <[email protected]> wrote:
>
>> Hi Michael,
>>
>> Thank you very much for the reply.
>>
>> We are using Neo4j 1.9.4, embedded mode, with its Java API.
>>
>> Here is the code I use to initialize the embedded graph DB:
>>
>>     graphDB = new GraphDatabaseFactory()
>>         .newEmbeddedDatabaseBuilder(this.path)
>>         .setConfig(dbConfig)
>>         .newGraphDatabase();
>>
>> And dbConfig (Map<String, String>) has all the memory configuration:
>>
>>     <spring:entry key="neostore.nodestore.db.mapped_memory" value="1126M"></spring:entry>
>>     <spring:entry key="neostore.relationshipstore.db.mapped_memory" value="614M"></spring:entry>
>>     <spring:entry key="neostore.propertystore.db.mapped_memory" value="128M"></spring:entry>
>>     <spring:entry key="neostore.propertystore.db.strings.mapped_memory" value="1126M"></spring:entry>
>>     <spring:entry key="neostore.propertystore.db.arrays.mapped_memory" value="614M"></spring:entry>
>>     <spring:entry key="allow_store_upgrade" value="true"></spring:entry>
>>
>> Is this the correct way to configure memory for the embedded database?
>>
>> And here is the heap config:
>>
>>     # Memory
>>     wrapper.java.additional.6=-Xmx45056m
>>     wrapper.java.additional.7=-XX:+UseConcMarkSweepGC
>>     wrapper.java.additional.8=-XX:-UseGCOverheadLimit
>>     wrapper.java.additional.9=-XX:GCTimeRatio=19
>>     wrapper.java.additional.10=-XX:MaxPermSize=16384m
>>
>> Some more questions: Is performance improved in Neo4j 2.0 compared to 1.9.x? Did the Neo4j 2.0 transaction strategy change? From the documentation, I see that all operations need to be within a transaction. Does that help avoid issues with concurrent operations?
>>
>> If I wrap a batch of operations in one transaction -- for example, creating 10k nodes in one transaction -- does Neo4j use its memory to keep these transactions? Could it cause a memory leak if the batch size is too big?
>>
>> Thanks,
>>
>> Guan
>>
>> On Tuesday, December 17, 2013 5:16:39 PM UTC-8, Michael Hunger wrote:
>> I think you might run into contention issues between the different threads inserting nodes and rels.
>>
>> Mostly around locking.
>> Whenever you connect nodes, both will be locked, so other threads have to wait until they can update those nodes too. That's where the subgraph aggregation helps.
>>
>> Usually Neo4j embedded can import 10-20k nodes/s if it is not contending on locks.
>>
>> Can you share your store-file sizes for nodes, rels, and properties? And your mmio config? And your heap config?
>>
>> Best is to share your messages.log, which contains all of the above :)
>>
>> Cheers
>>
>> Michael
>>
>> On 18 Dec 2013, at 02:11, Eugene pr3d4t0r Ciurana <[email protected]> wrote:
>>
>>> On Tuesday, December 17, 2013 6:41:14 PM UTC-6, Michael Hunger wrote:
>>> So are you currently using embedded? I'm not really sure from your description: it looks as if you're using the server, but your topology says embedded?
>>>
>>> So what is the server spending its time on? CPU, IO, IO waits, GC? Would it be possible for you to gather some stack traces while it is really busy? Or to connect a profiler and do a quick profiling run? It might also be IO-related, as it is on Linux; see this:
>>> http://structr.org/blog/neo4j-performance-on-ext4
>>>
>>> With the server you can probably get further by:
>>>
>>> #1 using the batching endpoint to send multiple Cypher statements with parameters to the db
>>> #2 trying to aggregate data that belongs to the same subgraph into one batch
>>> #3 using a batch size of 30-50k elements
>>>
>>> If not, it might make sense to write an unmanaged extension whose external API and protocol you control, e.g. pushing JSON, binary, or XML to it. That extension can then use the Java core API internally, with optimized performance, transaction batching, etc.
>>>
>>> I wrote a simple extension for storing Cypher statements at endpoints that you can post JSON or CSV to for writes; yours could be/work similar, but probably more domain-specific. See this: https://github.com/jexp/cypher-rs
>>>
>>> Hi again!
>>>
>>> Yes, the DBAgent + R/W Neo4J is using embedded, no Cypher, dealing with Nodes and Relationships from the Java API. The other machines in the cluster would be Neo4J HA once we get the master R/W going well. For purposes of this conversation, it's just DBAgent + R/W Neo4J embedded.
>>> Server: I saw only the JMX curves, and all CPU, I/O, etc. were rather idle most of the time. During exception handling, when I tested on my Mac (not on Linux), the occasional exception/memory issue was accompanied by a spike in CPU activity (50% to 600%). I'll ask another of our friends and/or Guan to get the profiler output this way soon.
>>> Thanks for the link about I/O issues on Linux -- reading it as soon as I finish writing this reply.
>>> We are using batching. We have two different client abstractions: one uses embedded Neo4J and processes nodes and relationships in batches of 1,000, and one uses the RESTful API against a stand-alone Neo4J instance in batches of 100. The batch size is configurable; we've done as many as 4,000 and 250, respectively, but the frequency of exceptions in Neo4J increases with batch size. 1,000 and 100 were our sweet spot -- empirical results after a few thousand tests.
>>> Aggregate data -- good to know -- I'll discuss this with Guan; he's the Neo4J expert, I'm the architect/server scalability/high-load guy -- will check on it, thx.
>>> We have never been able to get more than 2,000 elements per batch; see the next-to-last point.
>>> Unmanaged extension -- that's what DBAgent + ActiveMQ is doing -- we're trying to slow down the commits and serialize them through one thread for relationships, and up to 4 threads for nodes. The other servers may push up to 100 million nodes/relationships to the DBAgent, which are waiting to be picked up from the queue.
>>> We discussed the Cypher recommendations internally (I think you or one of your friends told me to pursue that on IRC), but there's some other reason why we are using the native Java Nodes + Relationships API instead. Guan will know; waiting for him to add to this thread.
>>> We'll gather these data, then get back to this thread with new information, probably tomorrow. Guan may jump in earlier to bring more information.
>>>
>>> Thanks and cheers!
>>>
>>> pr3d4t0r
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups "Neo4j" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>> For more options, visit https://groups.google.com/groups/opt_out.
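For readers with a similar setup: Michael's suggested memory split (16G heap, ~25G mmio) could look like the fragment below. This is a sketch only -- the absolute sizes are placeholders taken from his example and should be matched against the actual store-file sizes on disk, as he advises.

```properties
# Hypothetical neo4j.properties mmio split following the advice above
# (Neo4j 1.9-era setting names; tune values to your real store-file sizes)
neostore.nodestore.db.mapped_memory=2G
neostore.relationshipstore.db.mapped_memory=10G
neostore.propertystore.db.mapped_memory=5G
neostore.propertystore.db.strings.mapped_memory=5G
neostore.propertystore.db.arrays.mapped_memory=5G
```

The heap itself would then be pinned with matching `-Xms16g -Xmx16g` JVM flags (via `wrapper.java.additional.*` lines, as in Guan's config), leaving the remaining RAM to the memory-mapped store files.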
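The batching advice in the thread -- wrap chunks of roughly 10k operations in one transaction each, rather than one huge transaction or one transaction per write -- can be sketched in plain Java. The class and method names below are illustrative, and the Neo4j 1.9 transaction calls (`beginTx()`, `success()`, `finish()`) appear only in comments so the example runs without the Neo4j jars:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TxBatcher {

    // Split a large list of pending writes into fixed-size chunks,
    // intended to become one transaction per chunk.
    public static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            chunks.add(items.subList(i, Math.min(i + chunkSize, items.size())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 100k pending updates, committed in 10 transactions of 10k each.
        List<Integer> pending = Collections.nCopies(100_000, 1);
        for (List<Integer> chunk : partition(pending, 10_000)) {
            // With the embedded 1.9 API, each chunk would be wrapped like:
            //   Transaction tx = graphDb.beginTx();
            //   try {
            //       for (Integer update : chunk) { /* createNode / createRelationshipTo */ }
            //       tx.success();
            //   } finally {
            //       tx.finish();  // 2.0 renames this close(); try-with-resources works there
            //   }
        }
        System.out.println(partition(pending, 10_000).size());
    }
}
```

Grouping each chunk by subgraph before committing, as Michael suggests, keeps the node locks taken by one transaction from blocking the others.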
