Hi Michael,
Thank you very much for the reply.
We are using Neo4j 1.9.4 in embedded mode with its Java API.
Here is the code I use to initialize the embedded graph database:
graphDB = new GraphDatabaseFactory()
.newEmbeddedDatabaseBuilder(this.path)
.setConfig(dbConfig)
.newGraphDatabase();
And dbConfig (a Map<String, String>) holds all the memory settings:
<spring:entry key="neostore.nodestore.db.mapped_memory" value="1126M"></spring:entry>
<spring:entry key="neostore.relationshipstore.db.mapped_memory" value="614M"></spring:entry>
<spring:entry key="neostore.propertystore.db.mapped_memory" value="128M"></spring:entry>
<spring:entry key="neostore.propertystore.db.strings.mapped_memory" value="1126M"></spring:entry>
<spring:entry key="neostore.propertystore.db.arrays.mapped_memory" value="614M"></spring:entry>
<spring:entry key="allow_store_upgrade" value="true"></spring:entry>
Is this the correct way to configure memory for the embedded database?
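For reference, here is a minimal sketch of the same settings built as a plain
Java map that could be passed to setConfig(...); the key names match the
Spring entries above, and the class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: the mapped-memory settings from the Spring config above,
// built programmatically. The keys are Neo4j 1.9.x store settings;
// the class and method names are illustrative only.
public class DbConfigSketch {

    public static Map<String, String> dbConfig() {
        Map<String, String> config = new HashMap<String, String>();
        config.put("neostore.nodestore.db.mapped_memory", "1126M");
        config.put("neostore.relationshipstore.db.mapped_memory", "614M");
        config.put("neostore.propertystore.db.mapped_memory", "128M");
        config.put("neostore.propertystore.db.strings.mapped_memory", "1126M");
        config.put("neostore.propertystore.db.arrays.mapped_memory", "614M");
        config.put("allow_store_upgrade", "true");
        return config;
    }

    public static void main(String[] args) {
        // In the real application this map would be handed to
        // new GraphDatabaseFactory().newEmbeddedDatabaseBuilder(path)
        //     .setConfig(dbConfig()).newGraphDatabase();
        System.out.println(dbConfig().size());
    }
}
```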
And here is the heap config:
# Memory
wrapper.java.additional.6=-Xmx45056m
wrapper.java.additional.7=-XX:+UseConcMarkSweepGC
wrapper.java.additional.8=-XX:-UseGCOverheadLimit
wrapper.java.additional.9=-XX:GCTimeRatio=19
wrapper.java.additional.10=-XX:MaxPermSize=16384m
A few more questions: is performance improved in Neo4j 2.0 compared to
1.9.x? Has the transaction strategy changed in Neo4j 2.0? From the
documentation, I see that all operations need to be within a transaction.
Does that help avoid issues with concurrent operations?
If I wrap a batch of operations in one transaction (for example, creating
10k nodes in a single transaction), does Neo4j keep the transaction state
in memory, and could that exhaust the heap if the batch is too big?
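To illustrate the batch-size concern: if the 10k creates were split into
sub-batches, each committed separately, the state held per transaction stays
bounded. A minimal sketch of just the batching arithmetic follows; the commit
counter stands in for real tx.success()/tx.finish() calls, which would need a
GraphDatabaseService that is assumed, not shown:

```java
// Sketch: splitting a large import into sub-batches, each committed
// separately, so the transaction state held in heap stays bounded.
// The commit counter stands in for real Neo4j transaction commits.
public class BatchCommitSketch {

    // Returns how many commits a run of totalOps operations needs
    // when committing every batchSize operations.
    public static int runBatches(int totalOps, int batchSize) {
        int commits = 0;
        for (int i = 1; i <= totalOps; i++) {
            // ... create node i inside the currently open transaction ...
            if (i % batchSize == 0) {
                commits++; // here: tx.success(); tx.finish(); begin a new tx
            }
        }
        if (totalOps % batchSize != 0) {
            commits++; // commit the final partial batch
        }
        return commits;
    }

    public static void main(String[] args) {
        // 10k creates committed in batches of 1,000: ten small transactions
        // instead of one large one holding all 10k creates in heap at once.
        System.out.println(runBatches(10000, 1000));
    }
}
```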
Thanks,
Guan
On Tuesday, December 17, 2013 5:16:39 PM UTC-8, Michael Hunger wrote:
>
> I think you might run into contention issues between the different threads
> inserting nodes and rels.
>
> Mostly around locking. Whenever you connect nodes both will be locked, so
> other threads have to wait until they can update those nodes too.
> That's where the subgraph aggregation helps.
>
> Usually Neo4j embedded can import 10-20k nodes / s if it is not contending
> on locks.
>
> Can you share your store-file sizes for nodes, rels, properties?
> And your mmio config?
> And your heap config.
>
> Best, is to share your messages.log which contains all of the above :)
>
> Cheers
>
> Michael
>
> On 18.12.2013, at 02:11, Eugene pr3d4t0r Ciurana
> <[email protected]> wrote:
>
> On Tuesday, December 17, 2013 6:41:14 PM UTC-6, Michael Hunger wrote:
>>
>> So are you currently using embedded? I'm not really sure from your
>> description: it looks as if you're using the server, but your topology
>> says embedded ... ?
>>
>> So what is the server spending its time on? CPU, IO, IO waits, GC?
>> Would it be possible for you to gather some stack traces while it is
>> really busy? Or to connect a profiler and do a quick profiling run?
>> Also it might be IO related as it is on linux, see this:
>> http://structr.org/blog/neo4j-performance-on-ext4
>>
>> With server you can probably get farther by:
>>
>> #1 using the batching endpoint to send multiple cypher statements with
>> parameters to the db
>> #2 try to aggregate data that belongs to the same subgraph into one batch
>> #3 use a batch size of 30-50k elements
>>
>> If not, it might make sense to write an unmanaged extension whose
>> external API and protocol you control, e.g. pushing JSON, binary, or XML
>> to it.
>>
>> That extension can then use the java-core-API internally with optimized
>> performance, transaction batching etc.
>>
>> I wrote a simple extension for storing Cypher statements at endpoints
>> that you can POST JSON or CSV to for writes; yours could work similarly,
>> but probably more domain-specific.
>>
>> see this: https://github.com/jexp/cypher-rs
>>
>
> Hi again!
>
>
> 1. Yes, the DBAgent + R/W Neo4j is using embedded: no Cypher, dealing
> with Nodes and Relationships via the Java API. The other machines in the
> cluster would be Neo4j HA once we get the master R/W going well. For
> purposes of this conversation, it's just DBAgent + R/W Neo4j embedded.
> 2. Server - I saw only the JMX curves, and CPU, I/O, etc. were rather
> idle most of the time. During exception handling, when I tested on my Mac
> (not on Linux), the occasional exception or memory event was accompanied
> by a spike in CPU activity (50% to 600%). I'll ask another of our friends
> and/or Guan to get the profiler output this way soon.
> 3. Thanks for the I/O issues link on Linux -- reading as soon as I
> finish writing this reply.
> 4. We are using batching. We have two different client abstractions:
> one uses embedded Neo4j and processes nodes and relationships in batches
> of 1,000, and one uses the RESTful API against a stand-alone Neo4j
> instance in batches of 100. The batch size is configurable; we've done as
> many as 4,000 and 250, respectively, but the frequency of exceptions in
> Neo4j increases with batch size. 1,000 and 100 were our sweet spots,
> empirical results after a few thousand tests.
> 5. Aggregate data -- good to know -- I'll discuss this with Guan; he's
> the Neo4J expert; I'm the architect/server scalability/high load guy --
> will check on it, thx.
> 6. We have never been able to get more than 2,000 elements per batch;
> see next-to-last point.
> 7. Unmanaged extension -- that's what DBAgent + ActiveMQ is doing --
> we're trying to slow down the commits and serialize them through one
> thread for relationships and up to 4 threads for nodes. The other servers
> may push up to 100 million nodes/relationships to the DBAgent, where they
> wait to be picked up from the queue.
> 8. We discussed internally the Cypher recommendations (I think you or
> one of your friends told me to pursue that on IRC) but there's some other
> reason why we are using the Nodes + Relationships native Java API instead.
>
> Guan will know. Waiting for him to add to this thread.
>
> We'll gather these data, then get back to this thread with new
> information, probably tomorrow. Guan may jump earlier to bring more
> information.
>
> Thanks and cheers!
>
> pr3d4t0r
>
--
You received this message because you are subscribed to the Google Groups
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.