Hi Eugene,

so, are you currently using embedded mode? It's not entirely clear from your 
description: it reads like you're running the server, but your topology 
diagram says embedded ... ?

So what is the server spending its time on? CPU, IO, IO waits, GC?
Would it be possible for you to gather some stack traces while it is really 
busy? Or to connect a profiler and do a quick profiling run?
It might also be IO related, since you are on Linux; see this: 
http://structr.org/blog/neo4j-performance-on-ext4

With the server you can probably get further by:

#1 using the batching endpoint to send multiple Cypher statements with 
parameters to the db
#2 aggregating data that belongs to the same subgraph into one batch
#3 using a batch size of 30-50k elements
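As a rough sketch of #1 and #3: the /db/data/batch and /cypher paths below are 
the 1.9-era REST endpoints, while the statements themselves and the 30k batch 
size are placeholders, not a definitive recipe:

```python
import json
from itertools import islice

BATCH_SIZE = 30_000  # 30-50k elements per batch, per the advice above

def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def batch_payload(statements):
    """Build one request body for the REST batch endpoint
    (POST /db/data/batch): an array of jobs, each wrapping one
    parameterized Cypher statement aimed at the /cypher endpoint."""
    return json.dumps([
        {"method": "POST",
         "to": "/cypher",
         "body": {"query": query, "params": params},
         "id": i}
        for i, (query, params) in enumerate(statements)
    ])

# Made-up stream of parameterized statements:
stmts = [("CREATE (n {props})", {"props": {"seq": i}}) for i in range(70_000)]
payloads = [batch_payload(chunk) for chunk in chunked(stmts, BATCH_SIZE)]
# 70k statements -> three payloads of 30k, 30k, and 10k jobs
```

An HTTP client can then POST each payload to the batch endpoint with 
Content-Type: application/json; the batch endpoint applies a whole job list 
in one transaction.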

If not, it might make sense to write an unmanaged extension whose external API 
and protocol you control, e.g. pushing JSON, binary, or XML to it.

That extension can then use the Java core API internally, with optimized 
performance, transaction batching, etc.
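For illustration only (the record fields below are invented, not part of any 
Neo4j API), the feed side of such an extension could encode each batch as 
newline-delimited JSON, which is cheap to stream and to parse server-side:

```python
import json

def ndjson_body(records):
    """Encode a batch of records as newline-delimited JSON,
    one record per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)

# Hypothetical node and relationship records from the feed:
batch = [
    {"op": "node", "key": "a1", "props": {"name": "alpha"}},
    {"op": "rel", "from": "a1", "to": "b2", "type": "LINKS"},
]
body = ndjson_body(batch)
# POST `body` to the extension's endpoint with any HTTP client; the
# extension parses one record per line and writes via the core API
# in a single transaction per batch.
```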

I wrote a simple extension that stores Cypher statements at endpoints you can 
POST JSON or CSV to for writes; yours could work similarly but would probably 
be more domain-specific.

see this: https://github.com/jexp/cypher-rs

HTH

Michael

Am 18.12.2013 um 00:03 schrieb Eugene pr3d4t0r Ciurana <[email protected]>:

> Hi Michael - I'm in the same project as Guan.
> 
> Use case:  continuous feed of nodes and relationships, millions per day, 
> 7x24.  Nodes are in the 150 to 10,000 byte range.  Nodes and relationships 
> come from multiple data sources.  Relationships can be between any two 
> nodes, regardless of source.
> 
> OS:  our integration, testing, and production environments are Linux RHEL, 44 
> GB RAM dedicated to the embedded Neo4J app server.  Our dev workstations are 
> OS X, 16 GB RAM + miscellaneous IDEs and we can push relatively high volume 
> on those.
> 
> Oracle latest JRE on all of the above, tuned for high memory usage, 
> concurrent GC.
> 
> App container:  Mule 3.4.0 with two dedicated flows:  one for node commits, 
> one for relationship commits.
> 
> Nodes and relationships are fed from the high volume cluster via 
> JMS/ActiveMQ.  The Neo4J app server only does two things:  subscribes to 
> ActiveMQ and commits data to the DB.  No other activity or application runs 
> there.  There are separate queues for nodes and for relationships, one each, 
> no conversations.
> 
> Topology:  http://eugeneciurana.com/personal/images/Neo4J-topology.png
> 
> Neo4J version:  1.9.4
> 
> Now -- the main issue is that none of us has a full picture of how to tune 
> Neo4J.  Guan has been doing a great job of unearthing the details, but we 
> seem to have hit a wall.  We're able to push 1-10 million 
> nodes or relationships, then we see memory exceptions (Guan can explain more 
> - hopefully he'll see this later).  After checking the servers via JMX and 
> other instrumentation, we see that processors and memory are super lean, and 
> that the machine is mostly idling.  Having tuned (poorly, I admit it) Neo4J 
> stand-alone + REST a few weeks ago, I figured that we somehow need to tell 
> the embedded Neo4J how to map memory -- and to solve some of the other issues 
> that Guan mentioned in his original post.
> 
> So -- the app server has all the resources it needs.  The Java container has 
> more than enough memory.  Everything (JVM, OS, supporting libraries, etc.) is 
> up-to-date.
> 
> Thanks in advance for your help -- we look forward to hearing from you.
> 
> (We at some point in the future intend to cluster...  not yet, though.  We 
> can't even get this to work well with a single instance yet.  You may ignore 
> those "read only" instances in the diagram.)
> 
> Cheers!
> 
> pr3d4t0r
> ----
> On Tuesday, December 17, 2013 3:11:28 PM UTC-6, Michael Hunger wrote:
> What is your use-case? What is the large amount of data you're writing to the 
> graph?
> What OS are you working on?
> And what Neo4j version?
> 
> Increasing the memory-mapping settings also helps with writes, especially the 
> settings for the node store and relationship store; the more of those that 
> can be memory-mapped, the more can be written in parallel.
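In 1.9 those settings live in conf/neo4j.properties; the sizes below are 
placeholders for illustration and should be sized to your actual store files, 
not treated as recommendations:

```properties
# Map as much of the node and relationship stores as RAM allows
# (values here are placeholders):
neostore.nodestore.db.mapped_memory=1G
neostore.relationshipstore.db.mapped_memory=4G
neostore.propertystore.db.mapped_memory=2G
neostore.propertystore.db.strings.mapped_memory=1G
neostore.propertystore.db.arrays.mapped_memory=500M
```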
> 
> Neo4j supports concurrent writes, but it 
> 
> #1 serializes commits on writing to the transaction log
> #2 locks nodes and relationships if you change properties 
> #3 locks both nodes if you add a relationship
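Because of #2 and #3, one way to load from multiple threads is to route each 
relationship to a fixed writer thread based on its endpoints, so the same 
nodes tend to be touched by the same thread. A sketch -- the worker count and 
record shape are made up, and this reduces rather than eliminates contention:

```python
N_WORKERS = 4  # assumption: one single-threaded writer per partition

def writer_for(rel):
    """Pick a writer for a relationship based on its 'smaller'
    endpoint, so all relationships anchored on that node are
    serialized through one thread.  The other endpoint can still
    collide across writers, so deadlock retries are still needed."""
    anchor = min(rel["from"], rel["to"])
    return hash(anchor) % N_WORKERS

# The same pair lands on the same writer in either direction:
a_b = writer_for({"from": "a", "to": "b"})
b_a = writer_for({"from": "b", "to": "a"})
```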
> 
> Common practice is to use a large enough tx size (e.g. 30-50k elements) per 
> commit, and to aggregate updates so that they write to different, disjoint 
> subgraphs of the data.
> 
> HTH,
> 
> 
> Michael
> 
> 
> otherwise see the blog posts referred to from: http://neo4j.org/develop/import
> 
> 
> Am 17.12.2013 um 19:49 schrieb Guan Guan <[email protected]>:
> 
>> Hi,
>> 
>> In our use case, we need to do a lot of data importing/updating every day 
>> (billions of nodes/relationships).
>> 
>> What's the best way to tune the configuration to boost performance?
>> 
>> Does the kernel config help data ingestion?  Settings like 
>> 'neostore.propertystore.db.strings.mapped_memory' are for the query cache 
>> only, am I correct?  Do these parameters help with data import performance?
>> 
>> One more question: how does Neo4j embedded handle concurrent data imports?  
>> I have multiple threads writing to the embedded database at the same time, 
>> and I always get a locking exception saying multiple threads tried to lock 
>> the same relationship.  What's the recommended way to ingest data from 
>> multiple threads?
>> 
>> Thanks,
>> 
>> Guan
>> 
>> 
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Neo4j" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.
> 
> 
