On Tuesday, December 17, 2013 6:41:14 PM UTC-6, Michael Hunger wrote:
>
> so are you currently using embedded? I'm not really sure from your 
> description -- it looks as if you're using server, but your topology says 
> embedded ... ??
>
> So what is the server spending its time on? CPU, IO, IO-Waits, GC ?
> Would it be possible for you to gather some stack traces while it is 
> really busy? Or to connect a profiler and do a quick profiling run?
> Also it might be IO related as it is on linux, see this: 
> http://structr.org/blog/neo4j-performance-on-ext4
>
> With server you can probably get farther by:
>
> #1 using the batching endpoint to send multiple cypher statements with 
> parameters to the db
> #2 try to aggregate data that belongs into subgraphs into one batch
> #3 batch size 30-50k elements
>
> If not, it might make sense to write an unmanaged extension whose external 
> API and protocol you control, e.g. pushing JSON, binary, or XML to it
>
> That extension can then use the java-core-API internally with optimized 
> performance, transaction batching etc.
>
> I wrote a simple extension for storing cypher statements at endpoints 
> that you can post JSON or CSV to for writes; yours could work similarly 
> but would probably be more domain-specific.
>
> see this: https://github.com/jexp/cypher-rs
>

Hi again!

Yes, the DBAgent + R/W Neo4J is using embedded, no Cypher, dealing with 
Nodes and Relationships from the Java API.  The other machines in the 
cluster would be Neo4J HA once we get the master R/W going well.  For 
purposes of this conversation, it's just DBAgent + R/W Neo4J embedded.
Server: I saw only the JMX curves, and CPU, I/O, etc. were rather idle most 
of the time.  During exception handling, when I tested on my Mac (not on 
Linux), the occasional exception or memory spike was accompanied by a spike 
in CPU activity (50% to 600%).  I'll ask another of our friends and/or Guan 
to get profiler output soon.
Thanks for the I/O issues link on Linux -- reading as soon as I finish 
writing this reply.
We are using batching.  We have two client abstractions:  one uses embedded 
Neo4J and processes nodes and relationships in batches of 1,000, and one 
uses the RESTful API against a stand-alone Neo4J instance in batches of 100. 
The batch size is configurable; we've gone as high as 4,000 and 250, 
respectively, but the frequency of exceptions in Neo4J increases with batch 
size.  1,000 and 100 were our sweet spots, empirical results from a few 
thousand tests.
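For concreteness, the commit-every-N pattern described above can be sketched 
like this.  This is a minimal stand-in, not our actual client code: the real 
Neo4J Java API calls (beginTx(), createNode(), createRelationshipTo(), etc.) 
are reduced to comments so only the commit cadence is shown:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the batching client: the real code would call
// GraphDatabaseService.beginTx() and the Node/Relationship API where the
// comments are.
public class BatchWriter {
    int commits = 0;      // how many transactions were committed
    final int batchSize;  // e.g. 1,000 for embedded, 100 for REST

    BatchWriter(int batchSize) { this.batchSize = batchSize; }

    // Write all items, committing one transaction per batchSize items.
    <T> void writeAll(List<T> items) {
        int inTx = 0;
        // tx = db.beginTx() would go here in the real embedded code
        for (T item : items) {
            // create the node/relationship for this item inside the tx
            inTx++;
            if (inTx == batchSize) {
                commits++;  // tx.success(); tx.close(); then begin a new tx
                inTx = 0;
            }
        }
        if (inTx > 0) commits++;  // commit the final partial batch
    }

    public static void main(String[] args) {
        BatchWriter w = new BatchWriter(1000);
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 2500; i++) items.add(i);
        w.writeAll(items);
        System.out.println(w.commits + " commits");  // 2 full batches + 1 partial
    }
}
```

The point of the pattern is just that transaction size, not item count, is 
the tunable knob -- which is why we could vary 100 vs. 1,000 per batch 
without touching the callers.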
Aggregate data -- good to know -- I'll discuss this with Guan; he's the 
Neo4J expert; I'm the architect/server scalability/high load guy -- will 
check on it, thx.
We have never been able to get more than 2,000 elements per batch; see 
next-to-last point.
Unmanaged extension -- that's what DBAgent + ActiveMQ is doing -- we're 
trying to slow down the commits and serialize them through one thread for 
relationships, and up to 4 threads for nodes.  The other servers may push up 
to 100 million nodes/relationships to the DBAgent, where they wait to be 
picked up from the queue.
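That threading split can be sketched with plain java.util.concurrent; this 
is an assumed shape, not the actual DBAgent code, with counters standing in 
for the real node/relationship writes pulled off the ActiveMQ queue:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Relationship writes go through a single-threaded executor so their commits
// are serialized and cannot race; node writes fan out across up to 4 threads.
public class WriterPools {

    // Submit 'tasks' node writes and 'tasks' relationship writes; return
    // {nodesWritten, relationshipsWritten} once both pools have drained.
    static int[] runWriters(int tasks) throws InterruptedException {
        ExecutorService relWriter   = Executors.newSingleThreadExecutor(); // serialized
        ExecutorService nodeWriters = Executors.newFixedThreadPool(4);     // parallel
        AtomicInteger nodes = new AtomicInteger();
        AtomicInteger rels  = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            nodeWriters.submit(nodes::incrementAndGet); // would create a node here
            relWriter.submit(rels::incrementAndGet);    // would create a relationship here
        }
        nodeWriters.shutdown();
        relWriter.shutdown();
        nodeWriters.awaitTermination(10, TimeUnit.SECONDS);
        relWriter.awaitTermination(10, TimeUnit.SECONDS);
        return new int[] { nodes.get(), rels.get() };
    }

    public static void main(String[] args) throws InterruptedException {
        int[] counts = runWriters(100);
        System.out.println(counts[0] + " nodes, " + counts[1] + " relationships");
    }
}
```

The single-threaded relationship pool is the "slow down the commits" part: 
it trades throughput for a guarantee that relationship writes never contend 
with each other.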
We discussed internally the Cypher recommendations (I think you or one of 
your friends told me to pursue that on IRC) but there's some other reason 
why we are using the Nodes + Relationships native Java API instead.  Guan 
will know.  Waiting for him to add to this thread.

We'll gather these data, then get back to this thread with new information, 
probably tomorrow.  Guan may jump in earlier with more information.

Thanks and cheers!

pr3d4t0r

-- 
You received this message because you are subscribed to the Google Groups 
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.
