On Tuesday, December 17, 2013 6:41:14 PM UTC-6, Michael Hunger wrote:
>
> So are you currently using embedded? Not really sure from your description; it looks as if your topology says embedded...?
>
> So what is the server spending its time on? CPU, IO, IO waits, GC?
> Would it be possible for you to gather some stack traces while it is really busy? Or to connect a profiler and do a quick profiling run?
> It might also be IO-related since it is on Linux; see this:
> http://structr.org/blog/neo4j-performance-on-ext4
>
> With the server you can probably get further by:
>
> #1 using the batching endpoint to send multiple Cypher statements with parameters to the db
> #2 trying to aggregate data that belongs in subgraphs into one batch
> #3 using a batch size of 30-50k elements
>
> If not, it might make sense to write an unmanaged extension whose external API and protocol you control, e.g. pushing JSON, binary, or XML to it.
>
> That extension can then use the Java core API internally with optimized performance, transaction batching, etc.
>
> I wrote a simple extension for storing Cypher statements at endpoints that you can post JSON or CSV to for writes; yours could be/work similarly, but probably more domain-specific.
>
> See this: https://github.com/jexp/cypher-rs
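A note on #1 above: in Neo4j 2.0 (current at the time of this thread) the way to send multiple parameterized Cypher statements in one HTTP round trip is to POST a `{"statements": [...]}` body to the transactional endpoint. The sketch below just builds that JSON envelope by hand so it stays self-contained; the Cypher text and parameter names are illustrative, and a real client would use a JSON library (e.g. Jackson) rather than string concatenation.

```java
import java.util.List;

// Hedged sketch: builds the JSON body for Neo4j 2.0's transactional
// Cypher endpoint (POST /db/data/transaction/commit). Each entry is a
// parameterized statement; batching many of them into one request is
// what "the batching endpoint" above refers to. Statement text and
// parameter names here are illustrative assumptions.
public class BatchPayload {

    // Minimal JSON string quoting (a real client would use Jackson).
    static String q(String s) {
        return '"' + s.replace("\"", "\\\"") + '"';
    }

    // One {"statement": ..., "parameters": ...} entry.
    static String statement(String cypher, String paramsJson) {
        return "{" + q("statement") + ":" + q(cypher) + ","
                   + q("parameters") + ":" + paramsJson + "}";
    }

    // Wrap many statements into the {"statements":[...]} envelope.
    static String payload(List<String> statements) {
        return "{" + q("statements") + ":[" + String.join(",", statements) + "]}";
    }

    public static void main(String[] args) {
        String body = payload(List.of(
            statement("CREATE (n:Item {name: {name}})", "{\"name\":\"a\"}"),
            statement("CREATE (n:Item {name: {name}})", "{\"name\":\"b\"}")));
        System.out.println(body);
        // POST this body with Content-Type: application/json to
        // http://localhost:7474/db/data/transaction/commit
    }
}
```

The point of the envelope is that all statements in one request execute in one transaction, so the per-commit overhead is amortized across the whole batch.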
Hi again!

1. Yes, the DBAgent + R/W Neo4J is using embedded, no Cypher, dealing with Nodes and Relationships from the Java API. The other machines in the cluster will run Neo4J HA once we get the master R/W going well. For the purposes of this conversation, it's just DBAgent + embedded R/W Neo4J.

2. Server: I saw only the JMX curves, and CPU, I/O, etc. were rather idle most of the time. During exception handling, when I tested on my Mac (not on Linux), the occasional exception was accompanied by a spike in CPU activity (50% to 600%). I'll ask another of our colleagues and/or Guan to get the profiler output soon.

3. Thanks for the link on the I/O issues on Linux -- reading it as soon as I finish writing this reply.

4. We are using batching. We have two different client abstractions: one uses embedded Neo4J and processes nodes and relationships in batches of 1,000, and one uses the RESTful API against a stand-alone Neo4J instance in batches of 100. The batch size is configurable; we've gone as high as 4,000 and 250, respectively, but the frequency of exceptions in Neo4J increases with batch size. 1,000 and 100 were our sweet spots, empirical results after a few thousand tests.

5. Aggregate data -- good to know -- I'll discuss this with Guan; he's the Neo4J expert, I'm the architect/server scalability/high-load guy -- will check on it, thx.

6. We have never been able to get more than 2,000 elements per batch; see the next-to-last point.

7. Unmanaged extension -- that's what DBAgent + ActiveMQ is doing -- we're trying to slow down the commits and serialize them through one thread for relationships, and up to 4 threads for nodes. The other servers may push up to 100 million nodes/relationships to the DBAgent, which wait in the queue to be picked up.

8.
We discussed the Cypher recommendations internally (I think you or one of your colleagues told me to pursue that on IRC), but there's some other reason why we are using the native Java Nodes + Relationships API instead. Guan will know; waiting for him to add to this thread.

We'll gather these data, then get back to this thread with the new information, probably tomorrow. Guan may jump in earlier to bring more information.

Thanks and cheers!

pr3d4t0r

--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.
