Hi, guys.

I'm conducting a proof-of-concept for a large bank (Luca, we had a phone 
conference on August 5...) and I'm trying to bulk-insert a humongous 
amount of data: 1 million vertices and 1 billion edges.

Firstly, I'm impressed by how easy it was to configure a cluster. 
However, batch-insert performance is poor, and it gets considerably worse 
as I add more data. Throughput starts at about 2k vertices-and-edges per 
second and deteriorates to about 500/second after only about 3 million 
edges have been inserted; loading those 3 million edges takes ~30 
minutes. Needless to say, 1 billion payments (edges) would take well over 
a week at this rate.

This is a show-stopper for us.

My data model is simply payments between accounts, and I store it in one 
large file. Each record is just 3 fields and looks like:

FROM_ACCOUNT TO_ACCOUNT AMOUNT

In the test data I generated, I had 1 million accounts and 1 billion 
payments randomly distributed between pairs of accounts.
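
For concreteness, this is roughly how such a file can be produced (the 
file name and the ACC-prefixed account-number format below are 
illustrative, not my actual values):

import java.io.PrintWriter;
import java.util.Random;

public class GeneratePayments {
    public static void main(String[] args) throws Exception {
        final int ACCOUNTS = 1_000_000;        // 1 million accounts
        final long PAYMENTS = 1_000_000_000L;  // 1 billion payments
        final Random rnd = new Random();
        // "payments.txt" and the ACC%07d format are placeholders.
        try (PrintWriter out = new PrintWriter("payments.txt")) {
            for (long i = 0; i < PAYMENTS; i++) {
                String from = String.format("ACC%07d", rnd.nextInt(ACCOUNTS));
                String to   = String.format("ACC%07d", rnd.nextInt(ACCOUNTS));
                double amount = rnd.nextInt(1_000_000) / 100.0;
                out.println(from + " " + to + " " + amount);
            }
        }
    }
}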

I have 2 classes in OrientDB: ACCOUNTS (extending V) and PAYMENT (extending 
E). There is a UNIQUE_HASH_INDEX on ACCOUNTS for the account number (a 
string).
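
In code, the schema setup looks essentially like this (a sketch against 
the 2.2 Graph API; the property name "account", the host, the database 
name and the credentials are placeholders):

import com.orientechnologies.orient.core.metadata.schema.OClass;
import com.orientechnologies.orient.core.metadata.schema.OType;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;
import com.tinkerpop.blueprints.impls.orient.OrientVertexType;

public class CreateSchema {
    public static void main(String[] args) {
        OrientGraphFactory factory =
            new OrientGraphFactory("remote:host1/payments", "admin", "admin");
        OrientGraphNoTx g = factory.getNoTx(); // schema changes outside a tx
        try {
            OrientVertexType acc = g.createVertexType("ACCOUNTS"); // extends V
            acc.createProperty("account", OType.STRING);  // account number
            acc.createIndex("ACCOUNTS.account",
                OClass.INDEX_TYPE.UNIQUE_HASH_INDEX, "account");
            g.createEdgeType("PAYMENT");                  // extends E
        } finally {
            g.shutdown();
        }
    }
}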

We're using OrientDB 2.2.7.

My batch size is 5k and I am using the "remote" protocol to connect to our 
cluster.

I'm using JDK 8 and my 3 boxes are beefy machines (32 cores each), but 
without SSDs. I wrote the importing code myself, did nothing 'clever' (I 
think), and used the Graph API. The client code has been given plenty of 
heap, and jstat shows it is not GCing excessively.
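
In case it helps, the importer is essentially the loop below (simplified; 
the file name, the "account" property name and the connection details are 
placeholders for our real ones):

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;

public class ImportPayments {
    public static void main(String[] args) throws Exception {
        OrientGraphFactory factory =
            new OrientGraphFactory("remote:host1/payments", "admin", "admin");
        OrientGraph g = factory.getTx(); // transactional graph
        long count = 0;
        try (BufferedReader in =
                 Files.newBufferedReader(Paths.get("payments.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split(" ");
                Vertex from = getOrCreateAccount(g, f[0]);
                Vertex to = getOrCreateAccount(g, f[1]);
                g.addEdge(null, from, to, "PAYMENT")
                 .setProperty("amount", Double.parseDouble(f[2]));
                if (++count % 5_000 == 0) {
                    g.commit(); // flush a 5k batch
                }
            }
            g.commit(); // flush the final partial batch
        } finally {
            g.shutdown();
        }
    }

    // Look up an ACCOUNTS vertex via the unique hash index; create it
    // if it doesn't exist yet.
    static Vertex getOrCreateAccount(OrientGraph g, String acct) {
        for (Vertex v : g.getVertices("ACCOUNTS.account", acct)) {
            return v; // index hit
        }
        return g.addVertex("class:ACCOUNTS", "account", acct);
    }
}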

So, my questions are:

1. What kind of performance can I realistically expect, and can I 
improve on what I have at the moment?

2. What kind of degradation should I expect as the graph grows?

Thanks, guys.

Phillip


