Hi Phillip,

I remember that call :-) I have a few questions for you:
1. Did you get those numbers running with multiple servers?
2. How many?
3. I guess with the default configuration, right?
4. How are the ACCOUNT vertices distributed in terms of the number of edges? For example, 90% of the ACCOUNTs have fewer than 20 PAYMENT edges.

Below are my first 3 suggestions to run faster. I hope they can be applied to your case.

(1) Group creation of edges

If you read the file in its current order and look up or create an ACCOUNT vertex for each line, the same vertex gets looked up and saved many times, with each edge saved as an incremental operation. It's much more efficient to *order the file by FROM_ACCOUNT* (you could use the Linux "*sort*" command) and keep the transaction open until the FROM_ACCOUNT changes. This way you avoid updating the same vertex multiple times, because all the edges of one account are created in the same transaction. In OrientDB, transactions consume heap, so this is fine unless you have many thousands of elements; in that case, you should commit in blocks of about 5k per transaction.

(2) OrientDB Box

The first optimization here is cutting out *TCP/IP* by *embedding* the database in the same JVM as your application. This is what we call "OrientDB Box". Even if this looks "weird", many users run this way. Look at http://orientdb.com/orientdb-embedded/. An embedded instance can still replicate with other standalone servers or other OrientDB Boxes. The application stays the same, but the URL is *plocal:<database-path>* instead of *remote:*. You could try it to see how much the network latency between client and server affects your numbers.

(3) Concurrency

If you are using multiple servers as masters for your inserts, I suggest keeping the transaction size much smaller than 5k. You could try with just *16* (your cores/2) and then check the effect of increasing it.
This is because OrientDB distributes the clusters in use, but if concurrent distributed transactions work on the same clusters, there will be a lot of waiting because of internal locking.

Best Regards,

Luca Garulli
Founder & CEO
OrientDB LTD <http://orientdb.com/>

On 14 September 2016 at 05:17, Phillip Henry <[email protected]> wrote:

> Hi, guys.
>
> I'm conducting a proof-of-concept for a large bank (Luca, we had a 'phone
> conf on August 5...) and I'm trying to bulk insert a humongous amount of
> data: 1 million vertices and 1 billion edges.
>
> Firstly, I'm impressed by how easy it was to configure a cluster.
> However, the performance of batch inserting is bad (and seems to get
> considerably worse as I add more data). It starts at about 2k
> vertices-and-edges per second and deteriorates to about 500/second after
> only about 3 million edges have been added. This also takes ~ 30 minutes.
> Needless to say, 1 billion payments (edges) will take over a week at
> this rate.
>
> This is a show-stopper for us.
>
> My data model is simply payments between accounts and I store it in one
> large file. It's just 3 fields and looks like:
>
> FROM_ACCOUNT TO_ACCOUNT AMOUNT
>
> In the test data I generated, I had 1 million accounts and 1 billion
> payments randomly distributed between pairs of accounts.
>
> I have 2 classes in OrientDB: ACCOUNTS (extending V) and PAYMENT
> (extending E). There is a UNIQUE_HASH_INDEX on ACCOUNTS for the account
> number (a string).
>
> We're using OrientDB 2.2.7.
>
> My batch size is 5k and I am using the "remote" protocol to connect to our
> cluster.
>
> I'm using JDK 8 and my 3 boxes are beefy machines (32 cores each) but
> without SSDs. I wrote the importing code myself but did nothing 'clever' (I
> think) and used the Graph API.
> This client code has been given lots of
> memory and using jstat I can see it is not excessively GCing.
>
> So, my questions are:
>
> 1. what kind of performance can I realistically expect and can I improve
> what I have at the moment?
>
> 2. what kind of degradation should I expect as the graph grows?
>
> Thanks, guys.
>
> Phillip
>
> --
>
> ---
> You received this message because you are subscribed to the Google Groups
> "OrientDB" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
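
For reference, suggestion (1) in the reply could be sketched roughly as follows. This is only an illustration of the grouping logic, not OrientDB code: it assumes the payments file has already been sorted by its first column (for example with `sort -k1,1`), that the three fields are separated by whitespace, and that each yielded block would become one transaction in the real importer. The names `parse` and `transactions` are made up for this sketch, and Python is used only to keep it short; the same loop structure should carry over to a Java importer using the Graph API.

```python
from itertools import groupby, islice
from typing import Iterable, Iterator, List, Tuple

# (from_account, to_account, amount), all kept as strings in this sketch.
Payment = Tuple[str, str, str]


def parse(lines: Iterable[str]) -> Iterator[Payment]:
    """Split each line into FROM_ACCOUNT, TO_ACCOUNT, AMOUNT (whitespace-separated is an assumption)."""
    for line in lines:
        fields = line.split()
        if len(fields) == 3:  # silently skip blank or malformed lines
            yield (fields[0], fields[1], fields[2])


def transactions(payments: Iterable[Payment], max_size: int = 5000) -> Iterator[List[Payment]]:
    """Yield one block per transaction.

    Input must already be sorted by FROM_ACCOUNT. A block never mixes two
    FROM_ACCOUNTs, and an account with more than max_size payments is split
    into several blocks of at most max_size each.
    """
    for _, group in groupby(payments, key=lambda p: p[0]):
        while True:
            block = list(islice(group, max_size))
            if not block:
                break
            yield block


if __name__ == "__main__":
    sample = ["A1 B7 10.00", "A1 C3 5.50", "A2 B7 1.25"]
    for block in transactions(parse(sample)):
        # A real importer would look up the FROM_ACCOUNT vertex once here,
        # create one edge per payment in the block, and then commit.
        print(block[0][0], len(block))
```

The 5000 default mirrors the "about 5k per transaction" figure from the reply; with several servers inserting concurrently, a much smaller `max_size` such as 16 would match suggestion (3).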
