Re: [orientdb] Performance of Distributed (3 nodes) cluster with one billion edges

Luca Garulli Wed, 14 Sep 2016 08:36:11 -0700

Hi Phillip,

I remember that call :-) I have a few questions for you:

   1. Were those numbers measured while running with multiple servers?
   2. How many?
   3. I guess you are using the default configuration, right?
   4. How are the ACCOUNT vertices distributed in terms of the number of
   edges? For example, do 90% of the ACCOUNTs have fewer than 20 PAYMENT
   edges?

Below are my first 3 suggestions for running faster. I hope they can be
applied to your case.

(1) Group creation of edges

If you read the file line by line and look up or create an ACCOUNT vertex
for each line, the same vertex is looked up and saved many times, because
its edges are added incrementally. It's much more efficient to *order the
file by FROM_ACCOUNT* (you could use the Linux "*sort*" command) and keep
the transaction open until the FROM_ACCOUNT changes.

This way you avoid updating the same vertex multiple times, because all of
its edges are created in the same transaction. In OrientDB transactions
consume heap, so unless you have many thousands of elements this is fine;
otherwise, you should split the work into blocks of about 5k per
transaction.
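
A minimal sketch of the grouping described above, written in Python for
brevity (the same structure should carry over to the Java Graph API). It
assumes the file is already sorted by FROM_ACCOUNT and that the three fields
are whitespace-separated. `commit_batch` is a hypothetical stand-in for "open
a transaction, look up the source vertex once, create these edges, commit";
it is not an OrientDB call.

```python
from itertools import groupby, islice

MAX_EDGES_PER_TX = 5000  # "about 5k per transaction", as suggested above


def parse(line):
    """Split one 'FROM_ACCOUNT TO_ACCOUNT AMOUNT' line into its three fields."""
    from_account, to_account, amount = line.split()
    return from_account, to_account, amount


def grouped_batches(lines, max_edges=MAX_EDGES_PER_TX):
    """Yield (from_account, edges) batches from lines sorted by FROM_ACCOUNT.

    Every batch holds edges of a single source account and is capped at
    max_edges, so a very busy account is split over several transactions.
    """
    rows = (parse(line) for line in lines if line.strip())
    for from_account, group in groupby(rows, key=lambda row: row[0]):
        while True:
            chunk = [(to, amount) for _, to, amount in islice(group, max_edges)]
            if not chunk:
                break
            yield from_account, chunk


def load(lines, commit_batch):
    """Call the (hypothetical) commit_batch once per batch, i.e. once per transaction."""
    for from_account, edges in grouped_batches(lines):
        commit_batch(from_account, edges)
```

With 1 million accounts and 1 billion payments, a source account has about
1,000 outgoing edges on average, so most accounts would fit in a single
transaction under the 5k cap.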

(2) OrientDB Box

This optimization cuts out the *TCP/IP* overhead by *embedding* the database
in the same JVM as your application. This is what we call "OrientDB Box".
Even if this looks "weird", many users run this way. Look at
http://orientdb.com/orientdb-embedded/. Running embedded still allows you to
replicate OrientDB to other standalone servers or other OrientDB boxes.

The rest of the application remains the same, but the URL is
*plocal:<database-path>* instead of *remote:*. You could try it to see how
much the network latency between client and server affects your numbers.

(3) Concurrency

If you are using multiple servers as masters for your inserts, I suggest
keeping the transaction size much smaller than 5K. You could try with just
*16* (your cores/2) and then increase it step by step, checking the
throughput each time. This is because OrientDB distributes the clusters in
use, but if concurrent distributed transactions work on the same clusters,
there will be a lot of waiting because of internal locking.
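
One way to run that experiment is sketched below in Python: start at 16 and
double the transaction size while the throughput keeps improving. `run_trial`
is a hypothetical callable that you would supply; it inserts a sample of edges
with the given transaction size and returns edges per second. Nothing here is
an OrientDB API, and the doubling schedule and the upper limit of 5000 are
assumptions.

```python
def tune_tx_size(run_trial, start=16, limit=5000):
    """Return (best_size, best_rate) found by doubling from `start`.

    Stops at the first size that is not faster than the best one seen so
    far, or once the next size would exceed `limit`.
    """
    best_size, best_rate = start, run_trial(start)
    size = start * 2
    while size <= limit:
        rate = run_trial(size)
        if rate <= best_rate:
            break  # throughput stopped improving, keep the previous best
        best_size, best_rate = size, rate
        size *= 2
    return best_size, best_rate
```

Each trial should insert into a comparable amount of existing data, since the
insert rate reported in the original message drops as the graph grows.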

Best Regards,

Luca Garulli
Founder & CEO
OrientDB LTD <http://orientdb.com/>


On 14 September 2016 at 05:17, Phillip Henry <[email protected]> wrote:

> Hi, guys.
>
> I'm conducting a proof-of-concept for a large bank (Luca, we had a 'phone
> conf on August 5...) and I'm trying to bulk insert a humongous amount of
> data: 1 million vertices and 1 billion edges.
>
> Firstly, I'm impressed by how easy it was to configure a cluster.
> However, the performance of batch inserting is bad (and seems to get
> considerably worse as I add more data). It starts at about 2k
> vertices-and-edges per second and deteriorates to about 500/second after
> only about 3 million edges have been added. This also takes ~ 30 minutes.
> Needless to say, 1 billion payments (edges) will take over a week at
> this rate.
>
> This is a show-stopper for us.
>
> My data model is simply payments between accounts and I store it in one
> large file. It's just 3 fields and looks like:
>
> FROM_ACCOUNT TO_ACCOUNT AMOUNT
>
> In the test data I generated, I had 1 million accounts and 1 billion
> payments randomly distributed between pairs of accounts.
>
> I have 2 classes in OrientDB: ACCOUNTS (extending V) and PAYMENT
> (extending E). There is a UNIQUE_HASH_INDEX on ACCOUNTS for the account
> number (a string).
>
> We're using OrientDB 2.2.7.
>
> My batch size is 5k and I am using the "remote" protocol to connect to our
> cluster.
>
> I'm using JDK 8 and my 3 boxes are beefy machines (32 cores each) but
> without SSDs. I wrote the importing code myself but did nothing 'clever' (I
> think) and used the Graph API. This client code has been given lots of
> memory and using jstat I can see it is not excessively GCing.
>
> So, my questions are:
>
> 1. what kind of performance can I realistically expect and can I improve
> what I have at the moment?
>
> 2. what kind of degradation should I expect as the graph grows?
>
> Thanks, guys.
>
> Phillip
>
>
>
> --
>
> ---
> You received this message because you are subscribed to the Google Groups
> "OrientDB" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>

