On 15 September 2016 at 09:54, Phillip Henry <phillhe...@gmail.com> wrote:

> Hi, Luca.
>

Hi Phillip,

3. Yes, default configuration. Apart from adding an index for ACCOUNTS, I
> did nothing further.
>

OK, so you have writeQuorum="majority", which on a 3-node cluster means 2
synchronous writes and 1 asynchronous write per transaction.
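(For reference, those defaults live in
config/default-distributed-db-config.json. An excerpt of the relevant
settings, assuming the stock 2.2 file:)

{
  "autoDeploy": true,
  "readQuorum": 1,
  "writeQuorum": "majority",
  "servers": { "*": "master" }
}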


> 4. Good question. With real data, we expect it to be as you suggest: some
> nodes with the majority of the payments (eg, supermarkets). However, for
> the test data, payments were assigned randomly and, therefore, should be
> uniformly distributed.
>

What's the average number of edges per vertex? <10, <50, <200, <1000?


> 2. Yes, I tried plocal minutes after posting (d'oh!). I saw a good
> improvement. It started about 3 times faster and got faster still (about 10
> times faster) by the time I checked this morning on a job running
> overnight. However, even though it is now running at about 7k transactions
> per second, a billion edges is still going to take about 40 hours. So, I
> ask myself: is there any way I can make it faster still?
>

What's missing here is the use of an AUTO-SHARDING INDEX. Example:

// Create a UNIQUE auto-sharding index on Account.number; its entries are
// partitioned across all the clusters (shards) of the Account class.
accountClass.createIndex("Account.number",
    OClass.INDEX_TYPE.UNIQUE.toString(), (OProgressListener) null,
    (ODocument) null, "AUTOSHARDING", new String[] { "number" });

This way you should get more parallelism, because the index is
distributed across all the shards (clusters) of the Account class. You
should have 32 of them by default, because you have 32 cores.
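You can quickly verify the cluster count (assuming the same accountClass
handle as above):

// How many clusters back the Account class? Expect ~32 on a 32-core box.
System.out.println("Account clusters: " + accountClass.getClusterIds().length);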

Please let me know whether sorting the from_accounts, together with this
change, makes it much faster.

This is the best you can get out of the box. Pushing the numbers up is
slightly more complicated: you must make sure that transactions run in
parallel and are not serialized. You can do that by playing with internal
OrientDB settings (mainly the distributed workerThreads) and by having
many clusters per class (you could try 128 first and see how it goes).
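For the clusters part, a minimal sketch (the class name and the
cluster-naming scheme are just illustrative):

// Grow the Account class beyond the default one-cluster-per-core so that
// concurrent transactions can target different clusters.
OClass accountClass = db.getMetadata().getSchema().getClass("Account");
for (int i = accountClass.getClusterIds().length; i < 128; i++) {
  accountClass.addCluster("account_" + i); // cluster names are arbitrary
}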


> I assume when I start the servers up in distributed mode once more, the
> data will then be distributed across all nodes in the cluster?
>

That's right.


> 3. I'll return to concurrent, remote inserts when this job has finished.
> Hopefully, a smaller batch size will mean there is no degradation in
> performance either... FYI: with a somewhat unscientific approach, I was
> polling the server JVM with JStack and saw only a single thread doing all
> the work and it *seemed* to spend a lot of its time in ODirtyManager on
> collection manipulation.
>

I think that's because you didn't use the AUTO-SHARDING index.
Furthermore, running distributed unfortunately means the tree-based
RidBag is not available (we will support it in the future), so every
change to the edges burns a lot of CPU unmarshalling and marshalling the
entire edge list every time you update a vertex. That's why I recommended
sorting the vertices.
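To illustrate the sorting idea (a hypothetical sketch: the file layout
comes from your original post, and for a billion rows you would swap the
in-memory sort for an external, on-disk one):

import java.nio.file.*;
import java.util.*;

// Pre-sort the payment file by FROM_ACCOUNT so all edges of a vertex
// arrive consecutively and its edge list is rewritten in one burst.
public class SortPayments {
  public static void main(String[] args) throws Exception {
    List<String> lines = Files.readAllLines(Paths.get("payments.txt"));
    lines.sort(Comparator.comparing(l -> l.split("\\s+")[0])); // FROM_ACCOUNT
    Files.write(Paths.get("payments-sorted.txt"), lines);
  }
}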


> I totally appreciate that performance tuning is an empirical science, but
> do you have any opinions as to which would probably be faster:
> single-threaded plocal or multithreaded remote?
>

With v2.2 you can go parallel by using the tips above. Replication
certainly has a cost: I'm sure you can go much faster with just one node,
then start the other 2 nodes to have the database replicated
automatically. At least for the first massive insertion.
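A minimal sketch of that first single-node load (the database path is
illustrative, and declareIntent() with OIntentMassiveInsert is a general
OrientDB bulk-load hint I'm adding as a suggestion):

// First massive insertion on one node over plocal, replication off.
OrientGraphFactory factory =
    new OrientGraphFactory("plocal:/data/databases/payments");
OrientGraphNoTx g = factory.getNoTx(); // non-transactional bulk load
g.getRawGraph().declareIntent(new OIntentMassiveInsert());
try {
  // ... insert vertices and edges in batches here ...
} finally {
  g.shutdown();
}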


>
> Regards,
>
> Phillip
>

Luca



>
> On Wednesday, September 14, 2016 at 3:48:56 PM UTC+1, Phillip Henry wrote:
>>
>> Hi, guys.
>>
>> I'm conducting a proof-of-concept for a large bank (Luca, we had a 'phone
>> conf on August 5...) and I'm trying to bulk insert a humongous amount of
>> data: 1 million vertices and 1 billion edges.
>>
>> Firstly, I'm impressed by how easy it was to configure a cluster.
>> However, the performance of batch inserting is bad (and seems to get
>> considerably worse as I add more data). It starts at about 2k
>> vertices-and-edges per second and deteriorates to about 500/second after
>> only about 3 million edges have been added. This also takes ~ 30 minutes.
>> Needless to say, 1 billion payments (edges) will take over a week at
>> this rate.
>>
>> This is a show-stopper for us.
>>
>> My data model is simply payments between accounts and I store it in one
>> large file. It's just 3 fields and looks like:
>>
>> FROM_ACCOUNT TO_ACCOUNT AMOUNT
>>
>> In the test data I generated, I had 1 million accounts and 1 billion
>> payments randomly distributed between pairs of accounts.
>>
>> I have 2 classes in OrientDB: ACCOUNTS (extending V) and PAYMENT
>> (extending E). There is a UNIQUE_HASH_INDEX on ACCOUNTS for the account
>> number (a string).
>>
>> We're using OrientDB 2.2.7.
>>
>> My batch size is 5k and I am using the "remote" protocol to connect to
>> our cluster.
>>
>> I'm using JDK 8 and my 3 boxes are beefy machines (32 cores each) but
>> without SSDs. I wrote the importing code myself but did nothing 'clever' (I
>> think) and used the Graph API. This client code has been given lots of
>> memory and, using jstat, I can see it is not GCing excessively.
>>
>> So, my questions are:
>>
>> 1. what kind of performance can I realistically expect and can I improve
>> what I have at the moment?
>>
>> 2. what kind of degradation should I expect as the graph grows?
>>
>> Thanks, guys.
>>
>> Phillip
>>
>>
>>
