Hi, Luca.

I have:

1. turned off the other nodes in the cluster, so there should now be no 
replication during the import and no quorum writes either.

2. the average number of edges is about 1000 per vertex. The real data will 
have a similar mean, but the distribution around it will be quite different 
(the test data is much more uniform).

3. I am now auto-sharding on account numbers.

4. sorting is an overhead, albeit one outside of Orient. The Unix sort 
command failed with "No space left on device". Oops. So I re-ran my data 
generator, this time producing the payments ordered by the first (from) 
account number. Loading was much slower: because consecutive payments share 
the same from-account, all writes were contending to update that one vertex, 
even when the other account in the payment was under little contention. More 
randomized data was faster.

5. Yes, replication after writing was automatic and relatively fast (25 
minutes for about 200 million payments). Cool.

6. I've multithreaded my loader (a sketch of the per-thread loop follows this 
list). The details are now:

- using plocal
- using 30 threads
- not using transactions (OrientGraphFactory.getNoTx)
- retrying forever upon write collisions.
- using Orient 2.2.7.
- using -XX:MaxDirectMemorySize=258040m
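
For concreteness, each loader thread does roughly the following (a simplified 
sketch rather than my exact code: the Account/Payment class and property 
names, the getOrCreateAccount helper and the row iterator are illustrative):

import java.util.Iterator;

import com.orientechnologies.common.concur.ONeedRetryException;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;

// Simplified sketch of one loader thread; class/property names are illustrative.
public class PaymentLoaderThread implements Runnable {

    private final OrientGraphFactory factory;   // shared by all 30 threads
    private final Iterator<String[]> rows;      // this thread's slice: [from, to, amount]

    public PaymentLoaderThread(OrientGraphFactory factory, Iterator<String[]> rows) {
        this.factory = factory;
        this.rows = rows;
    }

    @Override
    public void run() {
        // Non-transactional graph: every operation is applied immediately.
        final OrientGraphNoTx graph = factory.getNoTx();
        try {
            while (rows.hasNext()) {
                final String[] row = rows.next();
                while (true) {                  // retry forever on write collisions
                    try {
                        final Vertex from = getOrCreateAccount(graph, row[0]);
                        final Vertex to = getOrCreateAccount(graph, row[1]);
                        graph.addEdge(null, from, to, "Payment")
                             .setProperty("amount", Double.parseDouble(row[2]));
                        break;
                    } catch (ONeedRetryException e) {
                        // Another thread touched one of these vertices; try again.
                    }
                }
            }
        } finally {
            graph.shutdown();
        }
    }

    private Vertex getOrCreateAccount(final OrientGraphNoTx graph, final String number) {
        final Iterator<Vertex> found = graph.getVertices("Account.number", number).iterator();
        return found.hasNext() ? found.next() : graph.addVertex("class:Account", "number", number);
    }
}

The OrientGraphFactory is shared across the 30 threads; each thread takes its 
own OrientGraphNoTx from it and shuts it down when its slice of the file is 
done.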

The good news is I've achieved an initial write throughput of about 
30k/second.

The bad news is that, over several runs, I've only managed somewhere between 
200 million and 300 million writes before each run failed.

The first time I tried it, the loader deadlocked. Using jstack showed that 
the deadlock was between 3 threads at:
- OOneKeyEntryPerKeyLockManager.acquireLock(OOneKeyEntryPerKeyLockManager.java:173)
- OPartitionedLockManager.acquireExclusiveLock(OPartitionedLockManager.java:210)
- OOneKeyEntryPerKeyLockManager.acquireLock(OOneKeyEntryPerKeyLockManager.java:171)

The second time it failed was due to a NullPointerException at 
OByteBufferPool.java:297. I've looked at the code and the only way I can 
see this happening is if OByteBufferPool.allocateBuffer throws an error 
(perhaps an OutOfMemoryError in java.nio.Bits.reserveMemory). This 
StackOverflow posting 
(http://stackoverflow.com/questions/8462200/examples-of-forcing-freeing-of-native-memory-direct-bytebuffer-has-allocated-us) 
seems to indicate that this can happen if the underlying DirectByteBuffer's 
Cleaner doesn't have its clean() method called.
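
One way to test that theory on the next run is to watch the JVM's direct 
buffer pool from inside the loader process. A minimal sketch using the 
standard java.lang.management.BufferPoolMXBean (nothing OrientDB-specific; 
the class name and the 10-second interval are arbitrary):

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Run as a daemon thread inside the loader JVM so it sees the same buffer pools.
public final class DirectMemorySampler implements Runnable {

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // Two pools are reported: "direct" (ByteBuffer.allocateDirect) and "mapped".
            for (BufferPoolMXBean pool :
                    ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                System.out.printf("%-7s buffers=%d used=%d MB capacity=%d MB%n",
                        pool.getName(), pool.getCount(),
                        pool.getMemoryUsed() / (1024 * 1024),
                        pool.getTotalCapacity() / (1024 * 1024));
            }
            try {
                Thread.sleep(10_000);               // sample every 10 seconds (arbitrary)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop sampling
            }
        }
    }
}

If "used" climbs towards the MaxDirectMemorySize value and sits there, that 
would support the OutOfMemoryError-in-reserveMemory explanation.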

As an alternative (rather than trying to force the Cleaner), I followed the 
SO suggestion and lowered the heap space to a mere 1 GB (it was 50 GB) to 
make the GC more active. Unfortunately, after a good start, the job is still 
running some 15 hours later with a hugely reduced write throughput (~7k/s). 
jstat shows 4292 full GCs taking a total of 4597s - not great, but not hugely 
awful either. At this rate, the remaining 700 million or so payments are 
going to take another 30 hours.

7. Even with the highest throughput I have achieved, 30k writes per second, 
I'm looking at about 20 hours of loading. We've taken the same data and, 
after trial and error that was not without its own problems, put it into 
Neo4j in 37 minutes. This is a significant difference. It appears that they 
approach the problem differently in order to avoid contention on updating the 
vertices during an edge write.
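
One thing I may try next to close that gap (purely a sketch, not something I 
have run; the class and method names are made up): route each payment to a 
loader thread chosen by a hash of the from-account, so that threads never 
collide on the same source vertex.

import java.util.ArrayList;
import java.util.List;

// Illustrative only: route each payment to the thread that "owns" its from-account,
// so no two threads ever update the same source vertex's edge list. The to-vertex
// can still be shared between threads, so the retry loop is still needed, but
// collisions should become much rarer.
public final class PaymentPartitioner {

    public static List<List<String[]>> partition(final Iterable<String[]> rows, final int threads) {
        final List<List<String[]>> buckets = new ArrayList<>(threads);
        for (int i = 0; i < threads; i++) {
            buckets.add(new ArrayList<String[]>());
        }
        for (final String[] row : rows) {       // row = [from, to, amount]
            buckets.get(Math.floorMod(row[0].hashCode(), threads)).add(row);
        }
        return buckets;                         // one bucket per loader thread
    }
}

In practice this would feed one bounded queue per loader thread rather than 
materialising a billion rows in memory, but the routing rule is the 
interesting part.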

Thoughts?

Regards,

Phillip


On Thursday, September 15, 2016 at 10:06:44 PM UTC+1, l.garulli wrote:
>
> On 15 September 2016 at 09:54, Phillip Henry <phill...@gmail.com> wrote:
>
>> Hi, Luca.
>>
>
> Hi Phillip,
>
> 3. Yes, default configuration. Apart from adding an index for ACCOUNTS, I 
>> did nothing further.
>>
>
> OK, so you have writeQuorum="majority", which means 2 synchronous writes 
> and 1 asynchronous write per transaction.
>  
>
>> 4. Good question. With real data, we expect it to be as you suggest: some 
>> nodes with the majority of the payments (eg, supermarkets). However, for 
>> the test data, payments were assigned randomly and, therefore, should be 
>> uniformly distributed.
>>
>
> What's your average in terms of number of edges? <10, <50, <200, <1000?
>  
>
>> 2. Yes, I tried plocal minutes after posting (d'oh!). I saw a good 
>> improvement. It started about 3 times faster and got faster still (about 10 
>> times faster) by the time I checked this morning on a job running 
>> overnight. However, even though it is now running at about 7k transactions 
>> per second, a billion edges is still going to take about 40 hours. So, I 
>> ask myself: is there any way I can make it faster still?
>>
>
> Here it's missing the usage of AUTO-SHARDING INDEX. Example:
>
> accountClass.createIndex("Account.number",
>     OClass.INDEX_TYPE.UNIQUE.toString(), (OProgressListener) null,
>     (ODocument) null, "AUTOSHARDING", new String[] { "number" });
>
> In this way you should get more parallelism, because the index is 
> distributed across all the shards (clusters) of the Account class. You 
> should have 32 of them by default because you have 32 cores. 
>
> Please let me know if, by sorting the from_accounts and making this change, 
> it's much faster.
>
> This is the best you can have out of the box. To push the numbers up is 
> slightly more complicated: you should make sure that transactions go in 
> parallel and aren't serialized. This is possible by playing with internal 
> OrientDB settings (mainly the distributed workerThreads) and by having many 
> clusters per class (you could try 128 first and see how it goes).
>  
>
>> I assume when I start the servers up in distributed mode once more, the 
>> data will then be distributed across all nodes in the cluster?
>>
>
> That's right.
>  
>
>> 3. I'll return to concurrent, remote inserts when this job has finished. 
>> Hopefully, a smaller batch size will mean there is no degradation in 
>> performance either... FYI: with a somewhat unscientific approach, I was 
>> polling the server JVM with JStack and saw only a single thread doing all 
>> the work and it *seemed* to spend a lot of its time in ODirtyManager on 
>> collection manipulation.
>>
>
> I think it's because you didn't use the AUTO-SHARDING index. Furthermore, 
> running distributed unfortunately means the tree-based RidBag is not 
> available (we will support it in the future), so every change to the edges 
> takes a lot of CPU to unmarshal and marshal the entire edge list every time 
> you update a vertex. That's why I recommended sorting the vertices.
>  
>
>> I totally appreciate that performance tuning is an empirical science, but 
>> do you have any opinions as to which would probably be faster: 
>> single-threaded plocal or multithreaded remote? 
>>
>
> With v2.2 you can go in parallel by using the tips above. For sure, the 
> replication has a cost. I'm sure you can go much faster with just one node 
> and then start the other 2 nodes to have the database replicated 
> automatically, at least for the first massive insertion.
>  
>
>>
>> Regards,
>>
>> Phillip
>>
>
> Luca
>
>  
>
>>
>> On Wednesday, September 14, 2016 at 3:48:56 PM UTC+1, Phillip Henry wrote:
>>>
>>> Hi, guys.
>>>
>>> I'm conducting a proof-of-concept for a large bank (Luca, we had a 
>>> 'phone conf on August 5...) and I'm trying to bulk insert a humongous 
>>> amount of data: 1 million vertices and 1 billion edges.
>>>
>>> Firstly, I'm impressed by how easy it was to configure a cluster. 
>>> However, the performance of batch inserting is bad (and seems to get 
>>> considerably worse as I add more data). It starts at about 2k 
>>> vertices-and-edges per second and deteriorates to about 500/second after 
>>> only about 3 million edges have been added, which takes ~30 minutes. 
>>> Needless to say, 1 billion payments (edges) will take over a week at this 
>>> rate. 
>>>
>>> This is a show-stopper for us.
>>>
>>> My data model is simply payments between accounts and I store it in one 
>>> large file. It's just 3 fields and looks like:
>>>
>>> FROM_ACCOUNT TO_ACCOUNT AMOUNT
>>>
>>> In the test data I generated, I had 1 million accounts and 1 billion 
>>> payments randomly distributed between pairs of accounts.
>>>
>>> I have 2 classes in OrientDB: ACCOUNTS (extending V) and PAYMENT 
>>> (extending E). There is a UNIQUE_HASH_INDEX on ACCOUNTS for the account 
>>> number (a string).
>>>
>>> We're using OrientDB 2.2.7.
>>>
>>> My batch size is 5k and I am using the "remote" protocol to connect to 
>>> our cluster.
>>>
>>> I'm using JDK 8 and my 3 boxes are beefy machines (32 cores each) but 
>>> without SSDs. I wrote the importing code myself but did nothing 'clever' (I 
>>> think) and used the Graph API. This client code has been given lots of 
>>> memory and using jstat I can see it is not excessively GCing.
>>>
>>> So, my questions are:
>>>
>>> 1. what kind of performance can I realistically expect and can I improve 
>>> what I have at the moment?
>>>
>>> 2. what kind of degradation should I expect as the graph grows?
>>>
>>> Thanks, guys.
>>>
>>> Phillip
>>>
>>>
>>>
