Hi, Andrey. I was using 2.2.10, but just to be sure, I ran it a second time, making sure that 2.2.10 was the first thing in my classpath, and I'm afraid I saw it again. It's quite reproducible (anywhere between 200 and 250 million edges).
Regards,

Phillip

On Monday, September 26, 2016 at 9:06:44 AM UTC+1, Andrey Lomakin wrote:

> Hi,
>
> I have looked at your thread dump; we had already identified and fixed your issue in the 2.2.9 version. So if you use 2.2.10 (the latest one), you will not experience this problem.
>
> I strongly recommend using the 2.2.10 version because several deadlocks were fixed in 2.2.9, and 2.2.10 also contains a few minor optimizations.
>
> On Fri, Sep 23, 2016 at 6:51 PM Phillip Henry <phill...@gmail.com> wrote:
>
>> Hi, Luca.
>>
>> > How many GB?
>>
>> The input file is 22GB of text.
>>
>> > If the file is ordered ...
>>
>> You are only sorting by the first account. The second account can be anywhere in the entire range. My understanding is that both vertices are updated when an edge is written. If this is true, will there not be potential contention when the "to" vertex is updated?
>>
>> > OGraphBatchInsert ... keeps everything in RAM before flushing
>>
>> I assume I will still have to write retry code in the event of a collision (see above)?
>>
>> > You can use support --at- orientdb.com ...
>>
>> Sent.
>>
>> Regards,
>>
>> Phill
>>
>> On Friday, September 23, 2016 at 4:06:49 PM UTC+1, l.garulli wrote:
>>
>>> On 23 September 2016 at 03:50, Phillip Henry <phill...@gmail.com> wrote:
>>>
>>>> > How big is the file that sort cannot write?
>>>>
>>>> One bil-ee-on lines... :-P
>>>
>>> How many GB?
>>>
>>>> > ...This should help a lot.
>>>>
>>>> The trouble is that the size of a block of contiguous accounts in the real data is non-uniform (even if it might be uniform in my test data). Therefore, it is highly likely that a contiguous block of account numbers will span 2 or more batches. This will lead to a lot of contention. In your example, if Account 2 spills over into the next batch, chances are I'll have to roll back that batch.
>>>>
>>>> Don't you also have a problem that if X, Y, Z and W in your example are account numbers in the next batch, you'll also get contention? Admittedly, randomization doesn't solve this problem either.
>>>
>>> If the file is ordered, you could have X threads (where X is the number of cores) that parse the file non-sequentially. For example, with 4 threads, you could start the parsing this way:
>>>
>>> Thread 1 starts from 0
>>> Thread 2 starts from length * 1/4
>>> Thread 3 starts from length * 2/4
>>> Thread 4 starts from length * 3/4
>>>
>>> Of course, the parsing should skip ahead to the next line terminator (CR+LF) if it's a CSV. It requires some lines of code, but you could avoid many conflicts.
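(For reference, a minimal untested sketch of this partitioned parsing, assuming a plain-text file with one payment per line; the per-line processing is left as a stub:)

import java.io.File;
import java.io.RandomAccessFile;

// Each worker parses its own byte range of the file, aligning to line
// boundaries so that a line spanning two ranges is handled exactly once.
public class PartitionedParser implements Runnable {
    private final String path;
    private final long start, end; // byte offsets of this worker's slice

    PartitionedParser(String path, long start, long end) {
        this.path = path; this.start = start; this.end = end;
    }

    public void run() {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(start);
            if (start > 0) raf.readLine(); // skip the partial line; the previous worker finishes it
            String line;
            while (raf.getFilePointer() <= end && (line = raf.readLine()) != null) {
                process(line); // parse "FROM_ACCOUNT TO_ACCOUNT AMOUNT" and write the edge
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private void process(String line) { /* insert vertices/edges here */ }

    public static void main(String[] args) {
        String path = args[0];
        int threads = Runtime.getRuntime().availableProcessors();
        long length = new File(path).length(), slice = length / threads;
        for (int i = 0; i < threads; i++) {
            long from = i * slice;
            long to = (i == threads - 1) ? length : (i + 1) * slice;
            new Thread(new PartitionedParser(path, from, to)).start();
        }
    }
}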
>>>> > you can use the special Batch Importer: OGraphBatchInsert
>>>>
>>>> Would this not be subject to the same contention problems? At what point is it flushed to disk? (Obviously, it can't live in heap forever.)
>>>
>>> It keeps everything in RAM before flushing. Up to a few hundred million vertices/edges should be fine if you have a lot of heap, like 58GB (and 4GB of DISKCACHE). It depends on the number of attributes you have.
>>>
>>>> > You should definitely use transactions with a batch size of 100 items.
>>>>
>>>> I thought I read somewhere else (can't find the link at the moment) that you said to only use transactions when using the remote protocol?
>>>
>>> This was true before v2.2. With v2.2 the management of the transaction is parallel and very light. Transactions work well with graphs because every addEdge() operation is 2 updates, and having a TX that works like a batch really helps.
>>>
>>>> > Please use the latest 2.2.10. ... try to define 50GB of DISKCACHE and 14GB of Heap
>>>>
>>>> Will do on the next run.
>>>>
>>>> > If it happens again, could you please send a thread dump?
>>>>
>>>> I have the full thread dump, but it's on my work machine, so I can't post it in this forum (all access to Google Groups is banned by the bank, so I am writing this on my personal computer). Happy to email it to you. Which email shall I use?
>>>
>>> You can use support --at- orientdb.com, referring to this thread in the subject.
>>>
>>>> Phill
>>>
>>> Best Regards,
>>>
>>> Luca Garulli
>>> Founder & CEO
>>> OrientDB LTD <http://orientdb.com/>
>>>
>>> Want to share your opinion about OrientDB? Rate & review us at Gartner's Software Review <https://www.gartner.com/reviews/survey/home>
>>>
>>>> On Friday, September 23, 2016 at 7:41:29 AM UTC+1, l.garulli wrote:
>>>>
>>>>> On 23 September 2016 at 00:49, Phillip Henry <phill...@gmail.com> wrote:
>>>>>
>>>>>> Hi, Luca.
>>>>>
>>>>> Hi Phillip.
>>>>>
>>>>>> I have:
>>>>>>
>>>>>> 4. Sorting is an overhead, albeit outside of Orient. Using the Unix sort command failed with "No space left on device". Oops. OK, so I ran my program to generate the data again; this time it is ordered by the first account number. Performance was much slower, as there appeared to be a lot of contention for this account (i.e., all writes were contending for this account, even if the other account had less contention). More randomized data was faster.
>>>>>
>>>>> How big is the file that sort cannot write? Anyway, if you have the accounts sorted, you should have transactions of about 100 items where the bank account and its edges are in the same transaction. This should help a lot. Example:
>>>>>
>>>>> Account 1 -> Payment 1 -> Account X
>>>>> Account 1 -> Payment 2 -> Account Y
>>>>> Account 1 -> Payment 3 -> Account Z
>>>>> Account 2 -> Payment 1 -> Account X
>>>>> Account 2 -> Payment 1 -> Account W
>>>>>
>>>>> If the transaction batch is 5 (I suggest you start with 100), all these operations are executed in one transaction. If another thread has:
>>>>>
>>>>> Account 99 -> Payment 1 -> Account W
>>>>>
>>>>> it could conflict because of the shared Account W.
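(A sketch of such a batch-per-transaction load with retry-on-conflict, assuming the Account.number index and PAYMENT edge class from this thread, and rows already parsed into from/to/amount triples:)

import com.orientechnologies.orient.core.exception.OConcurrentModificationException;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;
import java.util.List;

public class BatchWriter {
    // Writes one batch (~100 rows) in a single transaction; if another thread
    // updated a shared vertex concurrently, rolls back and retries the batch.
    static void writeBatch(OrientGraphFactory factory, List<String[]> rows) {
        while (true) {
            OrientGraph g = factory.getTx();
            try {
                for (String[] row : rows) { // row = {from, to, amount}
                    Vertex from = g.getVertexByKey("Account.number", row[0]);
                    Vertex to = g.getVertexByKey("Account.number", row[1]);
                    Edge payment = from.addEdge("PAYMENT", to);
                    payment.setProperty("amount", Double.parseDouble(row[2]));
                }
                g.commit();
                return; // batch committed
            } catch (OConcurrentModificationException e) {
                g.rollback(); // conflict on a shared vertex: retry the whole batch
            } finally {
                g.shutdown();
            }
        }
    }
}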
>>>>> If you can export Account IDs that are numeric and incremental, you can use the special Batch Importer: OGraphBatchInsert. Example:
>>>>>
>>>>> OGraphBatchInsert batch = new OGraphBatchInsert("plocal:/temp/mydb", "admin", "admin");
>>>>> batch.begin();
>>>>>
>>>>> batch.createEdge(0L, 1L, null); // CREATE AN EDGE BETWEEN VERTICES 0 AND 1.
>>>>>                                 // IF THE VERTICES DON'T EXIST, THEY ARE CREATED IMPLICITLY
>>>>> batch.createEdge(1L, 2L, null);
>>>>> batch.createEdge(2L, 0L, null);
>>>>>
>>>>> batch.createVertex(3L); // CREATE A NON-CONNECTED VERTEX
>>>>>
>>>>> Map<String, Object> vertexProps = new HashMap<String, Object>();
>>>>> vertexProps.put("foo", "foo");
>>>>> vertexProps.put("bar", 3);
>>>>> batch.setVertexProperties(0L, vertexProps); // SET PROPERTIES FOR VERTEX 0
>>>>> batch.end();
>>>>>
>>>>> This is blazing fast, but it uses heap, so run it with a lot of it.
>>>>>
>>>>>> 6. I've multithreaded my loader. The details are now:
>>>>>>
>>>>>> - using plocal
>>>>>> - using 30 threads
>>>>>> - not using transactions (OrientGraphFactory.getNoTx)
>>>>>
>>>>> You should definitely use transactions with a batch size of 100 items. This speeds things up.
>>>>>
>>>>>> - retrying forever upon write collisions
>>>>>> - using Orient 2.2.7
>>>>>
>>>>> Please use the latest, 2.2.10.
>>>>>
>>>>>> - using -XX:MaxDirectMemorySize=258040m
>>>>>
>>>>> This is not really important; it's just an upper bound for the JVM. Please set it to 512GB so you can forget about it. The 2 most important values are DISKCACHE and JVM heap. Their sum must be lower than the RAM available in the server before you run OrientDB.
>>>>>
>>>>> If you have 64GB, try to define 50GB of DISKCACHE and 14GB of Heap.
>>>>>
>>>>> If you use the Batch Importer, you should use more Heap and less DISKCACHE.
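(For a 64GB box, that might translate into a launch line like the following; storage.diskCache.bufferSize is OrientDB's disk-cache size in MB, and MyLoader stands in for the actual import class:)

java -Xmx14g \
     -Dstorage.diskCache.bufferSize=51200 \
     -XX:MaxDirectMemorySize=512g \
     -cp orientdb-graphdb-2.2.10.jar:... MyLoader payments.txt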
>>>>>> The good news is I've achieved an initial write throughput of about 30k/second.
>>>>>>
>>>>>> The bad news is I've tried several runs and have only been able to achieve 200mil < number of writes < 300mil.
>>>>>>
>>>>>> The first time I tried it, the loader deadlocked. Using jstack showed that the deadlock was between 3 threads at:
>>>>>> - OOneKeyEntryPerKeyLockManager.acquireLock(OOneKeyEntryPerKeyLockManager.java:173)
>>>>>> - OPartitionedLockManager.acquireExclusiveLock(OPartitionedLockManager.java:210)
>>>>>> - OOneKeyEntryPerKeyLockManager.acquireLock(OOneKeyEntryPerKeyLockManager.java:171)
>>>>>
>>>>> If it happens again, could you please send a thread dump?
>>>>>
>>>>>> The second time, it failed due to a NullPointerException at OByteBufferPool.java:297. I've looked at the code, and the only way I can see this happening is if OByteBufferPool.allocateBuffer throws an error (perhaps an OutOfMemoryError in java.nio.Bits.reserveMemory). This StackOverflow posting (http://stackoverflow.com/questions/8462200/examples-of-forcing-freeing-of-native-memory-direct-bytebuffer-has-allocated-us) seems to indicate that this can happen if the underlying DirectByteBuffer's Cleaner doesn't have its clean() method called.
>>>>>
>>>>> This is because the database was bigger than this setting: -XX:MaxDirectMemorySize=258040m. Please set it to 512GB (see above).
>>>>>
>>>>>> Alternatively, I followed the SO suggestion and lowered the heap space to a mere 1GB (it was 50GB) to make the GC more active. Unfortunately, after a good start, the job is still running some 15 hours later with a hugely reduced write throughput (~7k/s). Jstat shows 4292 full GCs taking a total time of 4597s - not great, but not hugely awful either. At this rate, the remaining 700mil or so payments are going to take another 30 hours.
>>>>>
>>>>> See the suggested settings above.
>>>>>
>>>>>> 7. Even with the highest throughput I have achieved, 30k writes per second, I'm looking at about 20 hours of loading. We've taken the same data and, after trial and error that was not without its own problems, put it into Neo4J in 37 minutes. This is a significant difference. It appears that they are approaching the problem differently to avoid contention on updating the vertices during an edge write.
>>>>>
>>>>> With all these suggestions you should be able to get much better numbers. If you can use the Batch Importer, the numbers should be close to Neo4j's.
>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Phillip
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Luca Garulli
>>>>> Founder & CEO
>>>>> OrientDB LTD <http://orientdb.com/>
>>>>>
>>>>> Want to share your opinion about OrientDB? Rate & review us at Gartner's Software Review <https://www.gartner.com/reviews/survey/home>
>>>>>
>>>>>> On Thursday, September 15, 2016 at 10:06:44 PM UTC+1, l.garulli wrote:
>>>>>>>
>>>>>>> On 15 September 2016 at 09:54, Phillip Henry <phill...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, Luca.
>>>>>>>
>>>>>>> Hi Phillip,
>>>>>>>
>>>>>>>> 3. Yes, default configuration. Apart from adding an index for ACCOUNTS, I did nothing further.
>>>>>>>
>>>>>>> OK, so you have writeQuorum="majority", which means 2 synchronous writes and 1 asynchronous write per transaction.
>>>>>>>
>>>>>>>> 4. Good question. With real data, we expect it to be as you suggest: some nodes with the majority of the payments (e.g., supermarkets). However, for the test data, payments were assigned randomly and, therefore, should be uniformly distributed.
>>>>>>>
>>>>>>> What's your average in terms of number of edges? <10, <50, <200, <1000?
>>>>>>>
>>>>>>>> 2. Yes, I tried plocal minutes after posting (d'oh!). I saw a good improvement. It started about 3 times faster and got faster still (about 10 times faster) by the time I checked this morning on a job running overnight. However, even though it is now running at about 7k transactions per second, a billion edges is still going to take about 40 hours. So I ask myself: is there any way I can make it faster still?
>>>>>>>
>>>>>>> What's missing here is the AUTO-SHARDING INDEX. Example:
>>>>>>>
>>>>>>> accountClass.createIndex("Account.number", OClass.INDEX_TYPE.UNIQUE.toString(), (OProgressListener) null, (ODocument) null, "AUTOSHARDING", new String[] { "number" });
>>>>>>>
>>>>>>> In this way you should go more in parallel, because the index is distributed across all the shards (clusters) of the Account class. You should have 32 of them by default, because you have 32 cores.
>>>>>>>
>>>>>>> Please let me know if, by sorting the from_accounts and with this change, it's much faster.
>>>>>>>
>>>>>>> This is the best you can have out of the box. Pushing the numbers up is slightly more complicated: you should be sure that transactions go in parallel and aren't serialized. This is possible by playing with internal OrientDB settings (mainly the distributed workerThreads) and by having many clusters per class (you could try with 128 first and see how it goes).
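(A sketch of that setup done before the load; the createIndex call mirrors the example above, while the CLUSTERS clause, URL and property type are assumptions:)

import com.orientechnologies.common.listener.OProgressListener;
import com.orientechnologies.orient.core.metadata.schema.OClass;
import com.orientechnologies.orient.core.metadata.schema.OType;
import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.sql.OCommandSQL;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;

public class SchemaSetup {
    public static void main(String[] args) {
        OrientGraphNoTx g = new OrientGraphNoTx("plocal:/temp/mydb");
        try {
            // 128 clusters instead of the default of one per core
            g.command(new OCommandSQL("CREATE CLASS Account EXTENDS V CLUSTERS 128")).execute();
            OClass accountClass = g.getRawGraph().getMetadata().getSchema().getClass("Account");
            accountClass.createProperty("number", OType.STRING);
            // auto-sharding index, spread across the class's clusters
            accountClass.createIndex("Account.number", OClass.INDEX_TYPE.UNIQUE.toString(),
                (OProgressListener) null, (ODocument) null, "AUTOSHARDING", new String[] { "number" });
        } finally {
            g.shutdown();
        }
    }
}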
>>>>>>>> I assume when I start the servers up in distributed mode once more, the data will then be distributed across all nodes in the cluster?
>>>>>>>
>>>>>>> That's right.
>>>>>>>
>>>>>>>> 3. I'll return to concurrent, remote inserts when this job has finished. Hopefully, a smaller batch size will mean there is no degradation in performance either... FYI: with a somewhat unscientific approach, I was polling the server JVM with jstack and saw only a single thread doing all the work, and it *seemed* to spend a lot of its time in ODirtyManager on collection manipulation.
>>>>>>>
>>>>>>> I think it's because you didn't use the AUTO-SHARDING index. Furthermore, running distributed, unfortunately, means the tree-based RidBag is not available (we will support it in the future), so every change to the edges takes a lot of CPU to unmarshal and marshal the entire edge list every time you update a vertex. Hence my recommendation about sorting the vertices.
>>>>>>>
>>>>>>>> I totally appreciate that performance tuning is an empirical science, but do you have any opinions as to which would probably be faster: single-threaded plocal or multithreaded remote?
>>>>>>>
>>>>>>> With v2.2 you can go in parallel by using the tips above. For sure, replication has a cost. I'm sure you can go much faster with just one node and then start the other 2 nodes to have the database replicated automatically, at least for the first massive insertion.
>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Phillip
>>>>>>>
>>>>>>> Luca
>>>>>>>
>>>>>>>> On Wednesday, September 14, 2016 at 3:48:56 PM UTC+1, Phillip Henry wrote:
>>>>>>>>>
>>>>>>>>> Hi, guys.
>>>>>>>>>
>>>>>>>>> I'm conducting a proof-of-concept for a large bank (Luca, we had a 'phone conf on August 5...) and I'm trying to bulk insert a humongous amount of data: 1 million vertices and 1 billion edges.
>>>>>>>>>
>>>>>>>>> Firstly, I'm impressed by how easy it was to configure a cluster. However, the performance of batch inserting is bad (and seems to get considerably worse as I add more data). It starts at about 2k vertices-and-edges per second and deteriorates to about 500/second after only about 3 million edges have been added. This also takes ~30 minutes. Needless to say, 1 billion payments (edges) will take over a week at this rate.
>>>>>>>>>
>>>>>>>>> This is a show-stopper for us.
>>>>>>>>>
>>>>>>>>> My data model is simply payments between accounts, and I store it in one large file. It's just 3 fields and looks like:
>>>>>>>>>
>>>>>>>>> FROM_ACCOUNT TO_ACCOUNT AMOUNT
>>>>>>>>>
>>>>>>>>> In the test data I generated, I had 1 million accounts and 1 billion payments randomly distributed between pairs of accounts.
>>>>>>>>>
>>>>>>>>> I have 2 classes in OrientDB: ACCOUNTS (extending V) and PAYMENT (extending E). There is a UNIQUE_HASH_INDEX on ACCOUNTS for the account number (a string).
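(The schema described here would look something like this with the Graph API; the connection URL is illustrative:)

import com.orientechnologies.orient.core.metadata.schema.OClass;
import com.orientechnologies.orient.core.metadata.schema.OType;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;

public class PocSchema {
    public static void main(String[] args) {
        OrientGraphNoTx g = new OrientGraphNoTx("remote:host1/payments", "admin", "admin");
        try {
            // ACCOUNTS extends V with a UNIQUE_HASH_INDEX on the account number
            OClass accounts = g.createVertexType("ACCOUNTS");
            accounts.createProperty("number", OType.STRING);
            accounts.createIndex("ACCOUNTS.number", OClass.INDEX_TYPE.UNIQUE_HASH_INDEX, "number");
            // PAYMENT extends E
            g.createEdgeType("PAYMENT");
        } finally {
            g.shutdown();
        }
    }
}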
>>>>>>>>> We're using OrientDB 2.2.7.
>>>>>>>>>
>>>>>>>>> My batch size is 5k, and I am using the "remote" protocol to connect to our cluster.
>>>>>>>>>
>>>>>>>>> I'm using JDK 8, and my 3 boxes are beefy machines (32 cores each) but without SSDs. I wrote the importing code myself but did nothing 'clever' (I think) and used the Graph API. This client code has been given lots of memory, and using jstat I can see it is not excessively GCing.
>>>>>>>>>
>>>>>>>>> So, my questions are:
>>>>>>>>>
>>>>>>>>> 1. What kind of performance can I realistically expect, and can I improve what I have at the moment?
>>>>>>>>>
>>>>>>>>> 2. What kind of degradation should I expect as the graph grows?
>>>>>>>>>
>>>>>>>>> Thanks, guys.
>>>>>>>>>
>>>>>>>>> Phillip

> --
> Best regards,
> Andrey Lomakin, R&D lead.
> OrientDB Ltd
>
> twitter: @Andrey_Lomakin
> linkedin: https://ua.linkedin.com/in/andreylomakin
> blogger: http://andreylomakin.blogspot.com/