On 23 September 2016 at 03:50, Phillip Henry <[email protected]> wrote:
>> How big is your file the sort cannot write?
>
> One bil-ee-on lines... :-P

How many GB?

>> ...This should help a lot.
>
> The trouble is that the size of a block of contiguous accounts in the real data is not uniform (even if it might be with my test data). Therefore, it is highly likely that a contiguous block of account numbers will span two or more batches. This will lead to a lot of contention. In your example, if Account 2 spills over into the next batch, chances are I'll have to roll back that batch.
>
> Don't you also have a problem that if X, Y, Z and W in your example are account numbers in the next batch, you'll also get contention? Admittedly, randomization doesn't solve this problem either.

If the file is ordered, you could have X threads (where X is the number of cores) that parse the file non-sequentially. For example, with 4 threads you could start the parsing this way:

Thread 1 starts from 0
Thread 2 starts from length * 1/4
Thread 3 starts from length * 2/4
Thread 4 starts from length * 3/4

Of course, each parser should read on to the next CR+LF if it's a CSV. It requires a few lines of code, but you could avoid many conflicts.
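A minimal sketch of that chunked parsing in plain Java (the class and method names are illustrative, not part of OrientDB; it assumes one payment record per line): each worker seeks to its byte offset, discards the partial line it lands on (the previous worker owns it), and keeps reading until it has passed the end of its own chunk.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ChunkedLoader {

    static void parseChunk(String path, long start, long end) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(start);
            if (start > 0) {
                file.readLine(); // skip to the start of the next complete line
            }
            String line;
            while (file.getFilePointer() <= end && (line = file.readLine()) != null) {
                handleLine(line); // e.g. parse "FROM_ACCOUNT TO_ACCOUNT AMOUNT" and insert it
            }
        }
    }

    static void handleLine(String line) {
        // placeholder: parse the record and write it to the database
    }

    public static void main(String[] args) throws Exception {
        String path = args[0];
        int workers = Runtime.getRuntime().availableProcessors();
        long length = new File(path).length();
        long chunkSize = length / workers;

        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            long start = i * chunkSize;
            long end = (i == workers - 1) ? length : (i + 1) * chunkSize;
            threads[i] = new Thread(() -> {
                try {
                    parseChunk(path, start, end);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }
}

(RandomAccessFile.readLine() reads a byte at a time, so a real loader would use buffered reads per chunk; the offset arithmetic is the point here.)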
>> you can use the special Batch Importer: OGraphBatchInsert
>
> Would this not be subject to the same contention problems?
> At what point is it flushed to disk? (Obviously, it can't live in heap forever.)

It keeps everything in RAM before flushing. Up to a few hundred million vertices/edges should be fine if you have a lot of heap, like 58GB (and 4GB of DISKCACHE). It depends on the number of attributes you have.

>> You should definitely use transactions with a batch size of 100 items.
>
> I thought I read somewhere else (can't find the link at the moment) that you said only to use transactions when using the remote protocol?

This was true before v2.2. With v2.2 the management of the transaction is parallel and very light. Transactions work well with graphs because every addEdge() operation is 2 updates, and having a TX that works like a batch really helps.
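A minimal sketch of such a batched, retrying loader thread using the Graph API: it assumes the ACCOUNTS vertices already exist and are indexed on a "number" property (that property name is an assumption for illustration), commits in batches and retries the whole batch on an optimistic-locking conflict, matching the "retry forever upon write collisions" approach described later in this thread.

import com.orientechnologies.orient.core.exception.OConcurrentModificationException;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;

import java.util.List;

public class BatchedLoader {

    // A payment record as read from the file.
    static class Payment {
        final String from, to;
        final double amount;
        Payment(String from, String to, double amount) {
            this.from = from; this.to = to; this.amount = amount;
        }
    }

    // Commits one batch (e.g. ~100 payments) and retries it if another thread
    // touched one of the same account vertices.
    static void loadBatch(OrientGraphFactory factory, List<Payment> batch) {
        while (true) {
            OrientGraph graph = factory.getTx();
            try {
                for (Payment p : batch) {
                    Vertex from = graph.getVertices("ACCOUNTS.number", p.from).iterator().next();
                    Vertex to   = graph.getVertices("ACCOUNTS.number", p.to).iterator().next();
                    from.addEdge("PAYMENT", to).setProperty("amount", p.amount);
                }
                graph.commit();
                return;               // batch committed, we're done
            } catch (OConcurrentModificationException e) {
                graph.rollback();     // optimistic conflict: retry the whole batch
            } finally {
                graph.shutdown();
            }
        }
    }
}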
>> Please use the latest 2.2.10. ... try to define 50GB of DISKCACHE and 14GB of Heap
>
> Will do on the next run.

>> If it happens again, could you please send a thread dump?
>
> I have the full thread dump but it's on my work machine so can't post it in this forum (all access to Google Groups is banned by the bank, so I am writing this on my personal computer). Happy to email them to you. Which email shall I use?

You can use support --at- orientdb.com, referring to this thread in the subject.

> Phill

Best Regards,

Luca Garulli
Founder & CEO
OrientDB LTD <http://orientdb.com/>

Want to share your opinion about OrientDB?
Rate & review us at Gartner's Software Review <https://www.gartner.com/reviews/survey/home>


> On Friday, September 23, 2016 at 7:41:29 AM UTC+1, l.garulli wrote:
>
>> On 23 September 2016 at 00:49, Phillip Henry <[email protected]> wrote:
>>
>>> Hi, Luca.
>>
>> Hi Phillip.
>>
>>> I have:
>>>
>>> 4. sorting is an overhead, albeit outside of Orient. Using the Unix sort command failed with "No space left on device". Oops. OK, so I ran my program to generate the data again; this time it is ordered by the first account number. Performance was much slower as there appeared to be a lot of contention for this account (i.e., all writes were contending for this account, even if the other account had less contention). More randomized data was faster.
>>
>> How big is your file the sort cannot write?
>>
>> Anyway, if you have the accounts sorted, you should have transactions of about 100 items where the bank account and edges are in the same transaction. This should help a lot. Example:
>>
>> Account 1 -> Payment 1 -> Account X
>> Account 1 -> Payment 2 -> Account Y
>> Account 1 -> Payment 3 -> Account Z
>> Account 2 -> Payment 1 -> Account X
>> Account 2 -> Payment 1 -> Account W
>>
>> If the transaction batch is 5 (I suggest you start with 100), all these operations are executed in one transaction. If another thread has:
>>
>> Account 99 -> Payment 1 -> Account W
>>
>> it could go into conflict because of the shared Account W.
>>
>> If you can export Account IDs that are numeric and incremental, you can use the special Batch Importer: OGraphBatchInsert. Example:
>>
>> OGraphBatchInsert batch = new OGraphBatchInsert("plocal:/temp/mydb", "admin", "admin");
>> batch.begin();
>>
>> batch.createEdge(0L, 1L, null); // CREATE AN EDGE BETWEEN VERTICES 0 AND 1.
>>                                 // IF THE VERTICES DON'T EXIST, THEY ARE CREATED IMPLICITLY
>> batch.createEdge(1L, 2L, null);
>> batch.createEdge(2L, 0L, null);
>>
>> batch.createVertex(3L); // CREATE A NON-CONNECTED VERTEX
>>
>> Map<String, Object> vertexProps = new HashMap<String, Object>();
>> vertexProps.put("foo", "foo");
>> vertexProps.put("bar", 3);
>> batch.setVertexProperties(0L, vertexProps); // SET PROPERTIES FOR VERTEX 0
>>
>> batch.end();
>>
>> This is blazing fast, but it uses heap, so run it with a lot of it.
>>
>>> 6. I've multithreaded my loader. The details are now:
>>>
>>> - using plocal
>>> - using 30 threads
>>> - not using transactions (OrientGraphFactory.getNoTx)
>>
>> You should definitely use transactions with a batch size of 100 items. This speeds things up.
>>
>>> - retrying forever upon write collisions.
>>> - using Orient 2.2.7.
>>
>> Please use the latest, 2.2.10.
>>
>>> - using -XX:MaxDirectMemorySize=258040m
>>
>> This is not really important; it's just an upper bound for the JVM. Please set it to 512GB so you can forget about it. The 2 most important values are DISKCACHE and JVM heap. Their sum must be lower than the RAM available on the server before you run OrientDB.
>>
>> If you have 64GB, try to define 50GB of DISKCACHE and 14GB of Heap.
>>
>> If you use the Batch Importer, you should use more Heap and less DISKCACHE.
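A sketch of that sizing advice for a 64GB box running the loader embedded (plocal): a moderate JVM heap, a generous MaxDirectMemorySize, and most of the RAM left to OrientDB's disk cache. The heap and direct-memory sizes are plain JVM flags (e.g. java -Xmx14g -XX:MaxDirectMemorySize=512g MyLoader); if memory serves, the disk cache can also be set programmatically via OGlobalConfiguration.DISK_CACHE_SIZE, but treat that constant name as an assumption to verify against the 2.2 javadoc.

import com.orientechnologies.orient.core.config.OGlobalConfiguration;

public class MemorySettings {
    public static void apply() {
        // OrientDB's disk cache size is expressed in megabytes: 51200 MB = 50 GB.
        // (Constant name from memory; verify against the OrientDB 2.2 documentation.)
        OGlobalConfiguration.DISK_CACHE_SIZE.setValue(51200);
    }
}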
>>> The good news is I've achieved an initial write throughput of about 30k/second.
>>>
>>> The bad news is I've tried several runs and only been able to achieve 200mil < number of writes < 300mil.
>>>
>>> The first time I tried it, the loader deadlocked. Using jstat showed that the deadlock was between 3 threads at:
>>> - OOneKeyEntryPerKeyLockManager.acquireLock(OOneKeyEntryPerKeyLockManager.java:173)
>>> - OPartitionedLockManager.acquireExclusiveLock(OPartitionedLockManager.java:210)
>>> - OOneKeyEntryPerKeyLockManager.acquireLock(OOneKeyEntryPerKeyLockManager.java:171)
>>
>> If it happens again, could you please send a thread dump?
>>
>>> The second time it failed was due to a NullPointerException at OByteBufferPool.java:297. I've looked at the code and the only way I can see this happening is if OByteBufferPool.allocateBuffer throws an error (perhaps an OutOfMemoryError in java.nio.Bits.reserveMemory). This StackOverflow posting (http://stackoverflow.com/questions/8462200/examples-of-forcing-freeing-of-native-memory-direct-bytebuffer-has-allocated-us) seems to indicate that this can happen if the underlying DirectByteBuffer's Cleaner doesn't have its clean() method called.
>>
>> This is because the database was bigger than this setting: -XX:MaxDirectMemorySize=258040m. Please set it to 512GB (see above).
>>
>>> Alternatively, I followed the SO suggestion and lowered the heap space to a mere 1GB (it was 50GB) to make the GC more active. Unfortunately, after a good start, the job is still running some 15 hours later with a hugely reduced write throughput (~7k/s). Jstat shows 4292 full GCs taking a total time of 4597s - not great but not hugely awful either. At this rate, the remaining 700mil or so payments are going to take another 30 hours.
>>
>> See the suggested settings above.
>>
>>> 7. Even with the highest throughput I have achieved, 30k writes per second, I'm looking at about 20 hours of loading. We've taken the same data and, after trial and error that was not without its own problems, put it into Neo4J in 37 minutes. This is a significant difference. It appears that they are approaching the problem differently to avoid contention on updating the vertices during an edge write.
>>
>> With all these suggestions you should be able to get much better numbers. If you can use the Batch Importer, the numbers should be close to Neo4j's.
>>
>>> Thoughts?
>>>
>>> Regards,
>>>
>>> Phillip
>>
>> Best Regards,
>>
>> Luca Garulli
>> Founder & CEO
>> OrientDB LTD <http://orientdb.com/>
>>
>>> On Thursday, September 15, 2016 at 10:06:44 PM UTC+1, l.garulli wrote:
>>>
>>>> On 15 September 2016 at 09:54, Phillip Henry <[email protected]> wrote:
>>>>
>>>>> Hi, Luca.
>>>>
>>>> Hi Phillip,
>>>>
>>>>> 3. Yes, default configuration. Apart from adding an index for ACCOUNTS, I did nothing further.
>>>>
>>>> Ok, so you have writeQuorum="majority", which means 2 synchronous writes and 1 asynchronous per transaction.
>>>>
>>>>> 4. Good question. With real data, we expect it to be as you suggest: some nodes with the majority of the payments (eg, supermarkets). However, for the test data, payments were assigned randomly and, therefore, should be uniformly distributed.
>>>>
>>>> What's your average in terms of number of edges? <10, <50, <200, <1000?
>>>>
>>>>> 2. Yes, I tried plocal minutes after posting (d'oh!). I saw a good improvement. It started about 3 times faster and got faster still (about 10 times faster) by the time I checked this morning on a job running overnight. However, even though it is now running at about 7k transactions per second, a billion edges is still going to take about 40 hours. So, I ask myself: is there any way I can make it faster still?
>>>>
>>>> What's missing here is the AUTO-SHARDING INDEX. Example:
>>>>
>>>> accountClass.createIndex("Account.number",
>>>>     OClass.INDEX_TYPE.UNIQUE.toString(), (OProgressListener) null, (ODocument) null,
>>>>     "AUTOSHARDING", new String[] { "number" });
>>>>
>>>> In this way you should get more parallelism, because the index is distributed across all the shards (clusters) of the Account class. You should have 32 of them by default because you have 32 cores.
>>>>
>>>> Please let me know if, by sorting the from_accounts and with this change, it's much faster.
>>>>
>>>> This is the best you can have out of the box. Pushing the numbers up further is slightly more complicated: you should make sure that transactions go in parallel and aren't serialized. This is possible by playing with internal OrientDB settings (mainly the distributed workerThreads) and by having many clusters per class (you could try 128 first and see how it goes).
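A sketch of pre-creating extra clusters for the Account class before the bulk load, as suggested above; the class and cluster names are illustrative and it assumes the class already exists.

import com.orientechnologies.orient.core.metadata.schema.OClass;
import com.orientechnologies.orient.core.metadata.schema.OSchema;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;

public class AddClusters {
    public static void main(String[] args) {
        OrientGraphNoTx graph = new OrientGraphNoTx("plocal:/temp/mydb", "admin", "admin");
        try {
            OSchema schema = graph.getRawGraph().getMetadata().getSchema();
            OClass accountClass = schema.getClass("Account");
            // Grow from the default (one cluster per core) to 128 clusters.
            for (int i = accountClass.getClusterIds().length; i < 128; i++) {
                accountClass.addCluster("account_" + i);
            }
        } finally {
            graph.shutdown();
        }
    }
}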
>>>>> I assume when I start the servers up in distributed mode once more, the data will then be distributed across all nodes in the cluster?
>>>>
>>>> That's right.
>>>>
>>>>> 3. I'll return to concurrent, remote inserts when this job has finished. Hopefully, a smaller batch size will mean there is no degradation in performance either... FYI: with a somewhat unscientific approach, I was polling the server JVM with JStack and saw only a single thread doing all the work and it *seemed* to spend a lot of its time in ODirtyManager on collection manipulation.
>>>>
>>>> I think it's because you didn't use the AUTO-SHARDING index. Furthermore, running distributed unfortunately means the tree-based RidBag is not available (we will support it in the future), so every change to the edges takes a lot of CPU to unmarshal and marshal the entire edge list every time you update a vertex. Hence my recommendation to sort the vertices.
>>>>
>>>>> I totally appreciate that performance tuning is an empirical science, but do you have any opinions as to which would probably be faster: single-threaded plocal or multithreaded remote?
>>>>
>>>> With v2.2 you can go in parallel by using the tips above. For sure, the replication has a cost. I'm sure you can go much faster with just one node and then start the other 2 nodes to have the database replicated automatically, at least for the first massive insertion.
>>>>
>>>>> Regards,
>>>>>
>>>>> Phillip
>>>>
>>>> Luca
>>>>
>>>>> On Wednesday, September 14, 2016 at 3:48:56 PM UTC+1, Phillip Henry wrote:
>>>>>
>>>>>> Hi, guys.
>>>>>>
>>>>>> I'm conducting a proof-of-concept for a large bank (Luca, we had a 'phone conf on August 5...) and I'm trying to bulk insert a humongous amount of data: 1 million vertices and 1 billion edges.
>>>>>>
>>>>>> Firstly, I'm impressed by how easy it was to configure a cluster. However, the performance of batch inserting is bad (and seems to get considerably worse as I add more data). It starts at about 2k vertices-and-edges per second and deteriorates to about 500/second after only about 3 million edges have been added. This also takes ~30 minutes. Needless to say, 1 billion payments (edges) will take over a week at this rate.
>>>>>>
>>>>>> This is a show-stopper for us.
>>>>>>
>>>>>> My data model is simply payments between accounts and I store it in one large file. It's just 3 fields and looks like:
>>>>>>
>>>>>> FROM_ACCOUNT TO_ACCOUNT AMOUNT
>>>>>>
>>>>>> In the test data I generated, I had 1 million accounts and 1 billion payments randomly distributed between pairs of accounts.
>>>>>>
>>>>>> I have 2 classes in OrientDB: ACCOUNTS (extending V) and PAYMENT (extending E). There is a UNIQUE_HASH_INDEX on ACCOUNTS for the account number (a string).
>>>>>>
>>>>>> We're using OrientDB 2.2.7.
>>>>>>
>>>>>> My batch size is 5k and I am using the "remote" protocol to connect to our cluster.
>>>>>>
>>>>>> I'm using JDK 8 and my 3 boxes are beefy machines (32 cores each) but without SSDs. I wrote the importing code myself but did nothing 'clever' (I think) and used the Graph API. This client code has been given lots of memory and, using jstat, I can see it is not excessively GCing.
>>>>>>
>>>>>> So, my questions are:
>>>>>>
>>>>>> 1. what kind of performance can I realistically expect and can I improve what I have at the moment?
>>>>>>
>>>>>> 2. what kind of degradation should I expect as the graph grows?
>>>>>>
>>>>>> Thanks, guys.
>>>>>>
>>>>>> Phillip
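For reference, a sketch of the schema described in the original post above (ACCOUNTS extends V, PAYMENT extends E, UNIQUE_HASH_INDEX on the account number); the property name "number" is an assumption for illustration, as the post only says the account number is a string.

import com.orientechnologies.orient.core.metadata.schema.OClass;
import com.orientechnologies.orient.core.metadata.schema.OType;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;

public class CreateSchema {
    public static void main(String[] args) {
        OrientGraphNoTx graph = new OrientGraphNoTx("plocal:/temp/mydb", "admin", "admin");
        try {
            // Vertex class for accounts with a unique hash index on the account number.
            OClass accounts = graph.createVertexType("ACCOUNTS");
            accounts.createProperty("number", OType.STRING);
            accounts.createIndex("ACCOUNTS.number", OClass.INDEX_TYPE.UNIQUE_HASH_INDEX, "number");

            // Edge class for payments between accounts.
            graph.createEdgeType("PAYMENT");
        } finally {
            graph.shutdown();
        }
    }
}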
