I spent some additional time over the past few weeks trying different 
variants of this basic setup, with little success in improving 
performance.  I'm posting my results in case anyone else comes along later 
looking for posts on this subject.

To see whether the networked EBS drives were the bottleneck, I ran on a 
single r3.large instance and saw essentially the same throughput for the 
vertices and edges I was attempting to load. When I switched from remote 
to plocal, I saw approximately a 7X performance increase in loading the 
vertices.  Unfortunately, in our envisioned scenario, running in plocal 
mode is likely not feasible.
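
For reference, the remote/plocal switch amounts to changing the dbURL in 
the loader section of the ETL config. A minimal sketch, assuming the same 
loader settings as in the original post quoted below; the database 
directory shown is a hypothetical placeholder, not the actual path on the 
AMI:

```json
{
    "loader":{
        "orientdb":{
            "dbURL":"plocal:/path/to/databases/DataSpine1",
            "dbType":"graph",
            "wal":false,
            "tx":false,
            "batchCommit":25000
        }
    }
}
```

With plocal the ETL process opens the database files directly instead of 
going through the server, so the server should not have the same database 
open while the load runs.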

Loading the edges was a different story altogether.  Based on our data 
flows, the edges were extracted from our data separately from the 
vertices, so we had to load them after populating the vertices in the DB. 
As a result, I assume the ETL loader had to run 2 queries (to convert our 
native record ids into RIDs) before it could actually add each edge to the 
graph. With version 2.2.0 of the software, the ETL tool threw errors while 
processing our file (a move to 2.2.2 eventually solved the issue). When we 
were finally able to run the files successfully, we were seeing throughput 
in the range of about 150 edges/sec (running with one thread). We also 
wrote a simple Apache Spark driver program using the Java Graph API and 
were able to run parallel record streams and reach a load rate of 
approximately 1,000 edges/sec before errors started showing up in our 
loader. 
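
For anyone trying something similar, here is a minimal sketch of what an 
edge-only ETL config of this kind might look like. It is not the config 
used for the numbers above. The file path, the column names from_id and 
to_id, and the edge class RelatedTo are hypothetical, and the transformer 
chain may need adjusting for a given OrientDB version. The merge 
transformer resolves the source vertex by key and the edge transformer 
resolves the target vertex by key, which corresponds to the two lookups 
per edge mentioned above:

```json
{
    "source":{
        "file":{
            "path":"/path/to/edges.tsv"
        }
    },
    "extractor":{
        "row":{
        }
    },
    "transformers":[
        {
            "csv":{
                "separator":"\t"
            }
        },
        {
            "merge":{
                "joinFieldName":"from_id",
                "lookup":"Organization.id"
            }
        },
        {
            "vertex":{
                "class":"Organization"
            }
        },
        {
            "edge":{
                "class":"RelatedTo",
                "joinFieldName":"to_id",
                "lookup":"Organization.id",
                "direction":"out"
            }
        }
    ],
    "loader":{
        "orientdb":{
            "dbURL":"remote:localhost/DataSpine1",
            "dbType":"graph",
            "wal":false,
            "tx":false,
            "batchCommit":25000
        }
    }
}
```

Both lookups go through the index on the id field, so every edge costs 
two index queries in addition to the write itself.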



On Wednesday, June 8, 2016 at 10:34:30 AM UTC-4, Curt Kohler wrote:
>
> I've been asked to kick the tires on OrientDB as a possible graph DB 
> solution for an upcoming project at my company. In order to do so, I've 
> spun up an EC2 instance using the OrientDB marketplace AMI on an m4.xlarge 
> box with an EBS drive (picked as a general purpose box since I couldn't 
> find any hardware recommendations via documentation or searches).  I've got 
> the database running, but when I try and use the ETL bulk import tools with 
> CSV files, I'm seeing what I consider very poor performance compared to the 
> claims I have read. The best I've seen is about 5K records/second loaded. The 
> existing documentation leaves a bit to be desired, so I was hoping someone 
> might be able to offer some insight.
>
> Here are some details (I've scaled things back trying to understand where 
> I may have gone wrong).
>
>
>    - One file of 2 million records with two columns (record key and 
>    text field), e.g. ABC\tString here
>    - A class schema was predefined outside the ETL config script with 
>    those two fields and an index on the id field
>    - This ETL script, based on one in the documentation, which I am 
>    running on the EC2 box (I am using a remote: connection because the 
>    project will consist of a distributed DB, even though both are on the 
>    same box right now)
>    
> {
>     "source":{
>         "file":{
>             "path":"/user/poc1_Datasets/organization.tsv"
>         }
>     },
>     "extractor":{
>         "row":{
>
>         }
>     },
>     "transformers":[
>         {
>             "csv":{
>                "separator": "\t"
>             }
>         },
>         {
>             "vertex":{
>                 "class":"Organization"
>             }
>         }
>     ],
>     "loader":{
>         "orientdb":{
>             "dbURL":"remote:localhost/DataSpine1",
>             "dbType":"graph",
>             "wal":false,
>             "tx":false,
>             "batchCommit":25000
>         }
>     }
> }
>
> The final output of the ETL loader in this case was:
>
> END ETL PROCESSOR
> + extracted 1,822,150 rows (3,904 rows/sec) - 1,822,150 rows -> loaded 
> 1,822,149 vertices (3,907 vertices/sec) Total time: 520411ms [0 warnings, 0 
> errors]
>
> Does using the remote: protocol really kill performance that greatly?  I 
> believe the AMI has configured the data to be sitting on the EBS drive. 
> Should I try to find an instance that would leverage local ephemeral 
> storage?
>
> Any insights you could provide would be appreciated.
>
> Curt
>
