I've been asked to kick the tires on OrientDB as a possible graph DB 
solution for an upcoming project at my company. To do so, I've 
spun up an EC2 instance from the OrientDB marketplace AMI on an m4.xlarge 
box with an EBS drive (picked as a general-purpose box since I couldn't 
find any hardware recommendations in the documentation or through searches). I've got 
the database running, but when I try to use the ETL bulk import tools with 
CSV files, I'm seeing what I consider very poor performance compared to the 
claims I have read. The best I've seen is about 5K records/second loaded. The 
existing documentation leaves a bit to be desired, so I was hoping someone 
might be able to offer some insight.

Here are some details (I've scaled things back to try to understand where I 
may have gone wrong):


   - One file of about 2 million records with two tab-separated columns 
   (record key and text field), e.g.  ABC\tString here
   - A class schema was predefined outside the ETL config script with those 
   two fields and an index on the id (record key) field
   - This ETL script, based on one in the documentation, which I am running 
   on the EC2 box (I am using a remote: connection because the project will 
   consist of a distributed DB, even though client and server are on the 
   same box right now):
   
{
    "source":{
        "file":{
            "path":"/user/poc1_Datasets/organization.tsv"
        }
    },
    "extractor":{
        "row":{

        }
    },
    "transformers":[
        {
            "csv":{
               "separator": "\t"
            }
        },
        {
            "vertex":{
                "class":"Organization"
            }
        }
    ],
    "loader":{
        "orientdb":{
            "dbURL":"remote:localhost/DataSpine1",
            "dbType":"graph",
            "wal":false,
            "tx":false,
            "batchCommit":25000
        }
    }
}

The final output of the ETL loader in this case was:

END ETL PROCESSOR
+ extracted 1,822,150 rows (3,904 rows/sec) - 1,822,150 rows -> loaded 
1,822,149 vertices (3,907 vertices/sec) Total time: 520411ms [0 warnings, 0 
errors]
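
As a sanity check on those figures, the average rate over the whole run can be recomputed from the row count and total time in that log line. This is only a minimal sketch of the arithmetic; the function name is made up for illustration and nothing here comes from the OrientDB tooling itself.

```python
# Figures copied from the END ETL PROCESSOR line in the log output.
rows_extracted = 1_822_150
total_time_ms = 520_411


def rows_per_second(rows: int, elapsed_ms: int) -> float:
    """Average throughput over the whole run (hypothetical helper)."""
    return rows / (elapsed_ms / 1000.0)


average_rate = rows_per_second(rows_extracted, total_time_ms)
print(f"average: {average_rate:,.0f} rows/sec")  # prints "average: 3,501 rows/sec"
```

That is a little below the 3,904 rows/sec printed by the loader, which may be measured over a different window than the full run. Either way it is in the same range as the roughly 5K records/second best case mentioned earlier.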

Does using the remote: protocol really hurt performance that much?  I 
believe the AMI has configured the data to sit on the EBS drive. 
Should I try to find an instance type that could use local ephemeral 
storage instead?
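
One way to separate the protocol cost from the storage cost would be to run the identical job twice, changing only the dbURL between remote: and plocal:. The following is a rough sketch that generates both config variants from the script above using only Python's standard library. The plocal path is a guess at where the AMI keeps its databases, and the helper name is made up; both are assumptions, not something taken from the OrientDB documentation.

```python
import copy
import json

# The ETL config from this post, expressed as a Python dict.
BASE_CONFIG = {
    "source": {"file": {"path": "/user/poc1_Datasets/organization.tsv"}},
    "extractor": {"row": {}},
    "transformers": [
        {"csv": {"separator": "\t"}},
        {"vertex": {"class": "Organization"}},
    ],
    "loader": {
        "orientdb": {
            "dbURL": "remote:localhost/DataSpine1",
            "dbType": "graph",
            "wal": False,
            "tx": False,
            "batchCommit": 25000,
        }
    },
}


def with_db_url(config: dict, db_url: str) -> dict:
    """Return a copy of the config that differs only in its dbURL (hypothetical helper)."""
    variant = copy.deepcopy(config)
    variant["loader"]["orientdb"]["dbURL"] = db_url
    return variant


remote_config = with_db_url(BASE_CONFIG, "remote:localhost/DataSpine1")
# ASSUMPTION: this directory is illustrative; use the real database directory.
plocal_config = with_db_url(BASE_CONFIG, "plocal:/opt/orientdb/databases/DataSpine1")

# Emit the plocal variant as JSON, ready to save and pass to the ETL tool.
print(json.dumps(plocal_config, indent=4))
```

Since everything else stays the same, any difference in rows/sec between the two runs should come from the connection type. The plocal run would probably need the server stopped first, on the assumption that a plocal database can only be opened by one process at a time.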

Any insights you could provide would be appreciated.

Curt
