I recently used the ETL tool to insert a batch of CSV data into OrientDB.
The system configuration I used for this trial is an EC2 M3 large instance
(7.5 GiB of memory, 2 vCPUs, 32 GB of SSD-based local instance storage,
64-bit platform).

The data I'm trying to upload has the following format:

"101.186.130.130","527225725","233 djfnsdkj","0.119836317542" 
"125.143.534.148","112212983","1227 sdfsdfds","0.0465215171983" 
"103.149.957.752","112364761","1121 sdfsdfds","0.0938863016658" 
"103.190.245.128","785804692","6138 sdfsdfsd","0.117767539364"

The schema contains two vertex classes and one edge class. When I tried
loading the data using the ETL tool with the plocal option, the speed was
only about 2300 rows / second. The ETL configuration is shown below:

{
  "source": { "file": { "path": 
"/home/ubuntu/labvolume1/orientdb/bin/0001_part_00" } },
  "extractor": { "csv": {"columnsOnFirstLine": false, 
"columns":["ip:string", "dpcb:string", "address:string", "prob:string"] } },
  "transformers": [
    { "merge": { "joinFieldName":"ip", "lookup":"IpAddress.ip" } },
    { "field": {
        "fieldName": "addr_key",
        "expression": "dpcb.append('_').append(address)"
      }
    },
    { "vertex": { "class": "IpAddress" } },
    { "edge": { "class": "Located",
                "joinFieldName": "addr_key",
                "lookup": "PhyLocation.loc",
                "direction": "out",
                "targetVertexFields": { "geo_address": "${input.address}", 
"dpcb_number": "${input.dpcb}"},
                "edgeFields": { "confidence": "${input.prob}" },
                "unresolvedLinkAction": "CREATE"
            }
        }
  ],
  "loader": {
    "orientdb": {
       "dbURL": 
"plocal:/home/ubuntu/labvolume1/orientdb/databases/Bulk_Transfer_Test1",
       "dbType": "graph",
       "dbUser": "admin",
       "dbPassword": "admin",
       "serverUser": "admin",
       "wal": false,
       "serverPassword":"admin",
       "classes": [
         {"name": "IpAddress", "extends": "V"},
         {"name": "PhyLocation", "extends": "V"},
         {"name": "Located", "extends": "E"}
       ], "indexes": [
         {"class":"IpAddress", "fields":["ip:string"], "type":"UNIQUE" },
         {"class":"PhyLocation", "fields":["loc:string"], "type":"UNIQUE" }
       ]
    }
  }
}


Then I separated the vertices into their own files and ran the ETL job for
the vertices only; this time the speed was close to 12500 rows / second.
This was reasonably fast and works for me. (When I removed the indexes, the
speed almost doubled.) The config I used was:

{
  "source": { "file": { "path": 
"/home/ubuntu/labvolume1/orientdb/bin/only_ip_05.csv" } },
  "extractor": { "csv": {"columnsOnFirstLine": false, 
"columns":["ip:string"] } },
  "transformers": [
    { "vertex": { "class": "IpAddress" } }],
  "loader": {
    "orientdb": {
       "dbURL": 
"plocal:/home/ubuntu/labvolume1/orientdb/databases/Bulk_Transfer_Test7",
       "dbType": "graph",
       "dbUser": "admin",
       "dbPassword": "admin",
       "serverUser": "admin",
       "wal": false,
       "serverPassword":"admin",
       "classes": [
         {"name": "IpAddress", "extends": "V"}
       ],
       "indexes": [
         {"class":"IpAddress", "fields":["ip:string"], "type":"UNIQUE" }
       ]
    }
  }
}


However, when I then tried to insert the edges alone, the speed dropped to
about 2200 rows / second, which is even lower than running the entire
operation in a single run. The config file is included below:

{
  "source": { "file": { "path": 
"/home/ubuntu/labvolume1/orientdb/bin/edge5.csv" } },
  "extractor": { "csv": {"columnsOnFirstLine": false, 
"columns":["ip:string", "loc:string", "prob:string"] } },
  "transformers": [
    { "merge": { "joinFieldName":"ip", "lookup":"IpAddress.ip" } },
    { "vertex": { "class" : "IpAddress", "skipDuplicates" : true }},
    { "edge": { "class": "Located",
                "joinFieldName": "loc",
                "lookup": "PhyLocation.loc",
                "direction": "out",
                "edgeFields": { "confidence": "${input.prob}" },
                "unresolvedLinkAction": "NOTHING"
            }
        }
  ],
  "loader": {
    "orientdb": {
       "dbURL": 
"plocal:/home/ubuntu/labvolume1/orientdb/databases/Bulk_Transfer_Test7",
       "dbType": "graph",
       "dbUser": "admin",
       "dbPassword": "admin",
       "serverUser": "admin",
       "wal": false,
       "tx":false,
       "batchCommit":10000,
       "serverPassword":"admin",
       "classes": [
         {"name": "IpAddress", "extends": "V"},
         {"name": "PhyLocation", "extends": "V"},
         {"name": "Located", "extends": "E"}
       ]
    }
  }
}
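
For reference, the split of the original CSV into separate vertex and edge
files can be sketched roughly as follows. This is a minimal sketch, not the
exact script that was used: the function names are illustrative, it assumes
the whole file fits in memory, and it assumes the loc key is built as
dpcb + "_" + address, mirroring the addr_key expression in the first config.

```python
import csv


def split_rows(rows):
    """Split (ip, dpcb, address, prob) rows into vertex keys and edge rows.

    Returns unique ips, unique locs, and one (ip, loc, prob) tuple per row.
    """
    ips, locs, edges = [], [], []
    seen_ips, seen_locs = set(), set()
    for ip, dpcb, address, prob in rows:
        # Assumption: same key as "dpcb.append('_').append(address)".
        loc = dpcb + "_" + address
        if ip not in seen_ips:
            seen_ips.add(ip)
            ips.append(ip)
        if loc not in seen_locs:
            seen_locs.add(loc)
            locs.append(loc)
        edges.append((ip, loc, prob))
    return ips, locs, edges


def write_csv(path, rows):
    """Write rows with every field quoted, like the sample data."""
    with open(path, "w", newline="") as f:
        csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)


def split_file(src, ip_path, loc_path, edge_path):
    """Read the combined CSV and write one file per vertex class plus edges."""
    with open(src, newline="") as f:
        ips, locs, edges = split_rows(csv.reader(f))
    write_csv(ip_path, [[ip] for ip in ips])
    write_csv(loc_path, [[loc] for loc in locs])
    write_csv(edge_path, edges)
```

Deduplicating the vertex keys up front means each vertex file contains only
unique values, so the vertex load does not have to reject duplicates.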


Could you please let me know if I'm doing anything wrong here? Please also
suggest better ways to improve performance.
