[orientdb] ETL: how to import a huge csv and generate different vertexes/edges based on the csv column values

Lars Plessmann Wed, 21 Jan 2015 11:18:02 -0800

I have a really huge CSV file (about 240 GB) with several columns (let's 
say columns A to H). 
Column A is the primary key of the main record (vertex MainRecord). Columns 
D, E, F and G, however, should each be stored in their own vertex class, 
because their values repeat across the records and I don't want to store 
them in the main record again and again. Each distinct value of D to G 
should be stored once, as a property called "title" on a new vertex, 
without generating duplicates. Afterwards these vertices need to be linked 
to the main record.
Is it possible to achieve this with a single orient-etl configuration? The 
only way I know is to split the huge CSV file by columns and create a 
separate file for each vertex class, but I don't want to do that unless it 
is necessary (the file is so big).
Can you give me some advice?
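
To make the intended result concrete, here is a minimal Python sketch of 
the transformation described above. It is not orient-etl and does not talk 
to OrientDB; it only shows the single-pass logic: one MainRecord vertex per 
row, one lookup vertex per distinct value of colD to colG, and one edge per 
non-null value. The function name split_rows, the event tuples and the 
sample data are hypothetical; the column and class names are taken from the 
config sketch in this post.

```python
import csv
import io

# Column -> vertex class, as named in the config sketch of this post.
LOOKUP_CLASSES = {
    "colD": "ColumnD",
    "colE": "ColumnE",
    "colF": "ColumnF",
    "colG": "ColumnG",
}


def split_rows(rows, key_column="_id", null_value="NULL"):
    """Stream rows once and yield ('vertex', ...) and ('edge', ...) events.

    A lookup vertex is yielded only the first time its title is seen,
    so no duplicates are generated. Only the distinct titles are kept
    in memory, never the rows themselves.
    """
    seen = {col: set() for col in LOOKUP_CLASSES}
    for row in rows:
        # The main record keeps every column except the lookup columns.
        main = {k: v for k, v in row.items() if k not in LOOKUP_CLASSES}
        yield ("vertex", "MainRecord", main)
        for col, cls in LOOKUP_CLASSES.items():
            title = row[col]
            if title in ("", null_value):
                continue  # nothing to link for an empty value
            if title not in seen[col]:
                seen[col].add(title)
                yield ("vertex", cls, {"title": title})
            # Link the main record (by key) to the lookup vertex (by title).
            yield ("edge", row[key_column], cls, title)


# Hypothetical sample data standing in for the real 240 GB file.
sample = io.StringIO(
    "_id,colB,colD,colE,colF,colG\n"
    "1,x,red,small,NULL,eu\n"
    "2,y,red,large,NULL,eu\n"
)
events = list(split_rows(csv.DictReader(sample)))
```

Because the rows are consumed lazily, the memory use of this sketch depends 
on the number of distinct values in columns D to G, not on the file size. 
That only helps if those columns really are as redundant as assumed here.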


I'll try to describe what I need in the config JSON syntax (of course, this 
will not work as written):

{
  "source": {
    "file": {
      "path": "dataexport.csv"
    }
  },
  "extractor": {"row": {}},
  "transformers": [
    {
      "csv": {
        "separator": ",",
        "nullValue": "NULL",
        "skipFrom": -1,
        "skipTo": -1
      }
    },
    {
      "field": {
        "fieldName": "_id",
        "expression": "$input._id.substring(9, 33)"
      }
    },
    {
      "field": {
        "fieldName": "colD",
        "class": "ColumnD",
        "classProperty": "title"
      }
    },
    {
      "field": {
        "fieldName": "colE",
        "class": "ColumnE",
        "classProperty": "title"
      }
    },
    {
      "field": {
        "fieldName": "colF",
        "class": "ColumnF",
        "classProperty": "title"
      }
    },
    {
      "field": {
        "fieldName": "colG",
        "class": "ColumnG",
        "classProperty": "title"
      }
    },
    {
      "vertex": {"class": "MainRecord"}
    }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "remote:127.0.0.1/msales_testing",
      "dbUser": "admin",
      "dbPassword": "admin",
      "dbAutoCreate": true,
      "dbType": "graph",
      "classes": [
        {
          "name": "MainRecord",
          "extends": "V"
        },
        {
          "name": "ColumnD",
          "extends": "V"
        },
        {
          "name": "ColumnE",
          "extends": "V"
        },
        {
          "name": "ColumnF",
          "extends": "V"
        },
        {
          "name": "ColumnG",
          "extends": "V"
        }
      ],
      "indexes": [
        {
          "class": "MainRecord",
          "fields": ["_id:string"],
          "type": "UNIQUE"
        }
      ]
    }
  }
}



By the way, _id is in the MongoDB ObjectID format. I just want to store the 
original hex value, so I used the SQL substring method to extract the hex 
id. Maybe there is a better way.
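
As an illustration of what that substring call does, here is a small Python 
sketch. It assumes the exported value looks like ObjectId(<24 hex digits>), 
which is what the offsets 9 and 33 suggest; the sample id and the helper 
name object_id_hex are made up. The regex variant is one possible 
alternative that does not depend on fixed offsets.

```python
import re

# 24 consecutive hex digits, the textual form of a MongoDB ObjectID.
_OBJECT_ID_HEX = re.compile(r"[0-9a-fA-F]{24}")


def object_id_hex(raw):
    """Return the 24-character hex id found in raw, or None if absent."""
    match = _OBJECT_ID_HEX.search(raw)
    return match.group(0) if match else None


# Assumed export format; the id itself is a made-up example.
raw = "ObjectId(507f1f77bcf86cd799439011)"

# Fixed offsets, equivalent to substring(9, 33) in the config sketch.
by_offset = raw[9:33]

# Pattern match, which also tolerates quotes around the hex value.
by_pattern = object_id_hex(raw)
quoted = object_id_hex('ObjectId("507f1f77bcf86cd799439011")')
```

The fixed offsets break as soon as the value is quoted or prefixed 
differently, while the pattern only relies on the 24 hex digits being 
present.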


regards
Lars
