I wrote a command-line utility in Groovy that did this for Titan -
here's how it worked:
1) Either a file or directory path for vertices/edges was passed.
2) Optional regex for extracting the vertex label from the file name(s).
- Default was to split on underscore/dash/whitespace and take
element [0] (reading the label from inside the file would be more
flexible).
- The files were batched according to the available processors.
- A transaction was opened to load each file from each batch.
3) Vertices - 1st column was the id property, the remaining columns
were additional props.
- This should really be a selection of the desired named/positional column.
- The user should be able to provide an id mapping file, which:
a. Restricts the ids loaded to those that have a mapped-to id.
b. Shows how much of the file their intended id conversion covers.
4) Edges - 1st column v1 id, 2nd label, 3rd v2 id, rest edge props.
- Should also be generalized to selection/configuration by user.
5) Type - appended after a colon to the column header, e.g. "name:int".
- Type is often inferred from the first hundred lines of the file.
- But when the inconsistencies are further along than that, ugh.
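
A rough Python sketch of steps 2-5 above, not the original Groovy utility.
Everything named here (label_from_filename, parse_header, read_vertices,
apply_id_map, read_edges, the CONVERTERS table) is hypothetical, the "graph"
is just in-memory dicts rather than Titan, and the id-map behaviour is my
reading of 3a/3b. Coverage is counted over distinct vertex ids, and edges
whose endpoints were not loaded are silently skipped, which a real loader
would probably want to report.

```python
import csv
import io
import re

# Hypothetical converters for the "name:type" suffix; untyped columns stay strings.
CONVERTERS = {
    "int": int,
    "float": float,
    "bool": lambda s: s.strip().lower() == "true",
    "string": str,
}


def label_from_filename(filename, pattern=None):
    """Label is group 1 of the user's regex, else element [0] of the split name."""
    stem = filename.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    if pattern is not None:
        match = re.search(pattern, stem)
        if match is None:
            raise ValueError(f"no label found in {filename!r}")
        return match.group(1)
    return re.split(r"[_\-\s]+", stem)[0]


def parse_header(header):
    """Turn each "name:type" header cell into a (name, converter) pair."""
    columns = []
    for cell in header:
        name, _, type_name = cell.partition(":")
        columns.append((name.strip(), CONVERTERS[type_name.strip() or "string"]))
    return columns


def typed_rows(text):
    """Yield (columns, converted values) for every data row of a CSV string."""
    rows = csv.reader(io.StringIO(text))
    columns = parse_header(next(rows))
    for row in rows:
        yield columns, [convert(cell) for (_, convert), cell in zip(columns, row)]


def read_vertices(text, label):
    """1st column is the id property, the remaining columns become props."""
    vertices = {}
    for columns, values in typed_rows(text):
        props = {name: value for (name, _), value in zip(columns[1:], values[1:])}
        vertices[values[0]] = {"label": label, "properties": props}
    return vertices


def apply_id_map(vertices, id_map):
    """Keep only mapped ids, re-keyed by the mapped-to id, and report coverage."""
    kept = {id_map[vid]: vertex for vid, vertex in vertices.items() if vid in id_map}
    hits = sum(1 for vid in vertices if vid in id_map)
    coverage = hits / len(vertices) if vertices else 0.0
    return kept, coverage


def read_edges(text, vertices):
    """Columns are v1 id, label, v2 id, then edge props."""
    edges = []
    for columns, values in typed_rows(text):
        out_id, label, in_id = values[:3]
        if out_id in vertices and in_id in vertices:
            props = {name: value for (name, _), value in zip(columns[3:], values[3:])}
            edges.append((out_id, label, in_id, props))
    return edges
```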

I never managed to get the "interactive" part working before I moved
on from this, but I think it's essential, as the user should not have to
hack on the CSV data much to get it to load. My idea was to display
the file headers, get the user to mark which one holds the "identifier"
(for Titan this was just a property key under a unique index), ask them
whether they have a map file for that identifier, and finally ask them to
confirm the types we inferred from the first 100 lines (or some other
sampling of lines), with an option to "just do it already".

Then, if the user is trying to load a gazillion CSV files from a directory
or set of directories, we just ask them for "profiles" like this to apply
per directory, per file name matching some regex, per n x m column
shape, or by something else that distinguishes the files from each other.
The same general approach applies to edges.

Of course, all of this should be possible to tuck away in a configuration
file, or to provide as arguments to a "builder" in the REPL somehow. I
think that could get confusing fast, but with hand-holding similar to the
above it could be workable.
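
To make the "profile" idea a bit more concrete, here is a minimal Python
sketch, assuming a profile is nothing more than a file-name regex, the
identifier column, and any types the user confirmed. Profile, profile_for,
infer_column_types and resolve_types are all made-up names, and only
int/float/string are inferred. Types are guessed from the first sample_size
rows, so a bad value past the sample still slips through unless the user
overrides that column in the profile.

```python
import re
from dataclasses import dataclass, field
from itertools import islice


def infer_type(values):
    """Pick the narrowest of int, float, string that parses every sampled value."""
    if not values:
        return "string"
    for name, parse in (("int", int), ("float", float)):
        try:
            for value in values:
                parse(value)
            return name
        except ValueError:
            continue
    return "string"


def infer_column_types(header, rows, sample_size=100):
    """Infer one type per column from the first sample_size rows only."""
    sample = list(islice(rows, sample_size))
    return {name: infer_type([row[i] for row in sample]) for i, name in enumerate(header)}


@dataclass
class Profile:
    """Hypothetical per-file settings the user would confirm once."""
    file_pattern: str  # regex matched against the file name
    id_column: str     # header the user marked as the identifier
    types: dict = field(default_factory=dict)  # user-confirmed type overrides


def profile_for(filename, profiles):
    """Return the first profile whose pattern matches the file name, else None."""
    for profile in profiles:
        if re.search(profile.file_pattern, filename):
            return profile
    return None


def resolve_types(header, rows, profile, sample_size=100):
    """Inferred types, with anything confirmed in the profile taking precedence."""
    types = infer_column_types(header, rows, sample_size)
    if profile is not None:
        types.update(profile.types)
    return types
```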

For parsing the file, I think it needs reasonable defaults but, like most
CSV parsing frameworks, should provide options to change the quote
character, line terminator, and delimiter, to skip n lines at the front or
n lines at the back, and all that stuff.
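
As an illustration of those options, here is a small sketch on top of
Python's standard csv module. read_rows and its parameter names are made up
for this example. One caveat I'm sure of: csv.reader ignores the dialect's
lineterminator and recognises \r and \n itself, so a custom terminator is
handled here by splitting the text first, which means quoted fields that
contain line breaks are not supported by this sketch.

```python
import csv


def read_rows(text, delimiter=",", quotechar='"', line_terminator=None,
              skip_head=0, skip_tail=0):
    """Parse CSV text with configurable delimiter, quote char and skipped lines."""
    if line_terminator is None:
        lines = text.splitlines()
    else:
        lines = text.split(line_terminator)
        if lines and lines[-1] == "":
            lines.pop()  # drop the empty piece after a trailing terminator
    # Drop n lines at the front and n lines at the back before parsing.
    end = max(skip_head, len(lines) - skip_tail)
    return list(csv.reader(lines[skip_head:end],
                           delimiter=delimiter, quotechar=quotechar))


# Example: pipe-delimited, single-quoted, with a banner line and a footer line.
text = "# exported\nid|name\n1|'a, b'\n2|'x'\ntotal: 2\n"
rows = read_rows(text, delimiter="|", quotechar="'", skip_head=1, skip_tail=1)
```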

Hope that helps somewhat - sorry for the spam if this could have gone
unsaid.
________________________________________
From: Stephen Mallette <[email protected]>
Sent: Wednesday, December 2, 2015 6:55 AM
To: [email protected]
Subject: Re: [DISCUSS] Add native CSV loading support for gremlin (GraphReader)

Thanks for bringing this up for discussion and offering to work on it. You
don't mention how you will deal with data types - will you have some way
to give users fine-grained control over that?

On Tue, Dec 1, 2015 at 10:46 AM, Alaa Mahmoud <[email protected]> wrote:
> Adding support for loading CSV into a graph using Gremlin's GraphReader
> will lower the entry barrier for new users. A lot of data is already in CSV
> format and a lot of existing databases/repositories allow users to export
> their data as CSV.
>
> I'd like to add this capability to the gremlin core as a new GraphReader
> instance. Since the CSV data doesn't map directly to vertices and edges,
> I'm planning to do the loading in two steps:
>
> *Nodes*
> The first is to load a CSV as a vertex CSV file. I'll create a node for every
> line in the csv and a property for each column on that line. If the csv has
> column headers, then the names of the columns will be the names of the
> corresponding vertex properties. Otherwise, they'll be prop1, prop2, etc.
> (There are other ways to do it as well, but I'm just trying to show the
> general idea)
>
> *Edges*
> The second step is loading the edges csv file which will be in the
> following format
>
> vertex1 prop name (source vertex), vertex2 prop name (destination vertex),
> bidirectional (TRUE/FALSE), prop1,prop2,prop3,etc...
>
> For each line in the edge csv file, the reader will search for a vertex
> with the vertex1 prop value (the caller needs to ensure it's unique) to find
> the source vertex, search for a destination vertex with the vertex2 prop
> value, and then create an edge that ties the two together. We will create an
> edge property for each additional property on the line.
>
> Thoughts?
>
> Alaa
>