Re: [DISCUSS] Add native CSV loading support for gremlin (GraphReader)

Stephen Mallette Wed, 02 Dec 2015 09:08:14 -0800

Dylan - thanks for your input. What you said actually gets at the direction
I was heading when I asked about how "types" would be handled, and it
underscores what I perceive as a greater level of complexity for this task
than is handled well in the standard GraphReader/Writer interfaces.


On Wed, Dec 2, 2015 at 11:48 AM, Dylan Bethune-Waddell <
[email protected]> wrote:

> I wrote a command line utility in Groovy that would do this for Titan -
> here's how it worked:
>
> 1) Either a file or directory path for vertices/edges was passed.
> 2) Optional regex for extracting the vertex label from the file name(s).
>     - Default is to split on underscores/dash/whitespace and take
>       element [0] (the label in the file would give more flexibility).
>     - These files are batched according to available processors.
>     - A transaction was opened to load each file from each batch.
> 3) Vertices - 1st column as id property, remaining additional props.
>     - Should just be selection of the desired named/positional column.
>     - The user should be able to provide an id mapping file:
>        a. Restricts ids they care to load by the mapped-to ids.
>        b. Shows coverage of their intended id conversion over file lines.
> 4) Edges - 1st column v1 id, 2nd label, 3rd v2 id, rest edge props.
>     - Should also be generalized to selection/configuration by user.
> 5) Type - append after colon to the column header e.g. "name:int"
>     - Type is often inferred from the first hundred lines of the file.
>     - But when inconsistencies are further along than that, ugh.
>
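A minimal sketch of the header convention in item 5, in Python rather than the original Groovy. The function names, the set of supported type names, and the fallback to "string" are assumptions for illustration, not part of the utility described above. Columns without a ":type" suffix are inferred from a sample of the first rows, and a value that breaks the inferred type further down the file is reported with its line number.

```python
import csv
import io

# Hypothetical suffix-to-converter table; the thread only gives "name:int"
# as an example, so the supported type names here are an assumption.
CONVERTERS = {
    "int": int,
    "float": float,
    "string": str,
    "boolean": lambda s: s.strip().lower() == "true",
}


def parse_header(header):
    """Split 'name:int' style cells into (name, type) pairs; type may be None."""
    columns = []
    for cell in header:
        name, sep, type_name = cell.partition(":")
        columns.append((name.strip(), type_name.strip() if sep else None))
    return columns


def infer_type(values):
    """Return the narrowest of int/float that parses every sampled value."""
    if not values:
        return "string"
    for type_name in ("int", "float"):
        try:
            for value in values:
                CONVERTERS[type_name](value)
            return type_name
        except ValueError:
            continue
    return "string"


def read_typed(text, sample_size=100):
    """Parse CSV text into typed records, inferring any undeclared types."""
    rows = list(csv.reader(io.StringIO(text)))
    columns = parse_header(rows[0])
    data = rows[1:]
    sample = data[:sample_size]

    resolved = []
    for i, (name, type_name) in enumerate(columns):
        if type_name is None:
            type_name = infer_type([row[i] for row in sample])
        resolved.append((name, type_name))

    records = []
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(data, start=2):
        record = {}
        for (name, type_name), value in zip(resolved, row):
            try:
                record[name] = CONVERTERS[type_name](value)
            except ValueError as exc:
                raise ValueError(
                    f"line {line_no}, column {name!r}: "
                    f"cannot parse {value!r} as {type_name}"
                ) from exc
        records.append(record)
    return resolved, records
```

Failing loudly with the line number is one way to deal with inconsistencies that show up past the sampled lines; widening the column to "string" and re-reading would be another.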
> I never managed to get the "interactive" part working before I moved
> on from this, but I think it's essential as the user should not have to
> hack on the CSV data much to get it to load. My idea was displaying
> the file headers, getting the user to mark which has the "identifier"
> (for Titan was just a property key under a unique index), asking them
> if they have a map file for that identifier, and finally asking them to
> confirm the types we inferred based on the first 100 lines or a sampling
> of lines or whatever with an option to "just do it already". Then, if the
> user is trying to load a gazillion CSV files from a directory or set of
> directories, we just ask them for "profiles" like this to apply per
> directory, per file name matching some regex or criteria about its
> n x m column shape, or something else to distinguish multiple files
> from each other.
> Same general thing applies to edges. Of course, all this should be
> possible to tuck away in a configuration file, or provide as arguments
> to a "builder" in the REPL somehow - I think that could get confusing
> fast, but with similar hand-holding to the above it could be workable.
>
> For parsing the file, I think it needs reasonable defaults but, like most
> CSV parsing frameworks, should provide the option to change the quote
> character, line terminator, delimiter, skip n lines at the front, n lines
> at the back, and all that stuff.
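
The parsing options listed above could be exposed roughly as follows. This is a sketch on top of Python's standard csv module; the function name and parameter names are hypothetical. Because the text is split into lines before parsing, quoted fields that contain embedded line breaks are not handled here.

```python
import csv


def read_rows(text, delimiter=",", quotechar='"',
              lineterminator=None, skip_head=0, skip_tail=0):
    """Parse CSV text with a configurable dialect, dropping leading/trailing lines."""
    if lineterminator is None:
        # Default: accept the usual \n, \r\n and \r terminators.
        lines = text.splitlines()
    else:
        lines = text.split(lineterminator)
        # A trailing terminator leaves one empty element; drop it.
        if lines and lines[-1] == "":
            lines.pop()

    # Skip n lines at the front and n lines at the back.
    end = max(skip_head, len(lines) - skip_tail)
    body = lines[skip_head:end]

    # csv.reader accepts any iterable of strings, one per line.
    return list(csv.reader(body, delimiter=delimiter, quotechar=quotechar))
```

For example, `read_rows(text, delimiter=";", quotechar="'", skip_head=1, skip_tail=1)` would read a semicolon-separated export that carries a comment line at the top and a totals line at the bottom.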
>
> Hope that helps somewhat - sorry for the spam if this could have gone
> unsaid.
>
> ________________________________________
> From: Stephen Mallette <[email protected]>
> Sent: Wednesday, December 2, 2015 6:55 AM
> To: [email protected]
> Subject: Re: [DISCUSS] Add native CSV loading support for gremlin
> (GraphReader)
>
> Thanks for bringing this up for discussion and offering to work on it. You
> don't make mention of how you will deal with data types - will you have
> some way to give users some fine-grained control of that?
>
>
>
>
>
> On Tue, Dec 1, 2015 at 10:46 AM, Alaa Mahmoud <[email protected]>
> wrote:
>
> > Adding support for loading CSV into a graph using Gremlin's GraphReader
> > will lower the entry barrier for new users. A lot of data is already in
> > CSV format and a lot of existing databases/repositories allow users to
> > export their data as CSV.
> >
> > I'd like to add this capability to the gremlin core as a new GraphReader
> > instance. Since the CSV data doesn't map directly to vertices and edges,
> > I'm planning to do the loading in two steps:
> >
> > *Nodes*
> > The first is to load a CSV file as a vertex file. I'll create a node for
> > every line in the csv and a property for each column on that line. If the
> > csv has column headers, then the names of the columns will be the names
> > of the corresponding vertex properties. Otherwise, they'll be prop1,
> > prop2, etc... (There are other ways to do it as well, but I'm just trying
> > to show the general idea)
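
The vertex step described above can be sketched like this. It uses plain Python dicts as stand-ins for vertices instead of the actual GraphReader interface, and the function name and has_header flag are assumptions for illustration.

```python
import csv
import io


def load_vertices(text, has_header=True):
    """Return one property dict per CSV line, one property per column."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    if has_header:
        names, data = rows[0], rows[1:]
    else:
        # No header row: fall back to prop1, prop2, ... as property names.
        names = [f"prop{i}" for i in range(1, len(rows[0]) + 1)]
        data = rows
    return [dict(zip(names, row)) for row in data]
```

All values stay strings here, which is exactly where the question about data types comes in.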
> >
> > *Edges*
> > The second step is loading the edges csv file which will be in the
> > following format
> >
> > vertex1 prop name (source vertex), vertex2 prop name (destination vertex),
> > bidirectional (TRUE/FALSE), prop1,prop2,prop3,etc...
> >
> > For each line in the edge csv file, the reader will search for a vertex
> > with the vertex1 prop value (the caller needs to ensure it's unique) to
> > find the source vertex, search for a destination vertex with the
> > destination prop value, and then create an edge that ties the two
> > together. We will be creating an edge property for each additional
> > property on the line.
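
A sketch of the edge step, again with plain Python structures rather than the GraphReader interface. Several things here are assumptions, since the proposal leaves them open: the header row is taken to name the lookup property for the source and destination vertices, edges are returned as (source position, destination position, properties) tuples, and bidirectional=TRUE is interpreted as adding a second edge in the reverse direction.

```python
import csv
import io


def load_edges(text, vertices):
    """Resolve each edge line against `vertices` (a list of property dicts)."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    src_key, dst_key = header[0].strip(), header[1].strip()
    prop_names = [h.strip() for h in header[3:]]

    def build_index(key):
        # Map property value -> vertex position; the value must be unique.
        index = {}
        for pos, vertex in enumerate(vertices):
            if key not in vertex:
                continue
            if vertex[key] in index:
                raise ValueError(f"{key!r} value {vertex[key]!r} is not unique")
            index[vertex[key]] = pos
        return index

    src_index = build_index(src_key)
    dst_index = build_index(dst_key)

    edges = []
    # The header is line 1, so the first edge row is line 2.
    for line_no, row in enumerate(data, start=2):
        src_val, dst_val, bidirectional = row[0], row[1], row[2]
        try:
            src, dst = src_index[src_val], dst_index[dst_val]
        except KeyError as exc:
            raise ValueError(
                f"line {line_no}: no vertex with value {exc.args[0]!r}"
            ) from exc
        props = dict(zip(prop_names, row[3:]))
        edges.append((src, dst, props))
        if bidirectional.strip().upper() == "TRUE":
            # Assumption: bidirectional means a second, reversed edge.
            edges.append((dst, src, dict(props)))
    return edges
```

Building the lookup index once keeps the per-line vertex search cheap, and it also surfaces duplicate values up front instead of relying on the caller to guarantee uniqueness.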
> >
> > Thoughts?
> >
> > Alaa
> >
>
