Re: [DISCUSS] Add native CSV loading support for gremlin (GraphReader)

Stephen Mallette Fri, 04 Dec 2015 07:37:34 -0800

>
> How about we create an enhanced version of GraphReader that takes a schema
> and parser parameters (separator character, column headers or not,
> encoding, etc...)?



I'd be against making that kind of change as Schema is not a first class
citizen in TinkerPop.  We haven't yet made that leap to include that and
I'd say any proposal to deal with that in IO would need to be dealt with in
the much broader terms of the whole TinkerPop ecosystem.  You can search
the lists for the various discussions that have been had on that if you are
interested.

On a separate note, I'm not sure if you've explained what your intent is,
but my personal opinion is that you should develop this capability in your
own repo and offer it to the community as a third-party IO library. Not
sure how others feel about it.

On Fri, Dec 4, 2015 at 10:17 AM, Alaa Mahmoud <[email protected]> wrote:

> Thanks Stephen and Dylan for your response. I was trying to work within
> what GraphReader currently offers, which isn't ideal. Ideally GraphReader
> would allow passing a schema for the file being read rather than having each
> instance figure out a way to do that. The less we require users to modify
> their data before using TP, the more successful we'll be.
>
> How about we create an enhanced version of GraphReader that takes a schema
> and parser parameters (separator character, column headers or not,
> encoding, etc...)?
>
> Dylan, thanks for sharing how your tool works and it has the same essence
> as what I'm trying to do. I like the idea of having the types in the column
> header next to each column name.
>
> We can start with something simple and then enhance it as we go.
>
> Regards
>
> On Wed, Dec 2, 2015 at 11:48 AM, Dylan Bethune-Waddell <
> [email protected]> wrote:
>
> > I wrote a command line utility in Groovy that would do this for Titan -
> > here's how it worked:
> >
> > 1) Either a file or directory path for vertices/edges was passed.
> > 2) Optional regex for extracting the vertex label from the file name(s).
> >     - Default is to split on underscores/dash/whitespace and take
> >       element [0] (the label in the file would give more flexibility).
> >     - These files are batched according to available processors.
> >     - A transaction was opened to load each file from each batch.
> > 3) Vertices - 1st column as id property, remaining additional props.
> >     - Should just be selection of the desired named/positional column.
> >     - The user should be able to provide an id mapping file:
> >        a. Restricts ids they care to load by the mapped-to ids.
> >        b. Shows coverage of their intended id conversion over file lines.
> > 4) Edges - 1st column v1 id, 2nd label, 3rd v2 id, rest edge props.
> >     - Should also be generalized to selection/configuration by user.
> > 5) Type - append after colon to the column header e.g. "name:int"
> >     - Type is often inferred from the first hundred lines of the file.
> >     - But when inconsistencies are further along than that, ugh.
> >
> > I never managed to get the "interactive" part working before I moved
> > on from this, but I think it's essential as the user should not have to
> > hack on the CSV data much to get it to load. My idea was displaying
> > the file headers, getting the user to mark which has the "identifier"
> > (for Titan this was just a property key under a unique index), asking
> > them if they have a map file for that identifier, and finally asking
> > them to confirm the types we inferred based on the first 100 lines or a
> > sampling of lines or whatever, with an option to "just do it already".
> > Then, if the user is trying to load a gazillion CSV files from a
> > directory or set of directories, we just ask them for "profiles" like
> > this to apply per directory, per file name matching some regex or
> > criteria about its n x m column shape, or something else to distinguish
> > multiple files from each other.
> > Same general thing applies to edges. Of course, all this should be
> > possible to tuck away in a configuration file, or provide as arguments
> > to a "builder" in the REPL somehow - I think that could get confusing
> > fast, but with similar hand-holding to the above it could be workable.
> >
> > For parsing the file, I think it needs reasonable defaults but, like
> > most CSV parsing frameworks, should provide the option to change the
> > quote character, line terminator, delimiter, skip n lines at the front,
> > n lines at the back, and all that stuff.
> >
> > Hope that helps somewhat - sorry for the spam if this could have gone
> > unsaid.
> >
> > ________________________________________
> > From: Stephen Mallette <[email protected]>
> > Sent: Wednesday, December 2, 2015 6:55 AM
> > To: [email protected]
> > Subject: Re: [DISCUSS] Add native CSV loading support for gremlin
> > (GraphReader)
> >
> > Thanks for bringing this up for discussion and offering to work on it.
> > You don't make mention of how you will deal with data types - will you
> > have some way to give users some fine-grained control of that?
> >
> > On Tue, Dec 1, 2015 at 10:46 AM, Alaa Mahmoud <[email protected]>
> > wrote:
> >
> > > Adding support for loading CSV into a graph using Gremlin's GraphReader
> > > will lower the entry barrier for new users. A lot of data is already in
> > > CSV format and a lot of existing databases/repositories allow users to
> > > export their data as CSV.
> > >
> > > I'd like to add this capability to the gremlin core as a new
> > > GraphReader instance. Since the CSV data doesn't map directly to
> > > vertices and edges, I'm planning to do the loading in two steps:
> > >
> > > *Nodes*
> > > The first is to load a vertex CSV file. I'll create a node for every
> > > line in the CSV and a property for each column on that line. If the CSV
> > > has column headers, then the names of the columns will be the names of
> > > the corresponding vertex properties. Otherwise, they'll be prop1, prop2,
> > > etc... (There are other ways to do it as well, but I'm just trying to
> > > show the general idea)
> > >
> > > *Edges*
> > > The second step is loading the edges CSV file, which will be in the
> > > following format:
> > >
> > > vertex1 prop name (source vertex), vertex2 prop name (destination
> > > vertex), bidirectional (TRUE/FALSE), prop1,prop2,prop3,etc...
> > >
> > > For each line in the edge CSV file, the reader will search for a vertex
> > > with the vertex1 prop value (the caller needs to ensure it's unique) to
> > > find the source vertex, search for a destination vertex with the
> > > destination prop value, and then create an edge that ties the two
> > > together. We will be creating an edge property for each additional
> > > property on the line.
> > >
> > > Thoughts?
> > >
> > > Alaa
> > >
> >
>
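
For readers following the thread, the typed-header idea Dylan describes ("name:int", with inference from the first lines as a fallback) might look roughly like the sketch below. It is plain Python rather than Groovy or TinkerPop code, and every name in it (CONVERTERS, parse_header, infer_type, read_typed_rows) is hypothetical, as is the set of supported type names; nothing here comes from Dylan's utility or from GraphReader.

```python
import csv
import io

# Hypothetical mapping from header type suffixes to converter functions.
CONVERTERS = {
    "int": int,
    "float": float,
    "string": str,
    "bool": lambda s: s.strip().lower() == "true",
}


def parse_header(header):
    """Split 'name:int' style cells into (name, type) pairs.

    The type is None when a column carries no ':type' suffix.
    """
    columns = []
    for cell in header:
        name, sep, type_name = cell.partition(":")
        columns.append((name.strip(), type_name.strip() if sep else None))
    return columns


def infer_type(samples):
    """Return the narrowest of int/float/string that fits every sample."""
    for type_name in ("int", "float"):
        try:
            for sample in samples:
                CONVERTERS[type_name](sample)
            return type_name
        except ValueError:
            continue
    return "string"


def read_typed_rows(text, delimiter=",", sample_size=100):
    """Parse CSV text into dicts, converting values by declared or inferred type."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return []
    columns, data = parse_header(rows[0]), rows[1:]
    resolved = []
    for i, (name, type_name) in enumerate(columns):
        if type_name is None:
            # Assumption: infer only from the first sample_size rows, as in the thread.
            type_name = infer_type([row[i] for row in data[:sample_size]])
        if type_name not in CONVERTERS:
            raise ValueError("unsupported column type: " + type_name)
        resolved.append((name, type_name))
    return [
        {name: CONVERTERS[t](row[i]) for i, (name, t) in enumerate(resolved)}
        for row in data
    ]
```

Because inference only looks at the first sample_size rows, a value further down the file that does not fit the inferred type raises a ValueError at conversion time, which is the inconsistency problem Dylan mentions. An explicit type suffix in the header avoids it.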

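The two-step load from the original proposal (vertices first, then edges resolved by a unique property value) can be sketched in the same way. This is a minimal illustration over plain Python lists and dicts, not a GraphReader implementation. The function names are hypothetical, and two details are assumptions the thread does not settle: the first two header cells of the edge file name the lookup properties, and a TRUE in the bidirectional column is modelled as a second edge in the reverse direction.

```python
import csv
import io


def load_vertices(text, has_header=True):
    """Create one vertex (a dict of properties) per CSV line."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    if has_header:
        names, data = rows[0], rows[1:]
    else:
        # No header row: fall back to prop1, prop2, ... as in the proposal.
        names = ["prop%d" % (i + 1) for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(names, row)) for row in data]


def load_edges(text, vertices):
    """Create edges as (source_index, destination_index, properties) tuples."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    header = rows[0]
    # Assumption: columns 0 and 1 name the lookup properties, column 2 is
    # the bidirectional flag, and the rest are edge property names.
    src_key, dst_key, prop_names = header[0], header[1], header[3:]

    def find(key, value):
        matches = [i for i, v in enumerate(vertices) if v.get(key) == value]
        if len(matches) != 1:
            raise ValueError("expected exactly one vertex with %s=%s" % (key, value))
        return matches[0]

    edges = []
    for row in rows[1:]:
        src, dst = find(src_key, row[0]), find(dst_key, row[1])
        props = dict(zip(prop_names, row[3:]))
        edges.append((src, dst, props))
        if row[2].strip().upper() == "TRUE":
            # Assumption: bidirectional means an extra edge in the reverse direction.
            edges.append((dst, src, dict(props)))
    return edges
```

The linear scan in find keeps the sketch short. For files of any real size, an index from property value to vertex would be the obvious replacement, which is also where the uniqueness requirement from the proposal would be enforced.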