My intent is to add a new instance of GraphReader (GraphCSVReader) to read CSV files, but if the community feels it should be a third-party IO library then that's fine as well.
Alaa

On Fri, Dec 4, 2015 at 10:28 AM, Stephen Mallette <[email protected]> wrote:

> > How about we create an enhanced version of GraphReader that takes a
> > schema and parser parameters (separator character, column headers or
> > not, encoding, etc.)?
>
> I'd be against making that kind of change, as schema is not a first-class
> citizen in TinkerPop. We haven't yet made the leap to include that, and
> I'd say any proposal to deal with it in IO would need to be dealt with in
> the much broader terms of the whole TinkerPop ecosystem. You can search
> the lists for the various discussions that have been had on that if you
> are interested.
>
> On a separate note, I'm not sure if you've explained what your intent is,
> but my personal opinion is that you should develop this capability in
> your own repo and offer it to the community as a third-party IO library.
> Not sure how others feel about it.
>
> On Fri, Dec 4, 2015 at 10:17 AM, Alaa Mahmoud <[email protected]> wrote:
>
> > Thanks Stephen and Dylan for your responses. I was trying to work
> > within what GraphReader currently offers, which isn't ideal. Ideally,
> > GraphReader would allow users to pass a schema for the file being read
> > rather than having each instance figure out a way to do that. The less
> > we require users to modify their data before using TP, the more
> > successful we'll be.
> >
> > How about we create an enhanced version of GraphReader that takes a
> > schema and parser parameters (separator character, column headers or
> > not, encoding, etc.)?
> >
> > Dylan, thanks for sharing how your tool works; it has the same essence
> > as what I'm trying to do. I like the idea of having the types in the
> > column header next to each column name.
> >
> > We can start with something simple and then enhance it as we go.
> > Regards
> >
> > On Wed, Dec 2, 2015 at 11:48 AM, Dylan Bethune-Waddell <[email protected]> wrote:
> >
> > > I wrote a command line utility in Groovy that would do this for
> > > Titan. Here's how it worked:
> > >
> > > 1) Either a file or directory path for vertices/edges was passed.
> > > 2) Optional regex for extracting the vertex label from the file name(s).
> > >    - Default is to split on underscores/dashes/whitespace and take
> > >      element [0] (the label in the file would give more flexibility).
> > >    - These files are batched according to available processors.
> > >    - A transaction was opened to load each file from each batch.
> > > 3) Vertices - 1st column as id property, remaining additional props.
> > >    - Should just be selection of the desired named/positional column.
> > >    - The user should be able to provide an id mapping file:
> > >      a. Restricts ids they care to load by the mapped-to ids.
> > >      b. Shows coverage of their intended id conversion over file lines.
> > > 4) Edges - 1st column v1 id, 2nd label, 3rd v2 id, rest edge props.
> > >    - Should also be generalized to selection/configuration by user.
> > > 5) Type - append after a colon to the column header, e.g. "name:int".
> > >    - Type is often inferred from the first hundred lines of the file.
> > >    - But when inconsistencies are further along than that, ugh.
> > >
> > > I never managed to get the "interactive" part working before I moved
> > > on from this, but I think it's essential, as the user should not have
> > > to hack on the CSV data much to get it to load. My idea was
> > > displaying the file headers, getting the user to mark which has the
> > > "identifier" (for Titan this was just a property key under a unique
> > > index), asking them if they have a map file for that identifier, and
> > > finally asking them to confirm the types we inferred based on the
> > > first 100 lines, or a sampling of lines, or whatever, with an option
> > > to "just do it already".
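Dylan's point 5 (a declared type after a colon in the column header, with inference over a sample of lines as the fallback) could be sketched roughly like this. This is plain Java rather than the actual Groovy tool, and the class and method names (`TypedHeaders`, `declaredType`, `inferType`) are hypothetical:

```java
import java.util.List;

// Sketch: parse "name:int" style column headers, and fall back to
// inferring a type from a sample of lines when no type is declared.
public class TypedHeaders {
    enum ColumnType { INT, DOUBLE, STRING }

    // "name:int" -> explicit type; bare "name" -> null (defer to inference)
    static ColumnType declaredType(String header) {
        int colon = header.lastIndexOf(':');
        if (colon < 0) return null;
        switch (header.substring(colon + 1).toLowerCase()) {
            case "int":    return ColumnType.INT;
            case "double": return ColumnType.DOUBLE;
            default:       return ColumnType.STRING;
        }
    }

    // Infer the narrowest type that every sampled value fits, widening
    // from INT to DOUBLE to STRING the first time a value stops fitting.
    static ColumnType inferType(List<String> sample) {
        ColumnType best = ColumnType.INT;
        for (String v : sample) {
            if (best == ColumnType.INT && !v.matches("-?\\d+"))
                best = ColumnType.DOUBLE;
            if (best == ColumnType.DOUBLE
                    && !v.matches("-?\\d*\\.?\\d+([eE][-+]?\\d+)?"))
                best = ColumnType.STRING;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(declaredType("name:int"));       // INT
        System.out.println(inferType(List.of("1", "2.5"))); // DOUBLE
    }
}
```

This also shows why the "inconsistencies further along than that, ugh" caveat bites: if the sampled lines miss the inconsistent rows, the inferred type is too narrow, which is exactly what the interactive confirmation step would let the user override.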
> > > Then, if the user is trying to load a gazillion CSV files from a
> > > directory or set of directories, we just ask them for "profiles"
> > > like this to apply per directory, per file name matching some regex
> > > or criteria about its n x m column shape, or something else to
> > > distinguish multiple files from each other. The same general thing
> > > applies to edges. Of course, all this should be possible to tuck
> > > away in a configuration file, or to provide as arguments to a
> > > "builder" in the REPL somehow - I think that could get confusing
> > > fast, but with similar hand-holding to the above it could be
> > > workable.
> > >
> > > For parsing the file, I think it needs reasonable defaults but, like
> > > most CSV parsing frameworks, should provide the option to change the
> > > quote character, line terminator, delimiter, skip n lines at the
> > > front, n lines at the back, and all that stuff.
> > >
> > > Hope that helps somewhat - sorry for the spam if this could have
> > > gone unsaid.
> > >
> > > ________________________________________
> > > From: Stephen Mallette <[email protected]>
> > > Sent: Wednesday, December 2, 2015 6:55 AM
> > > To: [email protected]
> > > Subject: Re: [DISCUSS] Add native CSV loading support for gremlin
> > > (GraphReader)
> > >
> > > Thanks for bringing this up for discussion and offering to work on
> > > it. You don't make mention of how you will deal with data types -
> > > will you have some way to give users some fine-grained control of
> > > that?
> > >
> > > On Tue, Dec 1, 2015 at 10:46 AM, Alaa Mahmoud <[email protected]> wrote:
> > >
> > > > Adding support for loading CSV into a graph using Gremlin's
> > > > GraphReader will lower the entry barrier for new users. A lot of
> > > > data is already in CSV format, and a lot of existing
> > > > databases/repositories allow users to export their data as CSV.
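The configurable parsing defaults Dylan describes (quote character, delimiter, lines to skip, header row) might look like the following fluent-builder sketch. `CsvOptions` is a made-up name, not an existing TinkerPop class:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: reasonable defaults with builder-style overrides, plus a
// minimal quote-aware field splitter that honors those options.
public class CsvOptions {
    char delimiter = ',';
    char quote = '"';
    boolean hasHeader = true;
    int skipFront = 0;

    CsvOptions delimiter(char d) { this.delimiter = d; return this; }
    CsvOptions quote(char q) { this.quote = q; return this; }
    CsvOptions hasHeader(boolean h) { this.hasHeader = h; return this; }
    CsvOptions skipFront(int n) { this.skipFront = n; return this; }

    // Split one line into fields; delimiters inside quotes are literal.
    List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == quote) inQuotes = !inQuotes;
            else if (c == delimiter && !inQuotes) {
                fields.add(cur.toString());
                cur.setLength(0);
            } else cur.append(c);
        }
        fields.add(cur.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Quoted field containing the delimiter stays intact.
        System.out.println(new CsvOptions().splitLine("a,\"b,c\",d")); // [a, b,c, d]
    }
}
```

A real implementation would likely delegate to an existing CSV parsing library rather than hand-roll the splitter, but the option surface would be the same.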
> > > >
> > > > I'd like to add this capability to the gremlin core as a new
> > > > GraphReader instance. Since the CSV data doesn't map directly to
> > > > vertices and edges, I'm planning to do the loading in two steps:
> > > >
> > > > *Vertices*
> > > > The first step is to load a CSV file as a vertex CSV file. I'll
> > > > create a vertex for every line in the CSV and a property for each
> > > > column on that line. If the CSV has column headers, then the names
> > > > of the columns will be the names of the corresponding vertex
> > > > properties. Otherwise, it'll be prop1, prop2, etc. (There are
> > > > other ways to do it as well, but I'm just trying to show the
> > > > general idea.)
> > > >
> > > > *Edges*
> > > > The second step is loading the edge CSV file, which will be in the
> > > > following format:
> > > >
> > > > vertex1 prop name (source vertex), vertex2 prop name (destination
> > > > vertex), bidirectional (TRUE/FALSE), prop1, prop2, prop3, etc.
> > > >
> > > > For each line in the edge CSV file, the reader will search for a
> > > > vertex with the vertex1 prop value (the caller needs to ensure
> > > > it's unique) to find the source vertex, search for a destination
> > > > vertex with the destination prop value, and then create an edge
> > > > that ties the two together. We will create an edge property for
> > > > each additional property on the line.
> > > >
> > > > Thoughts?
> > > >
> > > > Alaa
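The two-step loading described in that original proposal can be sketched as follows, with plain maps standing in for a Graph so it runs without TinkerPop on the classpath. In a real GraphCSVReader the equivalents would be vertex/edge creation calls against the Graph API; every name here (`CsvLoaderSketch`, `loadVertices`, `loadEdges`) is hypothetical, and the splitter is naive (no quoting) for brevity:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the two-step proposal: step 1 turns each vertex CSV line
// into a property map keyed by the first column's value; step 2 resolves
// edge lines against those keys and honors the bidirectional flag.
public class CsvLoaderSketch {

    static Map<String, Map<String, String>> loadVertices(List<String> lines) {
        String[] headers = lines.get(0).split(","); // assumes a header row
        Map<String, Map<String, String>> vertices = new LinkedHashMap<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] cols = line.split(",");
            Map<String, String> props = new LinkedHashMap<>();
            for (int i = 0; i < cols.length; i++) props.put(headers[i], cols[i]);
            vertices.put(cols[0], props); // first column assumed unique, per the proposal
        }
        return vertices;
    }

    // Edge line format: sourceKey,destKey,bidirectional(TRUE/FALSE),prop1,...
    static List<String[]> loadEdges(List<String> lines,
                                    Map<String, Map<String, String>> vertices) {
        List<String[]> edges = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.split(",");
            if (!vertices.containsKey(cols[0]) || !vertices.containsKey(cols[1]))
                continue; // skip edges whose endpoints were never loaded
            edges.add(new String[]{cols[0], cols[1]});
            if (Boolean.parseBoolean(cols[2]))
                edges.add(new String[]{cols[1], cols[0]}); // bidirectional flag
        }
        return edges;
    }
}
```

One design note: the proposal's "caller needs to ensure it's unique" constraint shows up here as the map key silently overwriting duplicates, which is exactly the failure mode a schema-aware or interactive reader would want to surface instead.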
