My intent is to add a new instance of GraphReader (GraphCSVReader) to read CSV files, but if the community feels it should be a third-party IO library then that's fine as well.
Alaa

On Fri, Dec 4, 2015 at 10:28 AM, Stephen Mallette <[email protected]> wrote:

> > How about we create an enhanced version of GraphReader that takes a
> > schema and parser parameters (separator character, column headers or
> > not, encoding, etc.)?
>
> I'd be against making that kind of change, as schema is not a first-class
> citizen in TinkerPop. We haven't yet made the leap to include that, and
> I'd say any proposal to deal with it in IO would need to be dealt with in
> the much broader terms of the whole TinkerPop ecosystem. You can search
> the lists for the various discussions that have been had on that if you
> are interested.
>
> On a separate note, I'm not sure if you've explained what your intent is,
> but my personal opinion is that you should develop this capability in
> your own repo and offer it to the community as a third-party IO library.
> Not sure how others feel about it.
>
> On Fri, Dec 4, 2015 at 10:17 AM, Alaa Mahmoud <[email protected]> wrote:
>
> > Thanks Stephen and Dylan for your responses. I was trying to work
> > within what GraphReader currently offers, which isn't ideal. Ideally,
> > GraphReader would allow users to pass a schema for the file being read
> > rather than having each instance figure out a way to do that. The less
> > we require users to modify their data before using TP, the more
> > successful we'll be.
> >
> > How about we create an enhanced version of GraphReader that takes a
> > schema and parser parameters (separator character, column headers or
> > not, encoding, etc.)?
> >
> > Dylan, thanks for sharing how your tool works; it has the same essence
> > as what I'm trying to do. I like the idea of having the types in the
> > column header next to each column name.
> >
> > We can start with something simple and then enhance it as we go.
> > Regards
> >
> > On Wed, Dec 2, 2015 at 11:48 AM, Dylan Bethune-Waddell <[email protected]> wrote:
> >
> > > I wrote a command line utility in Groovy that would do this for
> > > Titan. Here's how it worked:
> > >
> > > 1) Either a file or directory path for vertices/edges was passed.
> > > 2) Optional regex for extracting the vertex label from the file name(s).
> > >    - Default is to split on underscores/dashes/whitespace and take
> > >      element [0] (the label in the file would give more flexibility).
> > >    - These files are batched according to available processors.
> > >    - A transaction was opened to load each file from each batch.
> > > 3) Vertices - 1st column as id property, remaining additional props.
> > >    - Should just be selection of the desired named/positional column.
> > >    - The user should be able to provide an id mapping file:
> > >      a. Restricts ids they care to load by the mapped-to ids.
> > >      b. Shows coverage of their intended id conversion over file lines.
> > > 4) Edges - 1st column v1 id, 2nd label, 3rd v2 id, rest edge props.
> > >    - Should also be generalized to selection/configuration by user.
> > > 5) Type - append after a colon to the column header, e.g. "name:int".
> > >    - Type is often inferred from the first hundred lines of the file.
> > >    - But when inconsistencies are further along than that, ugh.
> > >
> > > I never managed to get the "interactive" part working before I moved
> > > on from this, but I think it's essential, as the user should not have
> > > to hack on the CSV data much to get it to load. My idea was
> > > displaying the file headers, getting the user to mark which has the
> > > "identifier" (for Titan this was just a property key under a unique
> > > index), asking them if they have a map file for that identifier, and
> > > finally asking them to confirm the types we inferred based on the
> > > first 100 lines, or a sampling of lines, or whatever, with an option
> > > to "just do it already".
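Dylan's point 5 (a declared type after a colon in the column header, with inference over a sample of lines as the fallback) could be sketched roughly like this. This is plain Java rather than the actual Groovy tool, and the class and method names (`TypedHeaders`, `declaredType`, `inferType`) are hypothetical:

```java
import java.util.List;

// Sketch: parse "name:int" style column headers, and fall back to
// inferring a type from a sample of lines when no type is declared.
public class TypedHeaders {
    enum ColumnType { INT, DOUBLE, STRING }

    // "name:int" -> explicit type; bare "name" -> null (defer to inference)
    static ColumnType declaredType(String header) {
        int colon = header.lastIndexOf(':');
        if (colon < 0) return null;
        switch (header.substring(colon + 1).toLowerCase()) {
            case "int":    return ColumnType.INT;
            case "double": return ColumnType.DOUBLE;
            default:       return ColumnType.STRING;
        }
    }

    // Infer the narrowest type that every sampled value fits, widening
    // from INT to DOUBLE to STRING the first time a value stops fitting.
    static ColumnType inferType(List<String> sample) {
        ColumnType best = ColumnType.INT;
        for (String v : sample) {
            if (best == ColumnType.INT && !v.matches("-?\\d+"))
                best = ColumnType.DOUBLE;
            if (best == ColumnType.DOUBLE
                    && !v.matches("-?\\d*\\.?\\d+([eE][-+]?\\d+)?"))
                best = ColumnType.STRING;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(declaredType("name:int"));       // INT
        System.out.println(inferType(List.of("1", "2.5"))); // DOUBLE
    }
}
```

This also shows why the "inconsistencies further along than that, ugh" caveat bites: if the sampled lines miss the inconsistent rows, the inferred type is too narrow, which is exactly what the interactive confirmation step would let the user override.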
> > > Then, if the user is trying to load a gazillion CSV files from a
> > > directory or set of directories, we just ask them for "profiles"
> > > like this to apply per directory, per file name matching some regex
> > > or criteria about its n x m column shape, or something else to
> > > distinguish multiple files from each other. The same general thing
> > > applies to edges. Of course, all this should be possible to tuck
> > > away in a configuration file, or to provide as arguments to a
> > > "builder" in the REPL somehow - I think that could get confusing
> > > fast, but with similar hand-holding to the above it could be
> > > workable.
> > >
> > > For parsing the file, I think it needs reasonable defaults but, like
> > > most CSV parsing frameworks, should provide the option to change the
> > > quote character, line terminator, delimiter, skip n lines at the
> > > front, n lines at the back, and all that stuff.
> > >
> > > Hope that helps somewhat - sorry for the spam if this could have
> > > gone unsaid.
> > >
> > > ________________________________________
> > > From: Stephen Mallette <[email protected]>
> > > Sent: Wednesday, December 2, 2015 6:55 AM
> > > To: [email protected]
> > > Subject: Re: [DISCUSS] Add native CSV loading support for gremlin
> > > (GraphReader)
> > >
> > > Thanks for bringing this up for discussion and offering to work on
> > > it. You don't make mention of how you will deal with data types -
> > > will you have some way to give users some fine-grained control of
> > > that?
> > >
> > > On Tue, Dec 1, 2015 at 10:46 AM, Alaa Mahmoud <[email protected]> wrote:
> > >
> > > > Adding support for loading CSV into a graph using Gremlin's
> > > > GraphReader will lower the entry barrier for new users. A lot of
> > > > data is already in CSV format, and a lot of existing
> > > > databases/repositories allow users to export their data as CSV.
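The configurable parsing defaults Dylan describes (quote character, delimiter, lines to skip, header row) might look like the following fluent-builder sketch. `CsvOptions` is a made-up name, not an existing TinkerPop class:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: reasonable defaults with builder-style overrides, plus a
// minimal quote-aware field splitter that honors those options.
public class CsvOptions {
    char delimiter = ',';
    char quote = '"';
    boolean hasHeader = true;
    int skipFront = 0;

    CsvOptions delimiter(char d) { this.delimiter = d; return this; }
    CsvOptions quote(char q) { this.quote = q; return this; }
    CsvOptions hasHeader(boolean h) { this.hasHeader = h; return this; }
    CsvOptions skipFront(int n) { this.skipFront = n; return this; }

    // Split one line into fields; delimiters inside quotes are literal.
    List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == quote) inQuotes = !inQuotes;
            else if (c == delimiter && !inQuotes) {
                fields.add(cur.toString());
                cur.setLength(0);
            } else cur.append(c);
        }
        fields.add(cur.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Quoted field containing the delimiter stays intact.
        System.out.println(new CsvOptions().splitLine("a,\"b,c\",d")); // [a, b,c, d]
    }
}
```

A real implementation would likely delegate to an existing CSV parsing library rather than hand-roll the splitter, but the option surface would be the same.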
> > > >
> > > > I'd like to add this capability to the gremlin core as a new
> > > > GraphReader instance. Since the CSV data doesn't map directly to
> > > > vertices and edges, I'm planning to do the loading in two steps:
> > > >
> > > > *Vertices*
> > > > The first step is to load a CSV file as a vertex CSV file. I'll
> > > > create a vertex for every line in the CSV and a property for each
> > > > column on that line. If the CSV has column headers, then the names
> > > > of the columns will be the names of the corresponding vertex
> > > > properties. Otherwise, it'll be prop1, prop2, etc. (There are
> > > > other ways to do it as well, but I'm just trying to show the
> > > > general idea.)
> > > >
> > > > *Edges*
> > > > The second step is loading the edge CSV file, which will be in the
> > > > following format:
> > > >
> > > > vertex1 prop name (source vertex), vertex2 prop name (destination
> > > > vertex), bidirectional (TRUE/FALSE), prop1, prop2, prop3, etc.
> > > >
> > > > For each line in the edge CSV file, the reader will search for a
> > > > vertex with the vertex1 prop value (the caller needs to ensure
> > > > it's unique) to find the source vertex, search for a destination
> > > > vertex with the destination prop value, and then create an edge
> > > > that ties the two together. We will create an edge property for
> > > > each additional property on the line.
> > > >
> > > > Thoughts?
> > > >
> > > > Alaa
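The two-step loading described in that original proposal can be sketched as follows, with plain maps standing in for a Graph so it runs without TinkerPop on the classpath. In a real GraphCSVReader the equivalents would be vertex/edge creation calls against the Graph API; every name here (`CsvLoaderSketch`, `loadVertices`, `loadEdges`) is hypothetical, and the splitter is naive (no quoting) for brevity:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the two-step proposal: step 1 turns each vertex CSV line
// into a property map keyed by the first column's value; step 2 resolves
// edge lines against those keys and honors the bidirectional flag.
public class CsvLoaderSketch {

    static Map<String, Map<String, String>> loadVertices(List<String> lines) {
        String[] headers = lines.get(0).split(","); // assumes a header row
        Map<String, Map<String, String>> vertices = new LinkedHashMap<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] cols = line.split(",");
            Map<String, String> props = new LinkedHashMap<>();
            for (int i = 0; i < cols.length; i++) props.put(headers[i], cols[i]);
            vertices.put(cols[0], props); // first column assumed unique, per the proposal
        }
        return vertices;
    }

    // Edge line format: sourceKey,destKey,bidirectional(TRUE/FALSE),prop1,...
    static List<String[]> loadEdges(List<String> lines,
                                    Map<String, Map<String, String>> vertices) {
        List<String[]> edges = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.split(",");
            if (!vertices.containsKey(cols[0]) || !vertices.containsKey(cols[1]))
                continue; // skip edges whose endpoints were never loaded
            edges.add(new String[]{cols[0], cols[1]});
            if (Boolean.parseBoolean(cols[2]))
                edges.add(new String[]{cols[1], cols[0]}); // bidirectional flag
        }
        return edges;
    }
}
```

One design note: the proposal's "caller needs to ensure it's unique" constraint shows up here as the map key silently overwriting duplicates, which is exactly the failure mode a schema-aware or interactive reader would want to surface instead.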
