Dylan - thanks for your input. What you said gets at the direction I was heading when I asked how "types" would be handled, and it underscores what I perceive to be a greater level of complexity in this task than the standard GraphReader/Writer interfaces handle well.
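
To make the typing question concrete, here is a rough Python sketch of the "name:int" header convention Dylan describes in point 5 below. It is only an illustration: the type names, the `CONVERTERS` table, and both function names are assumptions made for this example and are not part of any TinkerPop or Titan API. Types are declared rather than inferred, and a bad value fails with its line number, which is one way to deal with inconsistencies that sit past the first hundred lines.

```python
import csv
import io

# Hypothetical mapping from header type suffixes to converters; the suffix
# names are assumptions, not part of any TinkerPop or Titan API.
CONVERTERS = {
    "int": int,
    "float": float,
    "bool": lambda s: s.strip().lower() == "true",
    "string": str,
}


def parse_header(header):
    """Split headers such as 'age:int' into (name, converter) pairs."""
    columns = []
    for cell in header:
        name, sep, type_name = cell.partition(":")
        # Columns without a ':type' suffix fall back to plain strings.
        key = type_name.strip() if sep else "string"
        if key not in CONVERTERS:
            raise ValueError(f"unknown type {key!r} in column {cell!r}")
        columns.append((name.strip(), CONVERTERS[key]))
    return columns


def read_typed_rows(text):
    """Yield one dict per data line, converting values per the header."""
    reader = csv.reader(io.StringIO(text))
    columns = parse_header(next(reader))
    for line_no, row in enumerate(reader, start=2):
        record = {}
        for (name, convert), raw in zip(columns, row):
            try:
                record[name] = convert(raw)
            except ValueError as exc:
                # Report the line so late inconsistencies are easy to find.
                raise ValueError(
                    f"line {line_no}, column {name!r}: {exc}"
                ) from exc
        yield record


sample = "id,name:string,age:int\nv1,marko,29\nv2,vadas,27\n"
rows = list(read_typed_rows(sample))
```

Whether such annotations belong in the file header or in a separate configuration is exactly the kind of decision that does not fit neatly into the existing reader interfaces.
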

On Wed, Dec 2, 2015 at 11:48 AM, Dylan Bethune-Waddell <[email protected]> wrote:

> I wrote a command line utility in Groovy that would do this for Titan -
> here's how it worked:
>
> 1) Either a file or directory path for vertices/edges was passed.
> 2) Optional regex for extracting the vertex label from the file name(s).
>    - Default is to split on underscores/dash/whitespace and take
>      element [0] (the label in the file would give more flexibility).
>    - These files are batched according to available processors.
>    - A transaction was opened to load each file from each batch.
> 3) Vertices - 1st column as id property, remaining additional props.
>    - Should just be selection of the desired named/positional column.
>    - The user should be able to provide an id mapping file:
>      a. Restricts ids they care to load by the mapped-to ids.
>      b. Shows coverage of their intended id conversion over file lines.
> 4) Edges - 1st column v1 id, 2nd label, 3rd v2 id, rest edge props.
>    - Should also be generalized to selection/configuration by user.
> 5) Type - append after colon to the column header e.g. "name:int"
>    - Type is often inferred from the first hundred lines of the file.
>    - But when inconsistencies are further along than that, ugh.
>
> I never managed to get the "interactive" part working before I moved
> on from this, but I think it's essential as the user should not have to
> hack on the CSV data much to get it to load. My idea was displaying
> the file headers, getting the user to mark which has the "identifier"
> (for Titan was just a property key under a unique index), asking them
> if they have a map file for that identifier, and finally asking them to
> confirm the types we inferred based on the first 100 lines or a sampling
> of lines or whatever with an option to "just do it already". Then, if the
> user is trying to load a gazillion CSV files from a directory or set of
> directories, we just ask them for "profiles" like this to apply per
> directory, per file name matching some regex or criteria about its n x m
> column shape, or something else to distinguish multiple files from each
> other. Same general thing applies to edges. Of course, all this should be
> possible to tuck away in a configuration file, or provide as arguments
> to a "builder" in the REPL somehow - I think that could get confusing
> fast, but with similar hand-holding to the above it could be workable.
>
> For parsing the file, I think it needs reasonable defaults but like most
> CSV parsing frameworks, provide the option to change the quote
> character, line terminator, delimiter, skip n lines at the front, n lines
> at the back, and all that stuff.
>
> Hope that helps somewhat - sorry for the spam if this could have gone
> unsaid.
>
> ________________________________________
> From: Stephen Mallette <[email protected]>
> Sent: Wednesday, December 2, 2015 6:55 AM
> To: [email protected]
> Subject: Re: [DISCUSS] Add native CSV loading support for gremlin
> (GraphReader)
>
> Thanks for bringing this up for discussion and offering to work on it. You
> don't make mention of how you will deal with data types - will you have
> some way to give users some fine-grained control of that?
>
> On Tue, Dec 1, 2015 at 10:46 AM, Alaa Mahmoud <[email protected]> wrote:
>
> > Adding support for loading CSV into a graph using Gremlin's GraphReader
> > will lower the entry barrier for new users. A lot of data is already in
> > CSV format and a lot of existing databases/repositories allow users to
> > export their data as CSV.
> >
> > I'd like to add this capability to the gremlin core as a new GraphReader
> > instance. Since the CSV data doesn't map directly to nodes and vertexes,
> > I'm planning to do the loading on two steps:
> >
> > *Nodes*
> > The first is to load a CSV as vertex CSV file. I'll create a node for
> > every line in the csv and a property for each column on that line. If
> > the csv has column headers, then the names of the columns will be the
> > names of the corresponding vertex property. Otherwise, It'll be prop1,
> > prop2, etc... (There are other ways to do it as well, but I'm just
> > trying to show the general idea)
> >
> > *Edges*
> > The second step is loading the edges csv file which will be in the
> > following format
> >
> > vertex1 prop name (source vertex), vertex2 prop name (destination
> > vertex), bidirectional (TRUE/FALSE), prop1,prop2,prop3,etc...
> >
> > For each line in the edge csv file, the reader will search for a vertex
> > with the vertex1 prop value (caller need to ensure it's unique) to find
> > the source vertex, search for a destination vertex with destination prop
> > value and then create an edge that ties the two together. We will be
> > creating an edge property for each additional property on the line.
> >
> > Thoughts?
> >
> > Alaa
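
For reference, the two-step load Alaa proposes above could be sketched roughly as follows. This is plain Python over in-memory dicts and tuples, not the GraphReader API; the function names, the choice of `name` as the unique lookup property, and the treatment of a TRUE bidirectional flag as a second, reversed edge are all assumptions made for illustration.

```python
import csv
import io


def load_vertices(text, key):
    """Step 1: one vertex (a dict of properties) per line, indexed by `key`."""
    index = {}
    for row in csv.DictReader(io.StringIO(text)):
        # The caller is expected to guarantee uniqueness; fail loudly if not.
        if row[key] in index:
            raise ValueError(f"duplicate {key} value: {row[key]!r}")
        index[row[key]] = dict(row)
    return index


def load_edges(text, vertices):
    """Step 2: resolve both endpoints through the index, then add the edge."""
    edges = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    prop_names = header[3:]  # columns after source, destination, bidirectional
    for source, destination, bidirectional, *values in reader:
        for endpoint in (source, destination):
            if endpoint not in vertices:
                raise KeyError(f"no vertex with key {endpoint!r}")
        props = dict(zip(prop_names, values))
        edges.append((source, destination, props))
        # Assumption: a TRUE flag is modelled as a second, reversed edge.
        if bidirectional.strip().upper() == "TRUE":
            edges.append((destination, source, props))
    return edges


vertex_csv = "name,age\nmarko,29\nvadas,27\n"
edge_csv = "source,destination,bidirectional,weight\nmarko,vadas,TRUE,0.5\n"

vertices = load_vertices(vertex_csv, key="name")
edges = load_edges(edge_csv, vertices)
```

Note that every property value stays a string here, which is the gap the data type question earlier in the thread is pointing at.
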
