Dear Andy, I've submitted a proposal [1] to GSoC, according to our previous discussions. Please let me know if anything can be improved. Thanks a lot!
Cheers, Ying Jiang [1] http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/jpz6311whu/5632763709358080 On Mon, Mar 17, 2014 at 10:17 PM, Andy Seaborne <[email protected]> wrote: > On 16/03/14 04:31, Ying Jiang wrote: >> >> Dear Andy, >> >> I greatly appreciate your detailed explanations. I've studied all the >> examples and the links you mentioned. I'll try to summarise here with >> further questions below: >> >> 1. We have 2 possible ways for the project: "variables-as-columns" and >> "property tables". I can understand both the ideas, thanks to your >> instructions. The former one has its issues you pointed out, and the >> latter one seems to make more sense for the users. Do you mean we >> should discard the former one and focus on the latter in this project? > > > Yes - "predicates-for-columns" = "property tables" > > From that, you can recover "variables-as-columns" by query pattern. The > reverse is messy at best. Either very unnatural variable names to stop > clashes or beign careful about scoping (and that will confuse people). > > >> 2. We can have some lessons learned from SQL-to-RDF work. But CSV >> (even regular-shaped CSV) is different from database in some ways, >> which requires us to dig in deeper on the details. Some questions >> like: > > > The W3C "CSV on the Web Working Group" [1] is working on a standard > mechanism for converting CSV to other forms, RDF included. The details of > that mechanism aren't clear yet and won't be in time for the project - it's > an area that (my current belief) will chop and change a fair bit in getting > to a final specification. > > > The area of CSV-RDF is bigger than a GSoC project anyway and fairly open > ended given all the sorts of the things people do with CSV files (e.g. > encoding author lists in fields). > > But there is a simpler case - one need is a "direct mapping" whereby a > CSV file with no additional metadata is mapped to RDF. I think we can focus > on a design for this in the project. > > The translation is fixed : blank node for each row (addresses the primary > key issue - and alternative below), the base URL of the CSV file is used to > generate the predicate names. > > Then, the project gets all the machinery working - otherwise the output will > CSV to RDF without the Jena architectural chnages to support it in the long > term. > > [1] https://www.w3.org/2013/csvw/wiki/Main_Page > > >> 2.1 How to determine the data type of the column? All the values in >> CSV are firstly parsed as Strings line by line. Suppose the parser >> found a number string of "123000.0", how can we know whether it's an >> integer, a float/double or even just a string in RDF? > > > Initially, they can be strings. > > Later, and maybe an option the user can turn on, then a dynamic choice which > is a posh way of saying attempt to parse it as an integer and if it passes, > it's an integer. Spreadsheets do this guessing. > > "Duck datatyping" - if it looks like an integer (decimal, double, date) it > is an integer (decimal, double, date). > > Actually, this is then the same as tokenizing and there is code to reuse to > do that. > > >> 2.2 How to deal with the namespaces? RDF requires that the subjects >> and the predicates are URIs. We need to pass in the namespaces (or >> just the default namespaces) to make URIs by combining the namespaces >> with the values in CSV. Things may get more complicated if different >> columns are to be bound with different namespaces. > > > Subject a can be blank nodes which is useful because each row is then a new > blank node. > > One row written in RDF might be: > > > [ csv:row 1 ; :Town "Southton" ; :Population 123000 ] . > > or > > > _:b0 csv:row 1 ; > :Town "Southton" ; > :Population 123000 . > > It's the same RDF triples (3 of them). > > For predicates, suppose the URL of the CSV file is <FILE> then the columns > can be <FILE#Town> and <FILE#Population>. > > Rules or SPARQL Update can be used to turn that into a better data model if > the users wants to write that code. > > >> 2.3 The hp 2006 report [1] says "Jena supports three kinds of property >> tables as well as a triple store". The "town" example you provided >> conforms to the "single-valued" property table. Shall we consider the >> others (e.g. the "multi-valued" one and the "triple store" one) in >> this project? Does Jena in the latest release still support these >> property tables? If so, where're the related source codes? > > > Single-valued. > > In the CSV-WG it looks like duplicate column names are not going to be > supported (at best, the parser has to make then unique by adding "1", "2" > etc). > > Despite what the report says, the code didn't make it into the public Jena > codebase. (And we have removed the old RDB subsystem it refers to.) > > >> 2.4 There's no "primary key" definition in CSV. All the RDF are not >> OWL in fact. How do we know the column in CSV is uniquely defining? It >> seems CSV lacks of some kind of "metadata" of the columns and the >> values. If we have such metadata, how to pass in the namespace of the >> IRI template of http://data/town/{Town} (something related to the >> question 2.2)? > > > It's not necessary to have a defined primary row - that is generated subject > URI. It might be nice if available but that's metadata. > > So one of: > 1/ The triples for each row have a blank node for subject > 2/ The triples for row N have a URI which is <FILE#_N>. > > In both cases, the subject node is generated automatically. > > >> 3. For the "property tables" way, it seems that all we need to do is >> to resolve the problems in 2., and to code "GraphCSV" accordingly. I >> can make the GraphCSV class by implementing the Graph interface. In >> this way, for Jena ARP, a CSV table is actually a Graph, without any >> differences from other types of Graphs. It looks like that there's no >> need to introduce TABLE and FROM TABLE clauses in the SPARQL language >> grammar. We can just use the existing GRAPH, FROM and FROM NAMED >> clauses for the CSV "property tables", can't we? > > > s/ARP/ARQ/ -- ARP is the RDF/XML parser; ARQ is the query engine :-) > > Yes - correct. > > In the later stages of the project, there is an item to make OpExecutor > (which is the class that actually drives the SPARQL execution) do better for > GraphCSV than just treating it as a Graph by accessing the PropertyTable > behind it. > > The big gain for PropertyTables is the space saving they enable as well as > the possibility of making them persistent in a special storage system (not > in this project but the design should not make that too hard at some later > time). > > Andy > > >> >> Best regards, >> Ying Jiang >> >> [1] http://www.hpl.hp.com/techreports/2006/HPL-2006-140.pdf >> >> >> >> On Mon, Mar 10, 2014 at 10:50 PM, Andy Seaborne <[email protected]> wrote: >>> >>> Hi Ying, >>> >>> Good questions. I'll try to give a response to the specific points >>> you've >>> brought up but also there is a different I want to put forward for >>> discussion. >>> >>> I'll write up a first draft of a project plan then we can see if the size >>> and scope is realistic. >>> >>> You asked about whether variables are column names. That is how TARQL >>> and >>> SPARQL VALUES works but I've realised there is a different approach and >>> it's >>> one that will give a better system. It is to translate the CSV to RDF, >>> and >>> this may be materialized or dynamically mapped. If "materialized" it's >>> likely to be a lot bigger; as "property tables" or somethign inspired by >>> that idea, it'll be more compact. >>> >>> There are some issues with variables-as-columns include: >>> >>> 1/ Fixed variable names don't combine with other part of a query pattern >>> very well. >>> >>> If there is common use of the same name it a join - that's what a natural >>> join in SQL is. If there are two tables, then ?a is overloaded. If >>> column >>> names are used to derive a variable name, we may not want to equate them >>> in >>> the query because column names in different CSV files weren't designed >>> with >>> that in mind. >>> >>> 2/ You can't describe (in RDF) the data very easily - e.g. annotate that >>> a >>> column is of years. >>> >>> 3/ It needs the language to change (i.e. TABLE to access it) >>> >>> In TARQL, which is focusing on a controlled transform from CSV to RDF, it >>> works out quite nicely - variables go into the CONSTRUCT template. It >>> produces RDF. >>> >>> Property tables are a style of approach where the CSV data is accessed as >>> RDF. >>> >>> The data table columns be predicate URIs. The data table itself is an >>> RDF >>> graph of regular structure. It can be accessed with normal (unmodified) >>> SPARQL syntax. It would be better if the storage and execution of that >>> part >>> of the SPARQL query were adapted to such regular data. Something for >>> after >>> getting an initial cut down. >>> >>> Suppose we have a CSV file: >>> ------------------- >>> Town,Population >>> Southton,123000 >>> Northville,654000 >>> ------------------- >>> >>> One header row, two data rows. >>> >>> Aside: this is regular-shaped CSV (and some CSV files are definitely not >>> regular at all!). There is the current editors working draft from the CSV >>> on >>> the Web Working Group (not yet published, likely to change, only part of >>> the >>> picture, etc etc) >>> >>> http://w3c.github.io/csvw/syntax/ >>> >>> which is defining a more regular data out of CSV. This is the target for >>> the CSV work: table shaped CSV; not arbitrary, irregularly shaped CSV. >>> >>> There is no way the working group will have standardised any CSV to RDF >>> mapping in the lifetime of the GSoC project but the WG charter says it >>> must >>> be covered. So the mapping below is made up and ahead of where the >>> working >>> group is currently but a standardized, "direct mapping" (no metadata, no >>> templates) style is going to happen. The mapping details may change but >>> the >>> general approach is clear. >>> >>> As RDF this might be >>> >>> ------------- >>> @prefix : <http://example/table> . >>> @prefix csv: <http://w3c/future-csv-vocab/> . >>> >>> [ csv:row 1 ; :Town "Southton" ; :Population 123000 ] . >>> [ csv:row 2 ; :Town "Northville" ; :Population 654000 ] . >>> ------------- >>> >>> or without the bnode abbreviation: >>> >>> ------------- >>> @prefix : <http://example/table> . >>> @prefix csv: <http://w3c/future-csv-vocab/> . >>> >>> _:b0 csv:row 1 ; >>> :Town "Southton" ; >>> :Population 123000 . >>> >>> _:b1 csv:row 2 ; >>> :Town "Northville" ; >>> :Population 654000 . >>> ------------- >>> >>> >>> Each row is modelling one "entity" (here, a population observation). >>> There >>> is a subject (a blank node) and one predicate-value for each cell of the >>> row. Row numbers are added because it can be important. >>> >>> Background: >>> >>> A related idea for property has come up before >>> >>> http://www.hpl.hp.com/techreports/2006/HPL-2006-140.html >>> >>> That paper should only be taken as giving a flavour. The motivation was >>> different, more about making RDF look like regular database especially >>> when >>> the data is regular. At the workshop last week, I talk to Orri Erling >>> (OpenLink/Virtuoso) and apparently, maybe by parallel evolution, Virtuoso >>> does something similar. >>> >>> >>> Aside: >>> There is a whole design space (outside this project) for translating CSV >>> to >>> RDF. >>> >>> Just if anyone is interested: see the related SQL-to-RDF work: >>> >>> http://www.w3.org/TR/r2rml/ >>> http://www.w3.org/TR/rdb-direct-mapping/ >>> >>> If the metadata said that one of the columns was uniquely defining (a >>> primary key in SQL terms, or inverse functional property in OWL-terms), >>> we >>> wouldn't need blank nodes at all - we could use a URI template, for if >>> town >>> names were unique (they are not!) a IRI template of >>> http://data/town/{Town} >>> would give: >>> >>> ------------- >>> @prefix : <http://example/table> . >>> @prefix csv: <http://w3c/future-csv-vocab/> . >>> >>> <http://data/town/Southton> >>> csv:row 1 ; >>> rdfs:label "Southton" ; >>> :Population 123000 . >>> >>> <http://data/town/Northville> >>> csv:row 2 ; >>> rdfs:label "Northville" ; >>> :Population 654000 . >>> ------------- >>> >>> Doing this transformation in rules is one route. JENA-650 connection? >>> </aside> >>> >>> In SPARQL: >>> >>> Now the CSV file is viewed as an graph - normal, unmodified SPARQL can be >>> used. Multiple CSVs files can be multiple graphs in one dataset to give >>> query across different data sources. >>> >>> # Towns over 500,000 people. >>> SELECT ?townName ?pop { >>> { GRAPH <http://example/population> { >>> ?x :Town ?townName ; >>> :Popuation ?pop . >>> FILTER(?pop > 500000) >>> } >>> } >>> >>> >>> A few comments inline - the bulk of this message is above. >>> >>> I hope this makes some sense. Having spent time with people who really >>> do >>> work with CSVs files last week around the linked geospatial workshop , >>> the >>> user needs and requirements are much clearer. >>> >>> Andy >>> >>> PS I was on a panel that included mentioning the work you did last year. >>> It >>> went well. >>> >>> On 07/03/14 12:10, Ying Jiang wrote: >>> ... >>> >>>>>> 2. Storage of the table (in-memory is enough, with reading from a >>>>>> file). >>>>>> - Questions: >>>>>> 2.1 What's the life cycle of the in-memory table? Should we discard >>>>>> the table after the query execution, or keep it in-memory for later >>>>>> reuse with the same query or update, or use by a subsequent query? >>>>>> When will the table be discarded? >>>>> >>>>> >>>>> >>>>> >>>>> That'll need refining but a way to read and reuse. There needs to be >>>>> away >>>>> for the app to pass in tables (a Map<Sting, ???> and a tool forerading >>>>> CSVs >>>>> to get the ???) because ... >>>> >>>> >>>> >>>> When will the tables be passed in? TARQL loads the CSVs when parsing >>>> the SPARQL query string. Shall we load the tables and create the Map >>>> before querying and cache them for resue? This could be similar to >>>> querying a Dataset, and the simplest way goes something like: >>>> >>>> DataTableMap<String, DataTable> dtm = >>>> DataTableSetFactory.createDataTableMap(); // The keys of dts are the >>>> URI of the DataTables loaded. >>>> dtm.addDataTable( "<ex:table_1>", "file:table_1.csv", true); // The >>>> table data are loaded when added into the map. >>>> dtm.addDataTable( "<ex:table_2>", "file:table_2.csv", false); // Or >>>> the table data are *lazy* loaded during querying later on, i.e. not >>>> loaded now. >>>> Query query = QueryFactory.create(queryString) ; // New .jj will be >>>> created for parsing TABLE and FROM TABLE clauses. However the >>>> QueryFactory interface remains the same as before. >>>> QueryExecution qExec = QueryExecutionFactory.create(query, model, >>>> dtm) ; // New create method for QueryExecutionFactory to accomendate >>>> dtm >>>> ... //dtm can be reused later on for other QueryExecutions, or be >>>> discarded when the app ends. >>>> >>>> Is the above what you mean? Any comments? >>> >>> >>> >>> Yes, using TABLE. >>> >>> With property tables it can be done as >>> >>> // Default graph of the dataset >>> >>> Model csv1 = >>> ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ; >>> QueryExecution qExec = QueryExecutionFactory.create(query, csv1) ; >>> >>> or for multiple CSV files and/or other RDF data: >>> >>> Model csv1 = >>> ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ; >>> Model csv2 = >>> ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ; >>> >>> Dataset dataset = ... ; >>> dataset.addNamedModel("http://example/population", csv1) ; >>> dataset.addNamedModel("http://example/table2", csv2) ; >>> >>> ... normal SPARQL execution ... >>> >>> >>>>> >>>>> >>>>>> 3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for >>>>>> inclusion inside a larger query, c.f. SPARQL VALUES clause). >>>>>> - Questions: >>>>>> 3.1 What're the differences between FROM TABLE and TABLE? >>>>> >>>>> >>>>> >>>>> >>>>> FROM TABLE would be one way to get tables into the query as would >>>>> passing >>>>> it >>>>> in in the query context. >>>>> >>>>> Queries can't be assumed to >>>>> >>>>> TABLE in a query is accessing the table, using it to get the >>>>> >>>>> TARQL, and I've only read the documentation, is a query over a single >>>>> CSV >>>>> file. This project should be about multiple CSVs and combining with >>>>> other >>>>> RDF data. >>>>> >>>>> A quick sketch and the syntax is not checked as sensible: >>>>> >>>>> SELECT ... { >>>>> # Fixed column names >>>>> TABLE <uri> { >>>>> BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri) >>>>> BIND (STRLANG(?a, 'en') AS ?with_language_tag) >>>>> FILTER (?v > 57) >>>>> } >>>>> } >>>>> >>>>> More ambitious to have column naming and FILTERs: >>>>> >>>>> SELECT ... >>>>> WHERE { >>>>> >>>>> TABLE <uri> { "col1" AS ?myVar1 , >>>>> "col10" AS ?V , >>>>> "col5" AS ?appName >>>>> FILTER(?V > 57) } >>>>> } >>>>> >>>>> creates a set of bindings based on access description. >>>>> >>>> >>>> Are the <uri> after TABLE the key of the Map<Sting, ???>? If so, I now >>>> understand the TABLE clauses from the examples. However, still not >>>> sure about FROM TABLE. Could you please show me some query string >>>> examples containing the FROM TABLE clauses? >>> >>> >>> >>> FROM TABLE would set the map entry. c.f. FROM NAMED >>> >>> In this case the name of the table (graph) is the location it comes from >>> - >>> it's not a general choice of name. A common issue for FROM NAMED, not >>> specific to CSV processing. >>> >
