Re: [GSoC 2014] Data Tables for SPARQL

Ying Jiang Tue, 18 Mar 2014 21:23:28 -0700

Dear Andy,

I've submitted a proposal [1] to GSoC, according to our previous
discussions. Please let me know if anything can be improved.
Thanks a lot!


Cheers,
Ying Jiang

[1] 
http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/jpz6311whu/5632763709358080

On Mon, Mar 17, 2014 at 10:17 PM, Andy Seaborne <[email protected]> wrote:
> On 16/03/14 04:31, Ying Jiang wrote:
>>
>> Dear Andy,
>>
>> I greatly appreciate your detailed explanations. I've studied all the
>> examples and the links you mentioned. I'll try to summarise here with
>> further questions below:
>>
>> 1. We have 2 possible ways for the project: "variables-as-columns" and
>> "property tables". I can understand both the ideas, thanks to your
>> instructions. The former one has its issues you pointed out, and the
>> latter one seems to make more sense for the users. Do you mean we
>> should discard the former one and focus on the latter in this project?
>
>
> Yes - "predicates-for-columns" = "property tables"
>
> From that, you can recover "variables-as-columns" by query pattern. The
> reverse is messy at best. Either very unnatural variable names to stop
> clashes or beign careful about scoping (and that will confuse people).
>
>
>> 2. We can have some lessons learned from SQL-to-RDF work. But CSV
>> (even regular-shaped CSV) is different from database in some ways,
>> which requires us to dig in deeper on the details. Some questions
>> like:
>
>
> The W3C "CSV on the Web Working Group" [1] is working on a standard
> mechanism for converting CSV to other forms, RDF included.  The details of
> that mechanism aren't clear yet and won't be in time for the project - it's
> an area that (my current belief) will chop and change a fair bit in getting
> to a final specification.
>
>
> The area of CSV-RDF is bigger than a GSoC project anyway and fairly open
> ended given all the sorts of the things people do with CSV files (e.g.
> encoding author lists in fields).
>
> But there is a simpler case - one need is a "direct mapping" whereby a
> CSV file with no additional metadata is mapped to RDF.  I think we can focus
> on a design for this in the project.
>
> The translation is fixed : blank node for each row (addresses the primary
> key issue - and alternative below), the base URL of the CSV file is used to
> generate the predicate names.
>
> Then, the project gets all the machinery working - otherwise the output will
> CSV to RDF without the Jena architectural chnages to support it in the long
> term.
>
> [1] https://www.w3.org/2013/csvw/wiki/Main_Page
>
>
>> 2.1 How to determine the data type of the column? All the values in
>> CSV are firstly parsed as Strings line by line. Suppose the parser
>> found a number string of "123000.0", how can we know whether it's an
>> integer, a float/double or even just a string in RDF?
>
>
> Initially, they can be strings.
>
> Later, and maybe an option the user can turn on, then a dynamic choice which
> is a posh way of saying attempt to parse it as an integer and if it passes,
> it's an integer.  Spreadsheets do this guessing.
>
> "Duck datatyping" - if it looks like an integer (decimal, double, date) it
> is an integer (decimal, double, date).
>
> Actually, this is then the same as tokenizing and there is code to reuse to
> do that.
>
>
>> 2.2 How to deal with the namespaces? RDF requires that the subjects
>> and the predicates are URIs. We need to pass in the namespaces (or
>> just the default namespaces) to make URIs by combining the namespaces
>> with the values in CSV. Things may get more complicated if different
>> columns are to be bound with different namespaces.
>
>
> Subject a can be blank nodes which is useful because each row is then a new
> blank node.
>
> One row written in RDF might be:
>
>
> [ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
>
> or
>
>
> _:b0  csv:row 1 ;
>       :Town "Southton" ;
>       :Population 123000 .
>
> It's the same RDF triples (3 of them).
>
> For predicates, suppose the URL of the CSV file is <FILE> then the columns
> can be  <FILE#Town> and <FILE#Population>.
>
> Rules or SPARQL Update can be used to turn that into a better data model if
> the users wants to write that code.
>
>
>> 2.3 The hp 2006 report [1] says "Jena supports three kinds of property
>> tables as well as a triple store". The "town" example you provided
>> conforms to the "single-valued" property table. Shall we consider the
>> others (e.g. the "multi-valued" one and the "triple store" one) in
>> this project? Does Jena in the latest release still support these
>> property tables? If so, where're the related source codes?
>
>
> Single-valued.
>
> In the CSV-WG it looks like duplicate column names are not going to be
> supported (at best, the parser has to make then unique by adding "1", "2"
> etc).
>
> Despite what the report says, the code didn't make it into the public Jena
> codebase.  (And we have removed the old RDB subsystem it refers to.)
>
>
>> 2.4 There's no "primary key" definition in CSV. All the RDF are not
>> OWL in fact. How do we know the column in CSV is uniquely defining? It
>> seems CSV lacks of some kind of "metadata" of the columns and the
>> values. If we have such metadata, how to pass in the namespace of  the
>> IRI template of http://data/town/{Town} (something related to the
>> question 2.2)?
>
>
> It's not necessary to have a defined primary row - that is generated subject
> URI.  It might be nice if available but that's metadata.
>
> So one of:
> 1/ The triples for each row have a blank node for subject
> 2/ The triples for row N have a URI which is <FILE#_N>.
>
> In both cases, the subject node is generated automatically.
>
>
>> 3. For the "property tables" way, it seems that all we need to do is
>> to resolve the problems in 2., and to code "GraphCSV" accordingly. I
>> can make the GraphCSV class by implementing the Graph interface. In
>> this way, for Jena ARP, a CSV table is actually a Graph, without any
>> differences from other types of Graphs. It looks like that there's no
>> need to introduce TABLE and FROM TABLE clauses in the SPARQL language
>> grammar. We can just use the existing GRAPH, FROM and FROM NAMED
>> clauses for the CSV "property tables", can't we?
>
>
> s/ARP/ARQ/ -- ARP is the RDF/XML parser; ARQ is the query engine :-)
>
> Yes - correct.
>
> In the later stages of the project, there is an item to make OpExecutor
> (which is the class that actually drives the SPARQL execution) do better for
> GraphCSV than just treating it as a Graph by accessing the PropertyTable
> behind it.
>
> The big gain for PropertyTables is the space saving they enable as well as
> the possibility of making them persistent in a special storage system (not
> in this project but the design should not make that too hard at some later
> time).
>
>         Andy
>
>
>>
>> Best regards,
>> Ying Jiang
>>
>> [1] http://www.hpl.hp.com/techreports/2006/HPL-2006-140.pdf
>>
>>
>>
>> On Mon, Mar 10, 2014 at 10:50 PM, Andy Seaborne <[email protected]> wrote:
>>>
>>> Hi Ying,
>>>
>>> Good questions.  I'll try to give a response to the specific points
>>> you've
>>> brought up but also there is a different I want to put forward for
>>> discussion.
>>>
>>> I'll write up a first draft of a project plan then we can see if the size
>>> and scope is realistic.
>>>
>>> You asked about whether variables are column names.  That is how TARQL
>>> and
>>> SPARQL VALUES works but I've realised there is a different approach and
>>> it's
>>> one that will give a better system.  It is to translate the CSV to RDF,
>>> and
>>> this may be materialized or dynamically mapped. If "materialized" it's
>>> likely to be a lot bigger; as "property tables" or somethign inspired by
>>> that idea, it'll be more compact.
>>>
>>> There are some issues with variables-as-columns include:
>>>
>>> 1/ Fixed variable names don't combine with other part of a query pattern
>>> very well.
>>>
>>> If there is common use of the same name it a join - that's what a natural
>>> join in SQL is.  If there are two tables, then ?a is overloaded.  If
>>> column
>>> names are used to derive a variable name, we may not want to equate them
>>> in
>>> the query because column names in different CSV files weren't designed
>>> with
>>> that in mind.
>>>
>>> 2/ You can't describe (in RDF) the data very easily - e.g. annotate that
>>> a
>>> column is of years.
>>>
>>> 3/  It needs the language to change (i.e. TABLE to access it)
>>>
>>> In TARQL, which is focusing on a controlled transform from CSV to RDF, it
>>> works out quite nicely - variables go into the CONSTRUCT template. It
>>> produces RDF.
>>>
>>> Property tables are a style of approach where the CSV data is accessed as
>>> RDF.
>>>
>>> The data table columns be predicate URIs.  The data table itself is an
>>> RDF
>>> graph of regular structure.  It can be accessed with normal (unmodified)
>>> SPARQL syntax. It would be better if the storage and execution of that
>>> part
>>> of the SPARQL query were adapted to such regular data.  Something for
>>> after
>>> getting an initial cut down.
>>>
>>> Suppose we have a CSV file:
>>> -------------------
>>> Town,Population
>>> Southton,123000
>>> Northville,654000
>>> -------------------
>>>
>>> One header row, two data rows.
>>>
>>> Aside: this is regular-shaped CSV (and some CSV files are definitely not
>>> regular at all!). There is the current editors working draft from the CSV
>>> on
>>> the Web Working Group (not yet published, likely to change, only part of
>>> the
>>> picture, etc etc)
>>>
>>> http://w3c.github.io/csvw/syntax/
>>>
>>> which is defining a more regular data out of CSV.  This is the target for
>>> the CSV work: table shaped CSV; not arbitrary, irregularly shaped CSV.
>>>
>>> There is no way the working group will have standardised any CSV to RDF
>>> mapping in the lifetime of the GSoC project but the WG charter says it
>>> must
>>> be covered.  So the mapping below is made up and ahead of where the
>>> working
>>> group is currently but a standardized, "direct mapping" (no metadata, no
>>> templates) style is going to happen.  The mapping details may change but
>>> the
>>> general approach is clear.
>>>
>>> As RDF this might be
>>>
>>> -------------
>>> @prefix : <http://example/table> .
>>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>>
>>> [ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
>>> [ csv:row 2 ; :Town "Northville" ; :Population 654000 ] .
>>> -------------
>>>
>>> or without the bnode abbreviation:
>>>
>>> -------------
>>> @prefix : <http://example/table> .
>>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>>
>>> _:b0  csv:row 1 ;
>>>        :Town "Southton" ;
>>>        :Population 123000 .
>>>
>>> _:b1  csv:row 2 ;
>>>        :Town "Northville" ;
>>>        :Population 654000 .
>>> -------------
>>>
>>>
>>> Each row is modelling one "entity" (here, a population observation).
>>> There
>>> is a subject (a blank node) and one predicate-value for each cell of the
>>> row.  Row numbers are added because it can be important.
>>>
>>> Background:
>>>
>>> A related idea for property has come up before
>>>
>>>    http://www.hpl.hp.com/techreports/2006/HPL-2006-140.html
>>>
>>> That paper should only be taken as giving a flavour. The motivation was
>>> different, more about making RDF look like regular database especially
>>> when
>>> the data is regular.  At the workshop last week, I talk to Orri Erling
>>> (OpenLink/Virtuoso) and apparently, maybe by parallel evolution, Virtuoso
>>> does something similar.
>>>
>>>
>>> Aside:
>>> There is a whole design space (outside this project) for translating CSV
>>> to
>>> RDF.
>>>
>>> Just if anyone is interested: see the related SQL-to-RDF work:
>>>
>>> http://www.w3.org/TR/r2rml/
>>> http://www.w3.org/TR/rdb-direct-mapping/
>>>
>>> If the metadata said that one of the columns was uniquely defining (a
>>> primary key in SQL terms, or inverse functional property in OWL-terms),
>>> we
>>> wouldn't need blank nodes at all - we could use a URI template, for if
>>> town
>>> names were unique (they are not!) a IRI template of
>>> http://data/town/{Town}
>>> would give:
>>>
>>> -------------
>>> @prefix : <http://example/table> .
>>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>>
>>> <http://data/town/Southton>
>>>        csv:row 1 ;
>>>        rdfs:label "Southton" ;
>>>        :Population 123000 .
>>>
>>> <http://data/town/Northville>
>>>        csv:row 2 ;
>>>        rdfs:label "Northville" ;
>>>        :Population 654000 .
>>> -------------
>>>
>>> Doing this transformation in rules is one route.  JENA-650 connection?
>>> </aside>
>>>
>>> In SPARQL:
>>>
>>> Now the CSV file is viewed as an graph - normal, unmodified SPARQL can be
>>> used.  Multiple CSVs files can be multiple graphs in one dataset to give
>>> query across different data sources.
>>>
>>> # Towns over 500,000 people.
>>> SELECT ?townName ?pop {
>>> { GRAPH <http://example/population> {
>>>      ?x :Town ?townName ;
>>>         :Popuation ?pop .
>>>      FILTER(?pop > 500000)
>>>    }
>>> }
>>>
>>>
>>> A few comments inline - the bulk of this message is above.
>>>
>>> I hope this makes some sense.  Having spent time with people who really
>>> do
>>> work with CSVs files last week around the linked geospatial workshop ,
>>> the
>>> user needs and requirements are much clearer.
>>>
>>>          Andy
>>>
>>> PS I was on a panel that included mentioning the work you did last year.
>>> It
>>> went well.
>>>
>>> On 07/03/14 12:10, Ying Jiang wrote:
>>> ...
>>>
>>>>>> 2. Storage of the table (in-memory is enough, with reading from a
>>>>>> file).
>>>>>>     - Questions:
>>>>>> 2.1 What's the life cycle of the in-memory table? Should we discard
>>>>>> the table after the query execution, or keep it in-memory for later
>>>>>> reuse with the same query or update, or use by a subsequent query?
>>>>>> When will the table be discarded?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> That'll need refining but a way to read and reuse.  There needs to be
>>>>> away
>>>>> for the app to pass in tables (a Map<Sting, ???> and a tool forerading
>>>>> CSVs
>>>>> to get the ???) because ...
>>>>
>>>>
>>>>
>>>> When will the tables be passed in? TARQL loads the CSVs when parsing
>>>> the SPARQL query string. Shall we load the tables and create the Map
>>>> before querying and cache them for resue? This could be similar to
>>>> querying a Dataset, and the simplest way goes something like:
>>>>
>>>> DataTableMap<String, DataTable> dtm =
>>>> DataTableSetFactory.createDataTableMap(); // The keys of dts are the
>>>> URI of the DataTables loaded.
>>>> dtm.addDataTable( "<ex:table_1>", "file:table_1.csv", true); // The
>>>> table data are loaded when added into the map.
>>>> dtm.addDataTable( "<ex:table_2>", "file:table_2.csv", false); // Or
>>>> the table data are *lazy* loaded during querying later on, i.e. not
>>>> loaded now.
>>>> Query query = QueryFactory.create(queryString) ; // New .jj will be
>>>> created for parsing TABLE and FROM TABLE clauses. However the
>>>> QueryFactory interface remains the same as before.
>>>> QueryExecution qExec = QueryExecutionFactory.create(query, model,
>>>> dtm) ; // New create method for QueryExecutionFactory to accomendate
>>>> dtm
>>>> ... //dtm can be reused later on for other QueryExecutions, or be
>>>> discarded when the app ends.
>>>>
>>>> Is the above what you mean? Any comments?
>>>
>>>
>>>
>>> Yes, using TABLE.
>>>
>>> With property tables it can be done as
>>>
>>> // Default graph of the dataset
>>>
>>> Model csv1 =
>>>    ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
>>> QueryExecution qExec = QueryExecutionFactory.create(query, csv1) ;
>>>
>>> or for multiple CSV files and/or other RDF data:
>>>
>>> Model csv1 =
>>>    ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
>>> Model csv2 =
>>>    ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
>>>
>>> Dataset dataset = ... ;
>>> dataset.addNamedModel("http://example/population";, csv1) ;
>>> dataset.addNamedModel("http://example/table2";, csv2) ;
>>>
>>> ... normal SPARQL execution ...
>>>
>>>
>>>>>
>>>>>
>>>>>> 3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
>>>>>> inclusion inside a larger query, c.f. SPARQL VALUES clause).
>>>>>>     - Questions:
>>>>>> 3.1 What're the differences between FROM TABLE and TABLE?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> FROM TABLE would be one way to get tables into the query as would
>>>>> passing
>>>>> it
>>>>> in in the query context.
>>>>>
>>>>> Queries can't be assumed to
>>>>>
>>>>> TABLE in a query is accessing the table, using it to get the
>>>>>
>>>>> TARQL, and I've only read the documentation, is a query over a single
>>>>> CSV
>>>>> file.  This project should be about multiple CSVs and combining with
>>>>> other
>>>>> RDF data.
>>>>>
>>>>> A quick sketch and the syntax is not checked as sensible:
>>>>>
>>>>> SELECT ... {
>>>>>     # Fixed column names
>>>>>     TABLE <uri> {
>>>>>        BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
>>>>>        BIND (STRLANG(?a, 'en') AS ?with_language_tag)
>>>>>        FILTER (?v > 57)
>>>>>     }
>>>>> }
>>>>>
>>>>> More ambitious to have column naming and FILTERs:
>>>>>
>>>>> SELECT ...
>>>>> WHERE {
>>>>>
>>>>>      TABLE <uri> { "col1" AS ?myVar1 ,
>>>>>                    "col10" AS ?V ,
>>>>>                    "col5" AS ?appName
>>>>>                    FILTER(?V > 57) }
>>>>> }
>>>>>
>>>>> creates a set of bindings based on access description.
>>>>>
>>>>
>>>> Are the <uri> after TABLE the key of the Map<Sting, ???>? If so, I now
>>>> understand the TABLE clauses from the examples. However, still not
>>>> sure about FROM TABLE. Could you please show me some query string
>>>> examples containing the FROM TABLE clauses?
>>>
>>>
>>>
>>> FROM TABLE would set the map entry.  c.f. FROM NAMED
>>>
>>> In this case the name of the table (graph) is the location it comes from
>>> -
>>> it's not a general choice of name.  A common issue for FROM NAMED, not
>>> specific to CSV processing.
>>>
>

Re: [GSoC 2014] Data Tables for SPARQL

Reply via email to