Re: [GSoC 2014] Data Tables for SPARQL

Andy Seaborne Tue, 04 Mar 2014 03:21:07 -0800

One extra observation:

If the structure is


+ CSV -> RDF data table
+ RDF Data table and query execution

and then execute the query with that data table (as well as everything else), then the RDF data table might come from CSV, but it might come from other sources.


Possibilities include:

* A previous query - record the results of a query and reuse them in later queries (this is essentially a cache and also a way to avoid writing the same pattern over and over again).


* A file in RDF tuple syntax or RDF result syntax (mainly for testing!)

* A regular process that runs and pre-calculates certain important patterns

This project does not have to cover all those possibilities - it should get the architecture right so that they can all happen.

Storing an RDF data table persistently (other than in a text format), reusing TDB machinery, would be nice, but it is a very different part of the codebase to work with, so I'm suggesting the project doesn't try to include that this time.


        Andy



On 03/03/14 22:26, Andy Seaborne wrote:

On 03/03/14 03:12, Ying Jiang wrote:

Hi Andy,


Hi Ying,


Thanks for your suggestions! I'm more interested in JENA-625 (Data
Tables for SPARQL). I've seen your new comments in JIRA and studied
the source code of Tarql. I'd like to paste your comments here with my
questions below to clarify the details of this project:

1. CSV to RDF terms (tuples of RDF Terms is already supported
internally in Jena)
  - Questions:
1.1 Tarql uses the first row of CSV as variable names. Should we use
the same idea?


Seems like a good start, although care is needed because a column header
can be anything and SPARQL variable names are restricted.

If there is no header row (and we can require that the app says so by
some mechanism), or if the app wants different names, then there should
be a way to provide them, falling back to something predictable if dull:
?_col1, ?_col2, ...

See below - there's no need to have fixed variable names.
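
A minimal sketch of that fallback, in plain Java with no Jena dependency. The class and method names are hypothetical, and restricting names to ASCII letters, digits and underscore is a deliberately conservative assumption (the SPARQL grammar allows more characters than this):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: derive one SPARQL-safe variable name per CSV column. */
final class ColumnVars {
    /**
     * header may be null when the app says the CSV has no header row.
     * Returned names carry no leading '?'.
     */
    static List<String> varNames(List<String> header, int columnCount) {
        List<String> names = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < columnCount; i++) {
            String raw = (header != null && i < header.size()) ? header.get(i) : null;
            // Replace anything outside the conservative character set with '_'.
            String name = (raw == null) ? "" : raw.trim().replaceAll("[^A-Za-z0-9_]", "_");
            // Missing, empty or duplicate header: fall back to _col1, _col2, ...
            if (name.isEmpty() || seen.contains(name)) {
                name = "_col" + (i + 1);
            }
            // Guard against a header that already used the fallback name.
            while (!seen.add(name)) {
                name = name + "_";
            }
            names.add(name);
        }
        return names;
    }
}
```

With a header of "name", "unit price", "" and "name" again, this gives name, unit_price, _col3 and _col4; with no header at all it gives _col1, _col2, ...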

1.2 As to "internal support of tuples of RDF terms in Jena", do you
mean com.hp.hpl.jena.sparql.algebra.table.TableData? Tarql uses
TableData to accommodate RDF term bindings from CSV.


That and there is also some RDF tuples code to read/write a textual form
as well:

https://svn.apache.org/repos/asf/jena/Experimental/rdfpatch/src/main/java/org/apache/jena/riot/tio/


(there are other versions of this code around - this is the ready to use
form)

2. Storage of the table (in-memory is enough, with reading from a file).
  - Questions:
2.1 What's the life cycle of the in-memory table? Should we discard
the table after the query execution, or keep it in-memory for later
reuse with the same query or update, or use by a subsequent query?
When will the table be discarded?


That'll need refining but a way to read and reuse.  There needs to be
a way for the app to pass in tables (a Map<String, ???> and a tool
for reading CSVs to get the ???) because ...
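
One possible shape for this, as a sketch only. DataTable is a hypothetical stand-in for the "???" type and holds raw cell strings rather than RDF terms; TableRegistry and fromCsvText are likewise invented names, and the CSV handling is deliberately naive (split on commas, no quoting or escaping):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the "???" type: column names plus rows of raw cells. */
final class DataTable {
    final List<String> columns;
    final List<List<String>> rows;

    DataTable(List<String> columns, List<List<String>> rows) {
        this.columns = columns;
        this.rows = rows;
    }

    /** Naive CSV tool: first line is the header; cells are split on ',' with no quoting. */
    static DataTable fromCsvText(String text) {
        String[] lines = text.split("\r?\n");
        List<String> columns = Arrays.asList(lines[0].split(",", -1));
        List<List<String>> rows = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            if (!lines[i].isEmpty()) {
                rows.add(Arrays.asList(lines[i].split(",", -1)));
            }
        }
        return new DataTable(columns, rows);
    }
}

/** Tables the app passes in, keyed by the name (for example a URI string) a query refers to. */
final class TableRegistry {
    private final Map<String, DataTable> tables = new HashMap<>();

    void put(String name, DataTable table) {
        tables.put(name, table);
    }

    /** Returns null when no table is registered under that name. */
    DataTable get(String name) {
        return tables.get(name);
    }
}
```

The point of keying by name is that a table read once stays available to later queries until the app removes it, whatever source it came from.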

3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
inclusion inside a larger query, c.f. SPARQL VALUES clause).
  - Questions:
3.1 What're the differences between FROM TABLE and TABLE?


FROM TABLE would be one way to get tables into the query, as would
passing them in via the query context.

Queries can't be assumed to

TABLE in a query is accessing the table, using it to get the

TARQL, and I've only read the documentation, is a query over a single
CSV file.  This project should be about multiple CSVs and combining with
other RDF data.

A quick sketch and the syntax is not checked as sensible:

SELECT ... {
   # Fixed column names
   TABLE <uri> {
      BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
      BIND (STRLANG(?a, 'en') AS ?with_language_tag)
      FILTER (?v > 57)
   }
}

More ambitious to have column naming and FILTERs:

SELECT ...
WHERE {

    TABLE <uri> { "col1" AS ?myVar1 ,
                  "col10" AS ?V ,
                  "col5" AS ?appName
                  FILTER(?V > 57) }
}

This creates a set of bindings based on the access description.
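
To make "a set of bindings based on an access description" concrete, here is a rough sketch in plain Java. Nothing here is Jena API: TableAccess and its parameters are hypothetical, bindings are simple maps from variable name to cell string, and the FILTER is modelled as an ordinary predicate over a binding:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical evaluation of a TABLE access description over an in-memory table. */
final class TableAccess {
    /**
     * For each row, bind the requested columns to their variable names
     * (for example "col10" -> "V"), then keep only rows that pass the filter.
     * A requested column that is absent leaves its variable unbound.
     */
    static List<Map<String, String>> bindings(List<String> columns,
                                              List<List<String>> rows,
                                              Map<String, String> columnToVar,
                                              Predicate<Map<String, String>> filter) {
        List<Map<String, String>> out = new ArrayList<>();
        for (List<String> row : rows) {
            Map<String, String> binding = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : columnToVar.entrySet()) {
                int idx = columns.indexOf(e.getKey());
                if (idx >= 0 && idx < row.size()) {
                    binding.put(e.getValue(), row.get(idx));
                }
            }
            if (filter.test(binding)) {
                out.add(binding);
            }
        }
        return out;
    }
}
```

The FILTER(?V > 57) above would then be a predicate such as `b -> b.containsKey("V") && Integer.parseInt(b.get("V")) > 57`, applied after the columns have been renamed.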

3.2 Tarql programmatically modifies the query (parsed by the standard
SPARQLParser11) with CSV table data, without touching the original
SPARQL grammar parsing module. Should we adopt a different approach of
modifying the parsing grammar in the .jj files and just ask javacc to
generate the new parsing code?


I think the latter if possible.

This, like all projects, will need to move to a detailed design but I
don't think it puts the project as a whole at risk.  The basic TARQL idea
would be a great addition.

     Andy


4. Modify execution to include tables.
Questions: No questions for this now.

Best regards,
Ying Jiang

On Thu, Feb 27, 2014 at 10:49 PM, Andy Seaborne <[email protected]> wrote:

On 26/02/14 15:14, Ying Jiang wrote:


Hi,

With the great guidance from the mentors, especially Andy, I had a
good time in GSoC 2013 working on jena-spatial [1]. I'm very grateful.
Really learnt a lot from that project.

This year, I find the issue of "Extend CONSTRUCT to build quads" [2]
very interesting. I've used javacc before. I can understand the ARQ
module of parsing SPARQL strings. With a label of "gsoc2014", is it a
suitable project for Jena in GSoC 2014? Any more details about the
project? Thanks!

Best regards,
Ying Jiang

[1] http://jena.apache.org/documentation/query/spatial-query.html
[2] https://issues.apache.org/jira/browse/JENA-491


Hi there,

Given your level of skill and expertise, this project is possibly a bit
small for you.  It's not the same scale as jena-spatial. It's probably
more suited to an undergraduate or someone looking to learn about working
inside a moderately large existing codebase. You have a lot more software
engineering experience.

Can I interest you in one of:

* JENA-625 especially the part about CSV ingestion.  There is now a W3C
working group looking at tabular data on the web so we know this is
interesting to the user community.

* JENA-647, (only just added) which is server side query templates for
creating data views.

In conjunction with someone (else) doing JENA-632 (custom JSON from
SPARQL query), we would have a data delivery platform for creating domain
specific data delivery for webapps.

(this was provided in the proprietary Talis platform as "SPARQL Stored
Procedures" but that no longer exists.  No need to exactly follow that but
it was a popular feature so it is useful).

* JENA-624 which is about a new memory-based storage layer.  As a project,
it's nearer in scale to jena-spatial.  This is less about RDF and linked
data and more about systems programming.

         Andy
