Hi Andy,

For all the possibilities, are they all supposed to map to a single
interface, e.g. DataTable? It seems that a CSV table, the result of a
previous query, or a file in RDF tuple/result syntax could each be
transformed into a DataTable. If so, we could just code against the
DataTable interface and provide suitable transformers for the
different sources. Does that sound right?
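As a rough illustration of what I mean (just a sketch; "DataTable" and
"DataTableSource" are placeholder names, not existing Jena classes):

    import java.util.Iterator;
    import java.util.List;
    import com.hp.hpl.jena.sparql.core.Var;
    import com.hp.hpl.jena.sparql.engine.binding.Binding;

    // A common view of tabular data: column variables plus rows of RDF term bindings.
    public interface DataTable {
        List<Var> getVars();
        Iterator<Binding> rows();
    }

    // One transformer per source (CSV file, previous result set, RDF tuple file),
    // each producing the same DataTable.
    public interface DataTableSource {
        DataTable read();
    }

The query execution code would then only depend on DataTable, and each
source would get its own transformer.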

Any other data structures to be considered in this project? Or just tables?

Best regards,
Ying Jiang


On Tue, Mar 4, 2014 at 7:19 PM, Andy Seaborne <a...@apache.org> wrote:
> One extra observation:
>
> If the structure is
>
> + CSV -> RDF data table
> + RDF Data table and query execution
>
> and then execute the query with that data table (as well as everything else)
> then the RDF data table might come from CSV but it might come from other
> sources.
>
> Possibilities include:
> * A previous query - record the results of a query and reuse them in later
> queries (this is essentially a cache and also a way to avoid writing the
> same pattern over and over again; see the sketch after this list).
>
> * A file in RDF tuple syntax or RDF result syntax (mainly for testing!)
>
> * A regular process that runs and pre-calculates certain important patterns
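>
> For the "previous query" case, caching the materialised results is already
> easy with existing ARQ machinery; roughly (just a sketch - copyResults is
> existing API, feeding the cached rows back into later queries as a data
> table is the part this project would add):
>
>     // Run a query once and keep the results around for reuse.
>     Model model = ModelFactory.createDefaultModel();   // whatever data is being queried
>     Query query = QueryFactory.create("SELECT * { ?s ?p ?o }");
>     QueryExecution qexec = QueryExecutionFactory.create(query, model);
>     ResultSetRewindable cached = ResultSetFactory.copyResults(qexec.execSelect());
>     qexec.close();
>     // Later: cached.reset() and feed the rows into an RDF data table.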
>
> This project does not have to cover all those possibilities - it should get
> the architecture right so that all of them can happen later.
>
> Storing an RDF data table persistently (other than as a text format) by
> reusing the TDB machinery would be nice, but that is a very different part
> of the codebase to work with, so I'm suggesting the project doesn't try to
> include it this time.
>
>         Andy
>
>
>
>
> On 03/03/14 22:26, Andy Seaborne wrote:
>>
>> On 03/03/14 03:12, Ying Jiang wrote:
>>>
>>> Hi Andy,
>>
>>
>> Hi Ying,
>>
>>>
>>> Thanks for your suggestions! I'm more interested in JENA-625 (Data
>>> Tables for SPARQL). I've seen your new comments in JIRA and studied
>>> the source code of Tarql. I'd like to paste your comments here with my
>>> questions below to clarify the details of this project:
>>>
>>> 1. CSV to RDF terms (tuples of RDF Terms is already supported
>>> internally in Jena)
>>>   - Questions:
>>> 1.1 Tarql uses the first row of CSV as variable names. Should we use
>>> the same idea?
>>
>>
>> Seems like a good start, although care is needed because a column name can
>> be anything and SPARQL variable names are restricted.
>>
>> If there is no header row (we can require that the app say so by some
>> mechanism), or if the app wants different names, then there should be a way
>> to provide them, falling back to something predictable if dull: ?_col1,
>> ?_col2, ...
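>>
>> For the fallback, something like this would do (a sketch only; the real
>> check for a legal SPARQL variable name needs to follow the grammar, this
>> regex is a simplification):
>>
>>     // Map a CSV header cell to a variable, falling back to a positional name.
>>     static Var columnVar(String header, int index) {
>>         if (header == null || !header.matches("[A-Za-z_][A-Za-z0-9_]*"))
>>             return Var.alloc("_col" + (index + 1));   // ?_col1, ?_col2, ...
>>         return Var.alloc(header);
>>     }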
>>
>> See below - there's no need to have fixed variable names.
>>
>>> 1.2 As to "internal support of tuples of RDF terms in Jena", do you
>>> mean com.hp.hpl.jena.sparql.algebra.table.TableData? Tarql uses
>>> TableData to accommodate RDF term bindings from CSV.
>>
>>
>> That, and there is also some RDF tuples code to read/write a textual form:
>>
>>
>> https://svn.apache.org/repos/asf/jena/Experimental/rdfpatch/src/main/java/org/apache/jena/riot/tio/
>>
>>
>> (there are other versions of this code around - this is the ready to use
>> form)
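>>
>> Building a TableData from parsed CSV rows is roughly (a sketch; these are
>> existing ARQ classes, but the exact calls should be double-checked):
>>
>>     List<Var> vars = Arrays.asList(Var.alloc("a"), Var.alloc("b"));
>>     List<Binding> rows = new ArrayList<Binding>();
>>     BindingMap row = BindingFactory.create();
>>     row.add(Var.alloc("a"), NodeFactory.createLiteral("cell value"));
>>     row.add(Var.alloc("b"), NodeFactory.createLiteral("another cell"));
>>     rows.add(row);
>>     Table table = new TableData(vars, rows);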
>>
>>> 2. Storage of the table (in-memory is enough, with reading from a file).
>>>   - Questions:
>>> 2.1 What's the life cycle of the in-memory table? Should we discard
>>> the table after the query execution, or keep it in-memory for later
>>> reuse with the same query or update, or use by a subsequent query?
>>> When will the table be discarded?
>>
>>
>> That'll need refining, but yes - a way to read and reuse.  There needs to
>> be a way for the app to pass in tables (a Map<String, ???> and a tool for
>> reading CSVs to get the ???) because ...
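>>
>> Something along these lines (a sketch; "Table" stands for whatever the data
>> table abstraction ends up being, and CSV2Table is a made-up name for the
>> CSV reading tool):
>>
>>     // App-supplied named tables, keyed by the URI used in the query.
>>     Map<String, Table> tables = new HashMap<String, Table>();
>>     tables.put("http://example/tables/sensors", CSV2Table.read("sensors.csv"));
>>     // ... handed to query execution so a table reference can find them.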
>>
>>> 3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
>>> inclusion inside a larger query, c.f. SPARQL VALUES clause).
>>>   - Questions:
>>> 3.1 What're the differences between FROM TABLE and TABLE?
>>
>>
>> FROM TABLE would be one way to get tables into the query, as would passing
>> them in via the query context.
>>
>> Queries can't be assumed to ...
>>
>> TABLE in a query is accessing the table, using it to get the bindings.
>>
>> TARQL (and I've only read the documentation) is a query over a single CSV
>> file.  This project should be about multiple CSVs and combining them with
>> other RDF data.
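>>
>> From the API side, passing a table in via the context might look like this
>> (a sketch; Symbol is existing ARQ machinery, but the symbol name and the
>> table value are made up):
>>
>>     QueryExecution qexec = QueryExecutionFactory.create(query, dataset);
>>     // "tables" is the Map<String, Table> from above; the symbol name is made up.
>>     qexec.getContext().set(Symbol.create("arq:dataTables"), tables);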
>>
>> A quick sketch (the syntax has not been checked as sensible):
>>
>> SELECT ... {
>>    # Fixed column names
>>    TABLE <uri> {
>>       BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
>>       BIND (STRLANG(?a, 'en') AS ?with_language_tag)
>>       FILTER (?v > 57)
>>    }
>> }
>>
>> It would be more ambitious to have column naming and FILTERs:
>>
>> SELECT ...
>> WHERE {
>>
>>     TABLE <uri> { "col1" AS ?myVar1 ,
>>                   "col10" AS ?V ,
>>                   "col5" AS ?appName
>>                   FILTER(?V > 57) }
>> }
>>
>> This creates a set of bindings based on the access description.
>>
>>
>>> 3.2 Tarql programmatically modifies the query (parsed by the standard
>>> SPARQLParser11) with CSV table data, without touching the original
>>> SPARQL grammar parsing module. Should we adopt a different approach of
>>> modifying the parsing grammar in the .jj files and just asking javacc to
>>> generate the new parsing code?
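>>>
>>> (For reference, the programmatic route is roughly the following; a sketch
>>> from memory, the actual Tarql code differs in detail:)
>>>
>>>     // Prepend the CSV bindings to the parsed query pattern as an ElementData.
>>>     ElementData tableElement = new ElementData();
>>>     for (Var v : vars) tableElement.add(v);       // vars/rows as parsed from the CSV
>>>     for (Binding b : rows) tableElement.add(b);
>>>     ElementGroup group = new ElementGroup();
>>>     group.addElement(tableElement);
>>>     group.addElement(query.getQueryPattern());
>>>     query.setQueryPattern(group);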
>>
>>
>> I think the latter if possible.
>>
>> This, like all projects, will need to move to a detailed design, but I
>> don't think that puts the project as a whole at risk.  The basic TARQL idea
>> would be a great addition.
>>
>>      Andy
>>
>>>
>>> 4. Modify execution to include tables.
>>> Questions: No questions for this now.
>>>
>>> Best regards,
>>> Ying Jiang
>>>
>>> On Thu, Feb 27, 2014 at 10:49 PM, Andy Seaborne <a...@apache.org> wrote:
>>>>
>>>> On 26/02/14 15:14, Ying Jiang wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> With the great guidance from the mentors, especially Andy, I had a
>>>>> good time in GSoC 2013 working on jena-spatial [1]. I'm very grateful.
>>>>> Really learnt a lot from that project.
>>>>>
>>>>> This year, I find the issue of "Extend CONSTRUCT to build quads" [2]
>>>>> very interesting. I've used javacc before. I can understand the ARQ
>>>>> module of parsing SPARQL strings. With a label of "gsoc2014", is it a
>>>>> suitable project for Jena in GSoC 2014? Any more details about the
>>>>> project? Thanks!
>>>>>
>>>>> Best regards,
>>>>> Ying Jiang
>>>>>
>>>>> [1] http://jena.apache.org/documentation/query/spatial-query.html
>>>>> [2] https://issues.apache.org/jira/browse/JENA-491
>>>>>
>>>>
>>>> Hi there,
>>>>
>>>> Given your level of skill and expertise, this project is possibly a bit
>>>> small for you.  It's not the same scale as jena-spatial.  It's probably
>>>> more suited to an undergraduate or someone looking to learn about working
>>>> inside a moderately large existing codebase.  You have a lot more software
>>>> engineering experience.
>>>>
>>>> Can I interest you in one of:
>>>>
>>>> * JENA-625 especially the part about CSV ingestion.  There is now a W3C
>>>> working group looking at tabular data on the web so we know this is
>>>> interesting to the user community.
>>>>
>>>> * JENA-647, (only just added) which is server side query templates for
>>>> creating data views.
>>>>
>>>> In conjunction with someone (else) doing JENA-632 (custom JSON from SPARQL
>>>> query), we would have a data delivery platform for creating domain-specific
>>>> data delivery for webapps.
>>>>
>>>> (This was provided in the proprietary Talis platform as "SPARQL Stored
>>>> Procedures", but that platform no longer exists.  There is no need to
>>>> follow it exactly, but it was a popular feature, so it would be useful.)
>>>>
>>>> * JENA-624, which is about a new memory-based storage layer.  As a
>>>> project, it's nearer in scale to jena-spatial.  This is less about RDF and
>>>> linked data and more about systems programming.
>>>>
>>>>          Andy
>>>>
>>
>
