Re: [d2rq-dev] Using the assembler to union d2rq to a TDB model [SEC=UNCLASSIFIED]

Paul Murray Thu, 26 Feb 2015 21:45:39 -0800

Apologies for the long email, but I'm trying to describe the problem 
completely. I'll write this mail as I attempt to make what I want to happen, 
happen. Maybe in the process of writing this I will solve it for myself :)

(TL;DR: I got it going! Yay! But I'll send this saga out anyway because it took 
me most of the afternoon to write it.)

On 26/02/2015, at 9:30 PM, Richard Cyganiak wrote:

> Hi Paul,
> 
> A quick initial response.
> 
> Even though D2RQ is made for directly querying a relational DB with SPARQL, 
> if you need to integrate data from multiple sources, I recommend dumping them 
> all to RDF and loading them into a single RDF store. This is the best way to 
> get performance and reliability. Of course, it may not be possible for you 
> due to database size or quickly changing data.

At present, my data sources are:

1 - A set of OWL ontologies in static RDF. I load these so that SPARQL can be 
written over them - perhaps I might even be able to drive the Pellet reasoner 
off them. There are 50 or so. These include our local ontologies as well as a 
couple of standard ones - SKOS, TDWG (Taxonomic Database Working Group), dublin 
core, darwin core.

2 - Three separate TDB data sets
** The Australian Faunal Directory (AFD) data. 18G. This gets built by a batch 
process that takes a day and a bit to complete. We would like to update it once 
a month, or even weekly.
** The old Australian Plant Names Index (APNI). 10G. This used to be extracted 
alongside AFD, but this database is no longer being updated. The legacy data 
needs to remain present on the semantic web. 
** A load of the Catalogue of Life, 2011. 34G. Again, this will no longer get 
updated, I think (mainly because no-one is paying for it to be done). The key 
part of this data is that it contains mappings from the COL identifiers to AFD 
and APNI ones.

3 - The new APNI. This is the dataset that I would like to expose via d2r. 

I keep this dataset organised by splitting it into named graphs. The named 
graph named 'meta'
        <http://biodiversity.org.au/voc/graph/GRAPH#meta>

describes the named graphs on the service. This is not the default graph of the 
SPARQL endpoint … but perhaps it should be. At present the default graph on the 
server is an empty one because I was concerned that I might be exposing it to 
an update service.

So, you can see that there are reasons for building this heterogeneous mish-mash 
of multiple types of data sources. I have big data sets, updated on different 
schedules, using systems that have been in place and working for a couple of 
years now. Having the datasets stored as RDF and then loading them into memory 
with a parser is really not a great option. What I want is to append a D2R 
graph into this thing.

> The D2RQ assembler works for me with the version of Joseki that ships with 
> D2RQ, Joseki 3.4.4.
> 
> I’m not sure what you mean by “external assembler file.” Can you explain or 
> give an example?

Ok, this is great news and what I was hoping to hear.

Steps are -
1 - Get a standalone install of Joseki 3.4.4 running with an assembler with a 
couple of simple graphs.
2 - Get the Joseki installation inside d2r doing this - acting as a SPARQL 
endpoint, serving up static graphs.
3 - Get the Joseki installation inside d2r serving up the static graphs 
alongside a very simple d2r graph.
4 - Jam the big thing into it.

Currently, the problem is step 2.

=== STEP 1 - A simple assembler that works ok with Joseki 3.4.4 ===

Let's start with the basics - I want to launch a sparql endpoint in the d2r 
distribution with just a couple of static, empty named graphs. 
Once I have that going, jamming all of this other stuff in there is simply a 
matter of editing the assembler.

D2R is using Joseki 3.4.4. Let's write an assembler file that I know works 
against 3.4.4. To do this, I will grab a copy of Joseki 3.4.4.

. . .

OK. I save this assembler as simpleconfig.ttl, and write a little 
sampledata.ttl with one row in it.

-----------------------------------

@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix module: <http://joseki.org/2003/06/module#> .
@prefix joseki: <http://joseki.org/2005/06/configuration#> .
@prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .

<#graph1> a ja:MemoryModel
    ; ja:content 
      [ ja:externalContent <file:sampledata.ttl>
      ]
    .

<#graph2> a ja:MemoryModel
    ; ja:content 
      [ ja:externalContent <file:sampledata.ttl>
      ]
    .

<#graph3> a ja:UnionModel
    ; ja:subModel <#graph1>
    ; ja:subModel <#graph2>
    .

<#server> a joseki:Server 
    .

<#service> a  joseki:Service 
    ; joseki:serviceRef "sparql"
    ; joseki:dataset     
        [ a ja:RDFDataset
        ; ja:namedGraph 
            [ ja:graphName <http://example.org/#graph1>    
            ; ja:graph <#graph1>
            ]
        ; ja:namedGraph 
            [ ja:graphName <http://example.org/#graph2>    
            ; ja:graph <#graph2>
            ]
        ; ja:namedGraph 
            [ ja:graphName <http://example.org/#uniongraph>    
            ; ja:graph <#graph3>
            ]
        ]
    ; joseki:processor 
        [ a joseki:Processor
        ; module:implementation 
            [  a joseki:ServiceImpl
            ;  module:className <java:org.joseki.processors.SPARQL>
            ]
        ]
    .

-----------------------------------

When I save this into the Joseki directory and execute
$ export JOSEKIROOT=.
$ bin/rdfserver simpleconfig.ttl

And then browse to 
http://localhost:2020/sparql?query=select+*+where+%7B+GRAPH+%3Fg+%7B+%3Fs+%3Fp+%3Fo+%7D+%7D&output=text

Then I get back some triples. Outstanding.
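
The query string in that URL is just a percent-encoded SPARQL query. A minimal sketch of a decoder in POSIX shell and awk follows, to show what is actually being asked of the endpoint. The function name urldecode is made up for illustration and is not part of Joseki or D2RQ.

```shell
# urldecode STRING: print STRING with '+' turned into spaces and
# %XX escapes turned into the characters they stand for.
# (Hypothetical helper, only meant for plain ASCII query strings.)
urldecode() {
    printf '%s\n' "$1" | awk '
        BEGIN { hex = "0123456789ABCDEF" }
        {
            gsub(/\+/, " ")
            out = ""
            # Consume the line one %XX escape at a time.
            while (match($0, /%[0-9A-Fa-f][0-9A-Fa-f]/)) {
                h = toupper(substr($0, RSTART + 1, 2))
                code = (index(hex, substr(h, 1, 1)) - 1) * 16 + (index(hex, substr(h, 2, 1)) - 1)
                out = out substr($0, 1, RSTART - 1) sprintf("%c", code)
                $0 = substr($0, RSTART + 3)
            }
            print out $0
        }'
}

urldecode 'select+*+where+%7B+GRAPH+%3Fg+%7B+%3Fs+%3Fp+%3Fo+%7D+%7D'
# prints: select * where { GRAPH ?g { ?s ?p ?o } }
```

In other words, the test query asks for every triple in every named graph. Assuming curl is installed, the same request can probably be sent without hand-encoding it:

    curl -G --data-urlencode 'query=select * where { GRAPH ?g { ?s ?p ?o } }' --data-urlencode 'output=text' http://localhost:2020/sparql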

So! Let's try making this go in the d2r copy of joseki!

=== STEP 2 - Make the simple assembler run in joseki in the d2r installation ===

Ok. The d2r root directory does not have an rdfserver executable. It does have 
a d2r-server executable. This executable has an option for a d2r mapping file, 
but not for a Jena assembler.

But let's take a punt and see if it will accept a jena assembler anyway:

    $ ./d2r-server simpleconfig.ttl 
    15:44:52 WARN  org.eclipse.jetty.webapp.WebAppContext :: Failed startup of context o.e.j.w.WebAppContext{,file:/Users/ibis/git/d2rq-0.8.1/webapp/},webapp
    de.fuberlin.wiwiss.d2rq.D2RQException: No d2rq:Database defined in the mapping (E1)

Fine, that's what I expected. The d2r launcher launches an endpoint that has 
one graph, that graph being a d2r graph initialised by the parameter. It simply 
doesn't take a general Jena assembler as input.

So let's have a look at the joseki that's inside d2r. The various files in the 
joseki installation are not present. There is a joseki.jar in the lib directory 
and that's it. 

Let's try launching joseki using the same method that rdfserver in the full 
installation launches it. The launch in the full installation is:

    exec "$JAVA" $JAVA_ARGS -cp "$CP" $LOG joseki.rdfserver "$@"

JAVA_ARGS is just "-server -Xmx1G", which we can ignore.
LOG is -Dlog4j.configuration=${LOGCONFIG}, which we can also ignore.
And CP will have to be all the lib files in the joseki directory.

  $ CP=$(find lib -name \*.jar | while read j ; do echo -n "$j:" ; done)
  $ java -cp $CP joseki.rdfserver simpleconfig.ttl

Exception in thread "main" java.lang.NoClassDefFoundError: org/mortbay/jetty/Connector
        at joseki.rdfserver.main(rdfserver.java:85)

Riiiiight. In any case, this at the very least isn't going to work because the 
WEB-INF won't be set up right. The joseki config alone isn't enough - the d2r 
installation isn't set up correctly to expose the joseki SPARQL endpoint as a 
service.

================= NEW PLAN =================

OK! New plan - we will include the d2r libraries in the joseki launch!

Well, there's a fair bit of duplication there, which is not a big problem. 
Except that there are some different versions of some of the libraries, which 
is bad.

Let's just make a humungous classpath and launch Joseki with the D2R libraries 
jammed in there. What could possibly go wrong? First, I'll try putting the 
joseki libraries first.

  $ CP=$(find lib -name \*.jar | while read j ; do echo -n "$j:" ; done):$(find ~/git/d2rq-0.8.1/lib -name \*.jar | while read j ; do echo -n "$j:" ; done)
  $ java -cp $CP joseki.rdfserver simpleconfig.ttl

SLF4J has a bit of a bitch about a class being in two of the jars, but apart 
from that it's all good.
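
The classpath one-liner above can be wrapped in a small helper so that the directory order is explicit. This is only a sketch: the function name build_cp is invented here, and unlike the original one-liner it sorts the jars within each directory so that the result is repeatable. Directories given first end up first on the classpath, which is what decides which copy of a duplicated class the JVM loads.

```shell
# build_cp DIR...: print a colon-separated classpath made of every .jar
# found under each DIR, keeping the DIRs in the order they were given.
# (Hypothetical helper; the sort is an addition for repeatability.)
build_cp() {
    cp=""
    for dir in "$@"; do
        # Quote the glob so the shell does not expand it before find sees it.
        part=$(find "$dir" -name '*.jar' | sort | paste -sd: -)
        if [ -n "$part" ]; then
            cp="${cp:+$cp:}$part"
        fi
    done
    printf '%s\n' "$cp"
}

# Usage, with the paths from this mail:
#   CP=$(build_cp lib ~/git/d2rq-0.8.1/lib)
#   java -cp "$CP" joseki.rdfserver simpleconfig.ttl
```

Swapping the two directory arguments is then enough to try the other ordering.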

So - let's try adding in a d2r assembler

Copy simpleconfig.ttl to withd2r.ttl, and add a mapping to the assembler. 

<#graph4> a d2rq:D2RQModel;
    d2rq:mappingFile <simplemapping.ttl>;
    d2rq:resourceBaseURI <http://localhost:2020/test123>;
    .

<#service> a  joseki:Service 
    ; joseki:serviceRef "sparql"
    ; joseki:dataset     
        [ a ja:RDFDataset
        ; ja:namedGraph 
            [ ja:graphName <http://example.org/#graph1>    
            ; ja:graph <#graph1>
            ]
        ; ja:namedGraph 
            [ ja:graphName <http://example.org/#graph2>    
            ; ja:graph <#graph2>
            ]
        ; ja:namedGraph 
            [ ja:graphName <http://example.org/#uniongraph>    
            ; ja:graph <#graph3>
            ]
        ; ja:namedGraph 
            [ ja:graphName <http://example.org/#d2rqgraph>    
            ; ja:graph <#graph4>
            ]
        ]
    ; joseki:processor 
        [ a joseki:Processor
        ; module:implementation 
            [  a joseki:ServiceImpl
            ;  module:className <java:org.joseki.processors.SPARQL>
            ]
        ]
    .

Ok, blows up with a 
  the root file:///Users/ibis/Software/Joseki/Joseki-3.4.4/withd2r.ttl#graph4 
has no most specific type that is a subclass of ja:Object

This is ok - now I understand what that import is for :) . So let's include the 
import

<> ja:imports <http://d2rq.org/terms/d2rq> .

This blows up with a 
  Not found: file:///Users/ibis/Software/Joseki/Joseki-3.4.4/simplemapping.ttl

Which is awesome! The assembler loads and tries to do what I told it to do. 
Booyah! I'm excited, but I've been this excited before and been cruelly 
disappointed, all my hopes dashed. 

So now I need to create a simple d2r mapping file and name it simplemapping.ttl

----------------------------------------------
@prefix map: <#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
@prefix jdbc: <http://d2rq.org/terms/jdbc/> .

@prefix d2r: 
<http://sites.wiwiss.fu-berlin.de/suhl/bizer/d2r-server/config.rdf#> .

@prefix nsl: <http://biodiversity.org.au/voc/nsl/NSL#> .

map:Configuration a d2rq:Configuration;
    d2rq:serveVocabulary true
    .

map:APNI_database a d2rq:Database;
        d2rq:jdbcDriver "org.postgresql.Driver";
        d2rq:jdbcDSN "jdbc:postgresql://localhost:5432/nsl";
        d2rq:username "--DELETED--";
        d2rq:password "--DELETED--";
        .

map:APNI_Namespace a d2rq:ClassMap;
        d2rq:dataStorage map:APNI_database;
        d2rq:uriPattern "http://biodiversity.org.au/voc/nsl/NamespaceTerm#@@namespace.rdf_id@@";
        d2rq:class nsl:Namespace;
        d2rq:class nsl:IdentifiedEntity;
        .

--------------------------------------------------

And launch - It launches! Run the query - and it complains that the database is 
down. Of course.

Start the database, run the query ... And I get an error

 INFO [301278115@qtp-1423640817-0] (SPARQL.java:165) - Throwable: Implementing class
java.lang.IncompatibleClassChangeError: Implementing class
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at de.fuberlin.wiwiss.d2rq.algebra.CompatibleRelationGroup.addNodeRelation(CompatibleRelationGroup.java:53)
        at de.fuberlin.wiwiss.d2rq.algebra.CompatibleRelationGroup.groupNodeRelations(CompatibleRelationGroup.java:38)

Damn.

Ok - what if I load up the classpath with the d2r libraries first and the 
joseki libraries second?

  $ CP=$(find ~/git/d2rq-0.8.1/lib -name \*.jar | while read j ; do echo -n "$j:" ; done):$(find lib -name \*.jar | while read j ; do echo -n "$j:" ; done)
  $ java -cp $CP joseki.rdfserver withd2r.ttl

and browse to 
http://localhost:2020/sparql?query=select+*+where+%7B+GRAPH+%3Fg+%7B+%3Fs+%3Fp+%3Fo+%7D+%7D&output=text

and 

--------------------------------------------------

OMG! OMG! It works! It works!

So: it seems I can make it go by launching joseki but putting all the d2r 
libraries in the classpath first. It's a little fragile, and what I would like 
is not to have to do this - to have a prepackaged setup that works. 
Obviously the d2r installation has a bunch of stuff I don't need - the gear 
that supports the d2r web app pages. I still have not confirmed that it will 
co-operate with the multi-gigabyte TDB datasets I need to use. But ... it does 
seem to go. I can write a Java app to scan the jar files and find which ones 
declare the same class names, to arrive at a set of jars that don't overlap.

Hopefully, sometime next week I'll be able to show you the new system serving 
the old and the new data together at biodiversity.org.au. But for now, it's 
16:45 on Friday which makes it Beer O'Clock.

