Thank you Sergio, I know you are busy, but I hope we can clarify this in the coming days, so that we can come up with a clear idea of what we will develop, and how.

Here is what I understood and what I want to discuss. You wrote: "All that infrastructure is provided by the current LDCache module", and, if I get it right, we would then just have to implement new LDClients for our new sources.
So there is something I don't understand here. I'll try to clarify in more detail the differences between LDClient/LDCache and the functionality we want in overLOD, according to my shallow understanding so far. I haven't fully played with LDClient/LDCache yet; that is what I will do first thing next Monday.

* LDClient(s) are different clients for different data structures (aka RDFizers), for instance one LDClient for RDFa, one for XML, etc., and then also for specific data sources and their structures: one for YouTube data, one for Facebook. The LDClient knows how to access the data and how to import it into Marmotta, applying some transformation if needed (for instance translating XML to RDF). This is where we would, in our project, implement an LDClient for microdata based on schema.org. But an LDClient is not specific to one source (maybe except for the Facebook/YouTube clients and the like). In our case, we want to define an LDClient for a specific 'kind' of source, let's say some RDF files, but then we want to define pointers to very specific RDF files, for instance 20 of them. To me that would be done at a higher level, maybe LDCache. Or do you mean that we create a generic LDClient that is able to import one kind of data, and then we instantiate that LDClient 20 times, one for each specific source?

Here are examples of situations we want to handle (and certainly many coming apps based on linked data will too):
- a way is defined to publish a catalog on the web (ontologies) -> here, if 30 providers publish according to that way, we want to have a reference to 20 of them, and retrieve their data locally (cache)
- or: we could want to import different sets of data from an endpoint: the French cities from DBpedia, the elevations from GeoNames, but not only data where the resources are subject, they might be object too.

* Then the LDCache is an automatic and transparent functionality which will, during queries on the triple store, see that information about a resource can't be found in the triple store, but should be found on the web in order to answer the query. That's what I understand from my local page here: http://localhost:8080/marmotta/cache/admin/about.html and from the LDCache description I found here: http://marmotta.apache.org/ldcache/index.html.

In order for LDCache to work, the administrator has to define LD-Cache endpoints. The LD-Cache will rely on the resources' "prefix" in order to know which resource to find on which endpoint.

Another point that is different from what we want: "SPARQL (access to a resource via a SPARQL endpoint): retrieves the triples of a resource from a defined SPARQL endpoint by issuing a query for all triples with the resource as subject" -> we are not solely interested in triples where the resource is subject.

Finally, the LD-Cache will save the triples concerning the resource in the LD-Cache context. I guess all the triples retrieved by LD-Cache/LD-Client are saved in one and the same context? If so, I do have a question here: how does LDCache identify which triples in that single cache context need to be updated? -> maybe it takes all the triples which have a specific resource as subject? Or is the timeout specific to an endpoint?
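To make the 'resource as subject' point concrete, here is a minimal SPARQL sketch (only an illustration of the idea, not Marmotta code; I picked http://dbpedia.org/resource/Lyon as an arbitrary example resource) of what the current SPARQL access retrieves versus what we would also need:

  # What the LD-Cache SPARQL access does today: only triples where the resource is subject
  CONSTRUCT { <http://dbpedia.org/resource/Lyon> ?p ?o }
  WHERE     { <http://dbpedia.org/resource/Lyon> ?p ?o }

  # What we would also need: triples where the resource appears as object
  CONSTRUCT { ?s ?p <http://dbpedia.org/resource/Lyon> }
  WHERE     { ?s ?p <http://dbpedia.org/resource/Lyon> }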
And so, the functionalities I described in my former email, which I will develop further here, seem to me not implemented yet, and of great importance for the use of linked data outside of "research" projects:

Firstly, define precisely the data we want to cache (RDF or any data handled by an LDClient) on the server. It will not be the users' SPARQL queries that influence the cached data (which seems to be the case with LDCache); instead an administrator can very specifically choose which data to cache (and keep control over it) -> then only those validated data are available for the end-user apps based on that instance of Marmotta.

Then we want to have, for instance, the whole content of a file, or the result of a SPARQL CONSTRUCT -> and not just triples where a resource is subject.

And, to manage the update of those triples, I think we need one context per source (otherwise how do we know which triples were removed from the source, etc.?).

This seems to me pretty different from the current management of LD-Cache endpoints.

Another example: the LD-Cache endpoints allow saying that resources with the prefix http://dbpedia.org/resource/ should be found on a specific server. When such a resource is met (in a SPARQL query, or maybe in triples uploaded to the store?), LDCache will query the DBpedia endpoint and retrieve all triples where this resource is subject.

In overLOD: we might want to configure the referencer so that a SPARQL CONSTRUCT is run on DBpedia to retrieve all the cities from a specific country, only the city and its label in French and English (not all labels!), its population, but also information where the city is the object of a triple (which might be needed when no inverse properties are defined) -> those triples will be saved in a specific context -> they need to be validated -> we need to know when that information has changed on the server, and update the cache. Something like the sketch below.
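Just to illustrate, a rough sketch of the kind of query an administrator would register for one source (the DBpedia ontology terms here are my assumptions, and the target context would be assigned by overLOD when storing the result, not by the query itself):

  PREFIX dbo:  <http://dbpedia.org/ontology/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  CONSTRUCT {
    ?city rdfs:label ?label ;
          dbo:populationTotal ?population .
    ?s ?p ?city .                              # also keep triples where the city is object
  }
  WHERE {
    ?city a dbo:City ;
          dbo:country <http://dbpedia.org/resource/France> ;
          rdfs:label ?label .
    FILTER ( lang(?label) = "fr" || lang(?label) = "en" )   # only French and English labels
    OPTIONAL { ?city dbo:populationTotal ?population }
    OPTIONAL { ?s ?p ?city }
  }

The referencer would then save the resulting triples in one named graph per configured source (for instance a context URI derived from the source definition), so that a refresh can clear and reload just that context, and validation can run on just that context too.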
So we can't just set a 'timeout': we might cache some files that are never updated (and so there is no need to reload the data), but also some other files which are regularly updated. This 'update' mechanism is a functionality I would hope to discuss with the Marmotta team; it seems to me more efficient than the current LD-Cache and its timeout, but I am not sure yet.

Aren't there some big differences here, even though the background is similar? Are those functionalities part of the LDClient (as you suggested)? It seems to me some could be implemented in LDClient, but some others in LDCache (or a new LDCache-like module created for overLOD, which I called last time 'External Data Sources').

Thank you for helping me move forward
Fabian

>>> Sergio Fernández <[email protected]> 05.09.2014 09:36 >>>
Hi Fabian,

On 02/09/14 14:09, Fabian Cretton wrote:
> So that would be the goal of the "External data sources" module, which
> was originally called "overLOD Referencer" in the document [1]:
> - define precisely RDF data to be cached in the server: that could be an
> RDF file, a SPARQL CONSTRUCT on an endpoint, etc.
> - find a way to validate the content of that data -> here we might not
> want to reason in an open world assumption, but if a property is defined
> with a certain range, we would want to check that the objects in the
> file ARE effectively instances of that defined class (for instance
> using SPARQL queries to validate the content, instead of a reasoner).
> - find a way to manage automatically the updates: it could be a 'pull'
> from Marmotta depending on some VoID data provided by the source, or the
> source could put in place a "ping" to Marmotta, RSS-like features, like
> it was done by Ping-The-Semantic-Web or Sindice

All that infrastructure is provided by the current LDCache module. If I got it right, where you actually need to plug into this infrastructure is at the LDClient level:

* you can define new LDClient Data Providers for your specific sources
* which can wrap all the validation logic you need
* then LDCache will transparently make use of your LDClient provider
* to avoid conflicts with the default providers, they can be disabled

If that setup fits your ideas and needs, I'd recommend you take a look at the current providers:
https://github.com/apache/marmotta/tree/master/libraries/ldclient

Some of them just do data lifting from other formats (e.g., XML), some wrap APIs to get RDF out of them (e.g., Facebook), and some do other kinds of validations and fixes (e.g., the Freebase provider does RDF syntax fixing before parsing).

Hope that helps. I guess we have to provide better documentation and diagrams to understand the infrastructure LDClient+LDCache provides.

Cheers,

--
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 660 2747 925
e: [email protected]
w: http://redlink.co
