Thank you Sergio, I know you are busy, but I hope we can clarify this in the coming days, so that we can come up with a clear idea of what we will develop, and how.

Here is what I understood and what I want to discuss. You wrote: "All that infrastructure is provided by the current LDCache module", and, if I get it right, we would then just have to implement new LDClients for our new sources.
So there is something I don't understand here. I'll try to clarify in more detail the differences between LDClient/LDCache and the functionality we want in overLOD, according to my shallow understanding so far. I haven't fully played with LDClient/LDCache yet; that is what I will do first thing next Monday.

* LDClient(s) are different clients for different data structures (aka RDFizers), for instance one LDClient for RDFa, one for XML, etc., and then also for specific data sources and their structures: one for YouTube data, one for Facebook. The LDClient knows how to access the data and how to import it into Marmotta, applying some transformation if needed (for instance translating XML to RDF). This is where we would, in our project, implement an LDClient for microdata based on schema.org. But an LDClient is not specific to one source (maybe except for the Facebook/YouTube clients and the like). In our case, we want to define an LDClient for a specific 'kind' of source, let's say some RDF files, but then we want to define pointers to very specific RDF files, for instance 20 of them. To me that would be done at a higher level, maybe LDCache. Or do you mean that we create a generic LDClient that is able to import one kind of data, and then we instantiate that LDClient 20 times, one for each specific source?

Here are examples of situations we want to handle (and certainly many coming apps based on linked data will too):
- a way is defined to publish a catalog on the web (ontologies) -> here, if 30 providers publish according to that way, we want to have a reference to 20 of them, and retrieve their data locally (cache)
- or: we could want to import different sets of data from an endpoint: the French cities from DBpedia, the elevations from GeoNames, but not only data where the resources are subject, they might be object too.

* Then the LDCache is an automatic and transparent functionality which will, during queries on the triple store, see that information about a resource can't be found in the triple store, but should be found on the web in order to answer the query. That's what I understand from my local page here: http://localhost:8080/marmotta/cache/admin/about.html and from the LDCache description I found here: http://marmotta.apache.org/ldcache/index.html.

In order for LDCache to work, the administrator has to define LD-Cache endpoints. The LD-Cache will rely on the resources' "prefix" in order to know which resource to find on which endpoint.

Another point that is different from what we want: "SPARQL (access to a resource via a SPARQL endpoint): retrieves the triples of a resource from a defined SPARQL endpoint by issuing a query for all triples with the resource as subject" -> we are not solely interested in triples where the resource is subject.

Finally, the LD-Cache will save the triples concerning the resource in the LD-Cache context. I guess all the triples retrieved by LD-Cache/LD-Client are saved in one and the same context? If so, I do have a question here: how does LDCache identify which triples in that single cache context need to be updated? -> maybe it takes all the triples which have a specific resource as subject? Or is the timeout specific to an endpoint?
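To make the 'resource as subject' point concrete, here is a minimal SPARQL sketch (only an illustration of the idea, not Marmotta code; I picked http://dbpedia.org/resource/Lyon as an arbitrary example resource) of what the current SPARQL access retrieves versus what we would also need:

  # What the LD-Cache SPARQL access does today: only triples where the resource is subject
  CONSTRUCT { <http://dbpedia.org/resource/Lyon> ?p ?o }
  WHERE     { <http://dbpedia.org/resource/Lyon> ?p ?o }

  # What we would also need: triples where the resource appears as object
  CONSTRUCT { ?s ?p <http://dbpedia.org/resource/Lyon> }
  WHERE     { ?s ?p <http://dbpedia.org/resource/Lyon> }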
And so, the functionalities I described in my former email, which I will develop further here, seem to me not implemented yet, and of great importance for the use of linked data outside of "research" projects:

Firstly, define precisely the data we want to cache (RDF or any data handled by an LDClient) on the server. It will not be the users' SPARQL queries that influence the cached data (which seems to be the case with LDCache); instead an administrator can very specifically choose which data to cache (and keep control over it) -> then only those validated data are available for the end-user apps based on that instance of Marmotta.

Then we want to have, for instance, the whole content of a file, or the result of a SPARQL CONSTRUCT -> and not just triples where a resource is subject.

And, to manage the update of those triples, I think we need one context per source (otherwise how do we know which triples were removed from the source, etc.?).

This seems to me pretty different from the current management of LD-Cache endpoints.

Another example: the LD-Cache endpoints allow saying that resources with the prefix http://dbpedia.org/resource/ should be found on a specific server. When such a resource is met (in a SPARQL query, or maybe in triples uploaded to the store?), LDCache will query the DBpedia endpoint and retrieve all triples where this resource is subject.

In overLOD: we might want to configure the referencer so that a SPARQL CONSTRUCT is run on DBpedia to retrieve all the cities from a specific country, only the city and its label in French and English (not all labels!), its population, but also information where the city is the object of a triple (which might be needed when no inverse properties are defined) -> those triples will be saved in a specific context -> they need to be validated -> we need to know when that information has changed on the server, and update the cache. Something like the sketch below.
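Just to illustrate, a rough sketch of the kind of query an administrator would register for one source (the DBpedia ontology terms here are my assumptions, and the target context would be assigned by overLOD when storing the result, not by the query itself):

  PREFIX dbo:  <http://dbpedia.org/ontology/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  CONSTRUCT {
    ?city rdfs:label ?label ;
          dbo:populationTotal ?population .
    ?s ?p ?city .                              # also keep triples where the city is object
  }
  WHERE {
    ?city a dbo:City ;
          dbo:country <http://dbpedia.org/resource/France> ;
          rdfs:label ?label .
    FILTER ( lang(?label) = "fr" || lang(?label) = "en" )   # only French and English labels
    OPTIONAL { ?city dbo:populationTotal ?population }
    OPTIONAL { ?s ?p ?city }
  }

The referencer would then save the resulting triples in one named graph per configured source (for instance a context URI derived from the source definition), so that a refresh can clear and reload just that context, and validation can run on just that context too.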
So we can't just set a 'timeout': we might cache some files that are never updated (and so there is no need to reload the data), but also some other files which are regularly updated. This 'update' mechanism is a functionality I would hope to discuss with the Marmotta team; it seems to me more efficient than the current LD-Cache and its timeout, but I am not sure yet.

Aren't there some big differences here, even though the background is similar? Are those functionalities part of the LDClient (as you suggested)? It seems to me some could be implemented in LDClient, but some others in LDCache (or a new LDCache-like module created for overLOD, which I called last time 'External Data Sources').

Thank you for helping me move forward
Fabian

>>> Sergio Fernández <[email protected]> 05.09.2014 09:36 >>>
Hi Fabian,

On 02/09/14 14:09, Fabian Cretton wrote:
> So that would be the goal of the "External data sources" module, which
> was originally called "overLOD Referencer" in the document [1]:
> - define precisely RDF data to be cached in the server: that could be an
> RDF file, a SPARQL CONSTRUCT on an endpoint, etc.
> - find a way to validate the content of that data -> here we might not
> want to reason in an open world assumption, but if a property is defined
> with a certain range, we would want to check that the objects in the
> file ARE effectively instances of that defined class (for instance
> using SPARQL queries to validate the content, instead of a reasoner).
> - find a way to manage automatically the updates: it could be a 'pull'
> from Marmotta depending on some VoID data provided by the source, or the
> source could put in place a "ping" to Marmotta, RSS-like features, like
> it was done by Ping-The-Semantic-Web or Sindice

All that infrastructure is provided by the current LDCache module. If I got it right, where you actually need to plug into this infrastructure is at the LDClient level:

* you can define new LDClient Data Providers for your specific sources
* which can wrap all the validation logic you need
* then LDCache will transparently make use of your LDClient provider
* to avoid conflicts with the default providers, they can be disabled

If that setup fits your ideas and needs, I'd recommend you take a look at the current providers:
https://github.com/apache/marmotta/tree/master/libraries/ldclient

Some of them just do data lifting from other formats (e.g., XML), some wrap APIs to get RDF out of them (e.g., Facebook), and some do other kinds of validations and fixes (e.g., the Freebase provider does RDF syntax fixing before parsing).

Hope that helps. I guess we have to provide better documentation and diagrams to understand the infrastructure LDClient+LDCache provides.

Cheers,

--
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 660 2747 925
e: [email protected]
w: http://redlink.co
