Re: dataset branching (a la git)

Kingsley Idehen Thu, 02 Oct 2014 16:06:31 -0700

On 10/2/14 6:19 PM, Jürgen Jakobitsch wrote:

ok - i guess i should come up with an example :
what i want to achieve is for example that people can rewrite part of a dataset and be able to get their version of the complete dataset.


Okay.


i.e. (java code)

i clone a whole repository, change one single line in one java file and still be able to compile the whole project.


i.e. (rdf code)

master data (in graph http://graphs.net/master) (a flat list)

<http://s.org/a> <http://p.net/label> "europe" .
<http://s.org/b> <http://p.net/label> "central europe" .
<http://s.org/c> <http://p.net/label> "austria" .
<http://s.org/d> <http://p.net/label> "carinthia" .
<http://s.org/e> <http://p.net/label> "klagenfurt" .
<http://s.org/f> <http://p.net/label> "st.martin" .

Nanotation [1] markers for generating sample data from this post, if required further on in the discussion.


## Nanotation Start ##

</document1>
<#europe> <#label> "europe" .
<#centralEurope> <#label> "central europe" .
<#austria> <#label> "austria" .
<#carinthia> <#label> "carinthia" .
<#klagenfurt> <#label> "klagenfurt" .
<#stMartin> <#label> "st.martin" .

## Nanotation End ##

person A (in graph http://graphs.net/persons/a) (= a branch with a hierarchy) (note : person A is at time T1 not an expert and doesn't know about "carinthia" being an austrian state)
<http://s.org/a> skos:narrower <http://s.org/b> .
<http://s.org/b> skos:narrower <http://s.org/c> .
<http://s.org/c> skos:narrower <http://s.org/e> .

## Nanotation Start ##

</document2>
<#europe> skos:narrower <#uk>.
<#centralEurope> skos:narrower <#bulgaria> .
<#austria> skos:narrower <#vienna> .

## Nanotation End ##

person B (in graph http://graphs.net/persons/b) (= a branch with a [better] hierarchy) (note : person B is an expert on austrian geography and knows about "carinthia" being an austrian state)
<http://s.org/a> skos:narrower <http://s.org/b> .
<http://s.org/b> skos:narrower <http://s.org/c> .
<http://s.org/c> skos:narrower <http://s.org/d> .
<http://s.org/d> skos:narrower <http://s.org/e> .


## Nanotation Start ##
## Vienna and Carinthia conflict

</document3>
<#austria> skos:narrower <#carinthia> .

## Nanotation End ##

what happened becomes clear when you take one step back and realize that all the relations (skos:narrower) have been duplicated.
now say person C is a senior expert on the municipalities and boroughs in the city of "klagenfurt". person C agrees with the graph from person B but wants to extend it. in this simple example person C => could <= simply add triples in http://graphs.net/persons/c beginning with <http://s.org/e> skos:narrower <http://s.org/f> .
and i could then do a

SELECT *
FROM  <http://graphs.net/master>
FROM  <http://graphs.net/persons/b>
FROM  <http://graphs.net/persons/c>
WHERE { ?s ?p ?o }

to get a complete and happy result.


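PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT *
FROM </document1>
FROM </document2>
FROM </document3>
WHERE { ?s ?p ?o .
        ## drop only the conflicting triple, keep everything else
        FILTER ( !( ?s = <#austria> && ?p = skos:narrower && ?o = <#vienna> ) )
      }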

OR

## Using the NOT FROM extension we've implemented

SELECT *
NOT FROM </document2>
WHERE { ?s ?p ?o . }


There are other options.


now, besides copying triples like
<http://s.org/a> skos:narrower <http://s.org/b> .
<http://s.org/b> skos:narrower <http://s.org/c> .
this example works when appending to the end of the hierarchy.

what you cannot simply do is for example replace a triple in a branch (graph)


But you can filter out a named graph.
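
To make the idea of filtering out a named graph concrete, here is a minimal sketch in plain Python. It is not how any SPARQL engine works internally: the dictionary is a toy stand-in for a quad store, the graph names and triples are taken from the Nanotation snippets above (only a subset of the labels is included), and the helper `select_all` is a hypothetical name introduced for illustration.

```python
# Toy stand-in for a quad store: graph name -> set of (s, p, o) triples.
# Data mirrors the Nanotation snippets in this post (labels abbreviated to a subset).
DATASET = {
    "/document1": {
        ("#europe", "#label", "europe"),
        ("#austria", "#label", "austria"),
        ("#carinthia", "#label", "carinthia"),
    },
    "/document2": {
        ("#europe", "skos:narrower", "#uk"),
        ("#centralEurope", "skos:narrower", "#bulgaria"),
        ("#austria", "skos:narrower", "#vienna"),
    },
    "/document3": {
        ("#austria", "skos:narrower", "#carinthia"),
    },
}


def select_all(dataset, not_from=()):
    """Hypothetical helper: union of every named graph except those in not_from."""
    merged = set()
    for name, triples in dataset.items():
        if name not in not_from:
            merged |= triples
    return merged


# Roughly the effect of leaving </document2> out of the query's dataset.
result = select_all(DATASET, not_from={"/document2"})
for triple in sorted(result):
    print(triple)
```

Leaving out `/document2` removes the Vienna statement along with everything else in that graph, which is the trade-off of filtering at graph granularity rather than per triple.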

Of course there's more. I could even generate live data from the Nanotations embedded in this post, but that's a last resort. I have a similar example of triples created via Nanotation-laced tweets that might demonstrate this shuffling in and out of named graphs used in a SPARQL processing pipeline [2][3][4][5][6][7].

say person D agrees with person B mostly, only "Central Europe" is no political entity and therefore has no place in the hierarchy.
person D could really only copy the graph and adjust the triples accordingly (but that is again copying)
now this copying i don't like.
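
One way to picture a branch that avoids the copying described above is to store only the differences from a parent graph, much as git stores changes rather than full copies. The sketch below is an illustration under assumptions, not something proposed in this thread or supported by a particular triple store: the `Branch` class is hypothetical, the short names a..e stand for the http://s.org/ URIs above, and the triple person D adds (europe directly above austria) is a guess at what person D would assert.

```python
from dataclasses import dataclass, field


@dataclass
class Branch:
    """Hypothetical model: a branch keeps only its delta against a parent."""
    added: set = field(default_factory=set)
    removed: set = field(default_factory=set)
    parent: "Branch | None" = None

    def triples(self):
        # Resolve the parent first, then apply this branch's removals and additions.
        base = self.parent.triples() if self.parent else set()
        return (base - self.removed) | self.added


N = "skos:narrower"

# Person B's hierarchy, as given in the example above.
person_b = Branch(added={("a", N, "b"), ("b", N, "c"), ("c", N, "d"), ("d", N, "e")})

# Person D drops "central europe" (b) and links europe (a) straight to austria (c).
# The added triple is an assumption made for this sketch.
person_d = Branch(
    parent=person_b,
    removed={("a", N, "b"), ("b", N, "c")},
    added={("a", N, "c")},
)

for triple in sorted(person_d.triples()):
    print(triple)
```

Person D's branch holds three statements of its own (two removals, one addition) while still resolving to a complete hierarchy, and person B's graph is left untouched.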

let's come back to the initial example of a biological classification.
i just triplified the catalogoflife.org <http://catalogoflife.org> downloadable dataset and currently have 1775844 entities, and with a couple of different opinions from a couple of different scientists this soon goes into billions of triples.
;-) i still should think about how to express the problem that i see but i need to start somewhere and writing such things down really helps sometimes..
wkr j


Hopefully, this illustrates your fundamental quest?

Links:

[1] http://bit.ly/blog-post-about-nanotation
[2] http://linkeddata.uriburner.com/c/9GDYGU3 -- Everything

[3] http://linkeddata.uriburner.com/fct/rdfdesc/usage.vsp?g=https%3A%2F%2Ftwitter.com%2Fhashtag%2FNoSilo%23this -- all the named graphs contributing to the SPARQL solution behind this page

[4] http://linkeddata.uriburner.com/c/9CJLOKIL -- same page with a specific named graph (internal document DB id/name) designated as the data source

[5] http://linkeddata.uriburner.com/fct/rdfdesc/usage.vsp?g=https%3A%2F%2Ftwitter.com%2Fhashtag%2FNoSilo%23this -- shows the designated named graph data source (hatched in the UI)

[6] http://linkeddata.uriburner.com/fct/rdfdesc/usage.vsp?g=https%3A%2F%2Ftwitter.com%2Fhashtag%2FNoSilo%23this -- two named graphs specifically designated as data sources

[7] http://linkeddata.uriburner.com/c/9CT5GRUZ -- effect of the two named graphs specifically designated as data sources


Kingsley

2014-10-02 23:42 GMT+02:00 Kingsley Idehen:


    On 10/2/14 4:02 PM, Jürgen Jakobitsch wrote:

    hi,

    when trying to classify the animals on pictures from a recent
    trip to eastern indonesia
    i meticulously realized that it is very hard if not impossible to
    branch datasets with ease.
    while this might sound ignorable at first sight it might as well
    be the reason for the giant global graph to develop a culture of
    duplicating and linking with the end effect of being very close
    to where we came from (many sql databases).

    what i mean will hopefully become clear with a simple example :

    the "manta birostris" (giant oceanic manta ray) is classified

    here wikipedia.org <http://wikipedia.org> as
    Kingdom:Animalia
    Phylum:Chordata
    Class:Chondrichthyes
    Subclass:Elasmobranchii
    Order:Myliobatiformes
    Suborder:Myliobatidae
    Family:Mobulidae
    Genus:Manta
    Species:Manta birostris

    here http://www.catalogueoflife.org/col/browse/tree/id/18879368 as
    Kingdom: Animalia
    Phylum: Chordata
    Class: Elasmobranchii
    Order: Myliobatiformes
    Family: Myliobatidae
    Genus: Manta
    Species: Manta birostris

    here http://www.marinespecies.org/aphia.php?p=browser&id=105755#ct as
    Kingdom: Animalia
    Phylum: Chordata
    Subphylum: Vertebrata
    Superclass: Gnathostomata
    Superclass Pisces (Unreviewed)
    Class: Elasmobranchii (Unreviewed)
    Subclass: Neoselachii (Unreviewed)
    Infraclass: Batoidea (Unreviewed)
    Order: Rajiformes
    Family: Myliobatidae (Unreviewed)
    Subfamily: Mobulinae
    Genus: Manta
    Species: Manta birostris

    here http://data.gbif.org/species/2419163/ as
    Kingdom: Animalia
    Phylum: Chordata
    Class: Elasmobranchii
    Order: Myliobatiformes
    Family: Myliobatidae
    Genus: Manta
    Species: Manta birostris

    if only in theory we would triplify all these datasets and link
    them it still would be very hard to find out what different
    people think about the actually same being.

    now:

    my thinking was to create a flat list of uris for => all <= these
    classifications and create branches (graphs) with the
    hierarchies. but it is not as simple as it sounds because i
    cannot make the sparql engine follow a branch at certain uris and
    then rejoin the master graph again by whatever means.


    You mean that you can't de-reference a SPARQL query pattern
    variable as part of a SPARQL query processing pipeline?

    neither can i do such things on data level.


    If the data is in 5-star Linked Open Data form you have the data
    network in place. Then it's about a SPARQL query that crawls the
    data-network. Ultimately, each entity description document SHOULD
    end up being an internal triples/quad store document identifier
    (a/k/a named graph IRI).

    Naturally, what I describe above is how Virtuoso will behave if
    you include input:grab pragmas in your SPARQL.


    i was thinking about like so [1] on a triple (quad) level.

    questions:

    1. is the problem described so that it is at least
    semi-understandable (or should i come up with some triples as
    example)


    I think so, but not 100% certain :)

    2. has this problem already been dealt with and i was only
    missing that day (please provide a link)


    Sorta, in some other conversations about LOD cloud crawling and
    SPARQL.

    3. has this problem already been solved and i was only missing
    that day (please provide a link)
    4. do you think it is worth dealing with
        (i personally think so [think: scaling cooperation ])
    5. would it be of enough interest to create a wg

    any pointers and thoughts highly appreciated
    wkr turnguard

-- Regards,


    Kingsley Idehen     
    Founder & CEO
    OpenLink Software
    Company Web:http://www.openlinksw.com
    Personal Weblog 1:http://kidehen.blogspot.com
    Personal Weblog 2:http://www.openlinksw.com/blog/~kidehen
    Twitter Profile:https://twitter.com/kidehen
    Google+ Profile:https://plus.google.com/+KingsleyIdehen/about
    LinkedIn Profile:http://www.linkedin.com/in/kidehen
    Personal WebID:http://kingsley.idehen.net/dataspace/person/kidehen#this



