The RDF Pipeline Framework also has a Perl script that reads RDF (Turtle), figures out the data's implied schema -- classes and predicates -- and outputs a summary. The code is open source (Apache 2.0 licensed) and resides on GitHub:
https://github.com/dbooth-boston/rdf-pipeline/blob/master/tools/summarize-rdf

It is *not* very efficient, so at present it is not suitable for large RDF datasets. (It could be made more efficient, but no effort has been put into that yet.) An opening comment in the code explains the output:
[[
# Runtime: ~30 minutes / 600k triples on a 2012 laptop (quad processor)
#
# EXAMPLE INPUT:
# 1. @prefix p: <http://purl.org/pipeline/ont#> .
# 2. @prefix : <http://localhost/node/> .
# 3. :max a p:FileNode . # No updater -- update manually.
# 4. :odds a p:FileNode ;
# 5. p:inputs ( :max ) ;
# 6. p:updater "odds-updater" .
# 7. :mult a p:FileNode ;
# 8. p:inputs ( :odds <http://localhost/node/multiplier.txt> ) ;
# 9. p:updater "mult-updater" .
# 10. :addone a p:FileNode ;
# 11. p:inputs ( :mult ) ;
# 12. p:updater "addone-updater" .
# 13. p:URI <http://www.w3.org/2000/01/rdf-schema#subClassOf> p:Node .
#
# EXAMPLE OUTPUT:
# 1. ===== Input Summary =====
# 2. Parsing turtle: /tmp/jin.ttl
# 3. Total triples: 19
# 4. Nodes by kind: BLANK 4 LITERAL 3 URI 7
# 5. Literals by datatype: UNTYPED 3
# 6.
# 7. ===== Predicates by Subject Class =====
# 8. p1:FileNode 4
# 9. p1:inputs 3 -> { rdf:List 3 } 3
# 10. p1:updater 3 -> { (UNTYPED) 3 } 3
# 11. rdf:type 4 -> { rdfs:Class 1 } 1
# 12.
# 13. rdf:List 4
# 14. rdf:first 4 -> { p1:FileNode 3 UNKNOWN 1 } 4
# 15. rdf:rest 4 -> { rdf:List 2 } 2
# 16.
# 17. rdfs:Class 1
# 18. rdfs:subClassOf 1 -> { rdfs:Class 1 } 1
# 19.
# 20. * Indicates a root class, whose instances are never objects.
# 21.
# 22. ===== Namespaces =====
# 23. PREFIX p1: <http://purl.org/pipeline/ont#>
# 24. PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# 25. PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
#
# EXPLANATION OF OUTPUT:
# Numbers are instance counts. Braces group a list of classes,
# because things can have more than one class.
#
# For brevity, class and predicate names have been shortened by
# stripping a presumed namespace (though not necessarily a namespace
# that you declared using @prefix). Namespace prefixes are listed
# at the end (on lines 23-25), but they are not necessarily the
# same as the prefixes that were used in the input turtle, because
# the original prefixes are lost in parsing.
#
# Line 4 shows the number of distinct blank nodes, literals and URIs.
#
# Line 5 breaks down the literals by datatype, showing the number
# of distinct instances for each datatype.
#
# Line 8 indicates that there were 4 distinct p1:FileNode instances
# in the subject position of a triple.
#
# Line 9 indicates that the domain of p1:inputs included p1:FileNode,
# range included rdf:List, there were 3 triples having a p1:FileNode
# instance in the subject position and p1:inputs as predicate, and
# there were 3 distinct rdf:List values in the object position of a triple.
#
# Line 10 indicates that the range of p1:updater was a set of
# untyped literal values, and there were 3 distinct literals.
# A datatype range (as opposed to a class range) is indicated
# in parentheses. It also indicates that there were 3 triples
# with subject class p1:FileNode and predicate p1:updater.
#
# Line 14 indicates that the rdf:first predicate has a range that
# includes both the p1:FileNode class (having 3 instances) and
# an unknown class (having 1 instance), for a total of 4
# distinct instances. In this case, the unknown class was due
# to <http://localhost/node/multiplier.txt> on input line 8,
# as it was not declared with any rdf:type . Remember that
# rdf:first and rdf:rest are auto-generated from Turtle
# list syntax ( ... ).
#
# Line 17 indicates that there was one distinct instance of class
# rdfs:Class as a subject.
#
# Line 20: In this example there were no root classes, i.e.,
# classes whose instances never appear in the object position,
# but if there had been, then each would have been marked
# with an asterisk.
]]
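
For a SPARQL endpoint, a rough approximation of the "Predicates by
Subject Class" table above can also be computed directly with a query
along these lines. This is only a sketch, not what the script does
internally: it counts triples rather than distinct instances, skips
untyped subjects, and does no namespace shortening.
[[
SELECT ?class ?p (COUNT(*) AS ?n)
WHERE { ?s a ?class ; ?p ?o . }
GROUP BY ?class ?p
ORDER BY ?class ?p
]]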

David Booth

On 02/04/2015 03:49 PM, Michael F Uschold wrote:
Sorry, ignore the prior email, it was sent prematurely.

We had occasion to need the ability to explore a triple store (TS) in
an application we were building for a client. Triples were being
created by scripts and loaded into the TS; we also had an application
that allowed users to enter information, which added more triples. All
of this was backed by an ontology that was evolving. It was pretty
tricky knowing which parts of the ontology were being exercised and
which were not. So we wrote some SPARQL queries that produced a table
where each row said something like this: there are 543 triples where
the subject is of type Person, the predicate is employedBy, and the
object is of type Organization. The table looked a bit like this:

Subject        Predicate      Object          Count
Person         hasEmployer    Organization     2344
Organization   locatedIn      GeoRegion         432

We found this to be extremely useful: not only to see exactly what was
being used and how much, but also what was NOT being used, which made
those parts candidates for removal from the ontology. The SPARQL
queries are not simple to write, but they are not too bad either; a
sketch of their general shape is below. Some of the other responses
spoke of similar things.
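
The sketch (not the exact queries we used; the names come back as
whatever types and predicates occur in the data, and the counts here
are triple counts):
[[
SELECT ?subjType ?pred ?objType (COUNT(*) AS ?count)
WHERE {
  ?s ?pred ?o .
  ?s a ?subjType .
  ?o a ?objType .
}
GROUP BY ?subjType ?pred ?objType
ORDER BY DESC(?count)
]]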

This is more specialized than the original question, which was to find
out what the ontology was. Here we were more concerned with which
parts of the ontology were being used.

Michael


On Wed, Feb 4, 2015 at 12:42 PM, Michael F Uschold
<[email protected]> wrote:

    We had occasion to need this ability in an application we were
    building for a client using a triple store (TS). Triples were being
    created by scripts and loaded into the TS; we also had an
    application that allowed users to enter information, which added
    more triples. All of this was backed by an ontology that was
    evolving. It was pretty tricky knowing which parts of the ontology
    were being exercised and which were not. So we wrote some SPARQL
    queries that produced a table where each row said something like
    this: there are 543 triples where the subject is of type Person,
    the predicate is employedBy, and the object is of type
    Organization. A row looked like this:

    Subject

    On Wed, Feb 4, 2015 at 11:35 AM, Lushan Han
    <[email protected]> wrote:

        This work [1] might be helpful to some people. It automatically
        learns a "schema" from a given RDF dataset, including the most
        probable classes and properties and the most probable
        relations/paths between given classes. It can then
        automatically translate a casual user's intuitive graph query,
        or schema-free query, into a formal SPARQL query using the
        learned schema and statistical NLP techniques such as textual
        semantic similarity.

        [1] http://ebiquity.umbc.edu/paper/html/id/658/Schema-Free-Querying-of-Semantic-Data


        Cheers,

        Lushan

        On Sun, Jan 25, 2015 at 11:32 PM, Pavel Klinov
        <[email protected]> wrote:

            On Sun, Jan 25, 2015 at 11:44 PM, Bernard Vatant
            <[email protected]> wrote:
            > Hi Pavel
            >
            > Very interesting discussion, thanks for the follow-up. Some
            > quick answers below, but I'm currently writing a blog post
            > which will go into more detail on the notion of Data Patterns,
            > a term I've been pushing last week on the DC Architecture list,
            > where it seems to have gained some traction. See
            > https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind1501&L=dc-architecture
            > for the discussion.

            OK, thanks for the link, will check it out. I agree that
            "patterns" is perhaps a better term than "schema", since by the
            latter people typically mean an explicit specification. I guess
            it's my use of the term "schema" which created some confusion
            initially.

            >> ... which reflects what the
            >> data is all about. Knowing such structure is useful (and often
            >> necessary) to be able to write meaningful queries and that's,
            >> I think, what the initial question was.
            >
            >
            > Certainly, and I would rewrite this question: how do you
            > find out data patterns in a dataset?

            I think it's a more general and tough question having to do
            with data mining. Not sure that anyone would venture into
            finding out data patterns against a public endpoint just to
            be able to write queries for it.

            >
            >>
            >> When such structure exists, I'd say
            >> that the dataset has an *implicit* schema (or a conceptual
            >> model, if you will).
            >
            >
            > Well, that's where I don't follow. If data, as it happens
            > more and more, is gathered from heterogeneous sources, the
            > very notion of a conceptual model is jumping to conclusions.

            A merger of structures is still a structure. But anyway, I've
            already agreed to say patterns =)

            > In natural languages, patterns often precede the
            > grammar describing them, even if the patterns described in
            > the grammar at some point become prescriptive rules. Data
            > should be looked at the same way.

            Not sure. I won't immediately disagree since I don't have
            statistics regarding structured/unstructured datasets out there.

            >>
            >> What is absent is an explicit representation of the schema,
            >> or the conceptual model, in terms of RDFS, OWL, or SKOS axioms.
            >
            >
            > When the dataset gathers various sources and various
            > vocabularies, such a schema does not exist, actually.

            Not necessarily. Parts of it may exist. Take YAGO, for
            example. It's derived from a bunch of sources including
            Wikipedia and GeoNames, and yet offers its schema for a
            separate download.

            >> However, when the schema *is* represented explicitly, knowing
            >> it is a huge help to users who otherwise know little about
            >> the data.
            >
            >
            > OK, but the question is: which is a good format for exposing
            > this structure? RDFS/OWL ontology/vocabulary, Application
            > Profiles, RDF Shapes / whatever it will be named, or ... ?

            I think this question is a bit secondary. If the need were
            recognized, this could be, at least in theory, agreed on.

            >>
            >> PPS. It'd also be correct to claim that even when a structure
            >> exists, realistic data can be messy and not fit into it
            >> entirely. We've seen stuff like literals in the range of
            >> object properties, etc. It's a separate issue having to do
            >> with validation, for which there's an ongoing effort at W3C.
            >> However, it doesn't generally hinder writing queries, which
            >> is what we're discussing here.
            >
            >
            > Well, I don't see it as a separate issue. All the raging
            > debate around RDF Shapes is not (yet) about validation, but
            > about the definition of what a shape/structure/schema can be.

            OK, won't disagree on this.

            Thanks,
            Pavel

             >
             >
             >>
             >> > Since the very notion of schema for RDF data has no
             >> > meaning at all, and the absence of schema is a bit
             >> > frightening, people tend to give it a lot of possible
             >> > meanings, depending on your closed world or open world
             >> > assumption, otherwise said whether the "schema" will be
             >> > used for some kind of inference or validation. The use of
             >> > "Schema" in RDFS has done nothing to clarify this, and the
             >> > use of "Ontology" in OWL added a layer of confusion. I
             >> > tend to say "vocabulary" to name the set of types and
             >> > predicates used by a dataset (like in Linked Open
             >> > Vocabularies), which is a minimal commitment to how it is
             >> > considered by the dataset owner, bearing in mind that this
             >> > "vocabulary" is generally a mix of imported terms from
             >> > SKOS, FOAF, Dublin Core ... and home-made ones. Which is
             >> > completely OK with the spirit of RDF.
             >> >
             >> > The brand new LDOM [1] or whatever it ends up being named
             >> > at the end of the day might clarify the situation, or
             >> > muddle those waters a bit more :)
             >> >
             >> > [1] http://spinrdf.org/ldomprimer.html
             >> >
             >> > 2015-01-23 10:37 GMT+01:00 Pavel Klinov
             >> > <[email protected]>:
             >> >>
             >> >> Alright, so this isn't an answer and I might be saying
             >> >> something totally silly (since I'm not a Linked Data
             >> >> person, really).
             >> >>
             >> >> If I re-phrase this question as the following: "how do I
             >> >> extract a schema from a SPARQL endpoint?", then it seems
             >> >> to pop up quite often (see, e.g., [1]). I understand that
             >> >> the original question is a bit more general but it's fair
             >> >> to say that knowing the schema is a huge help for writing
             >> >> meaningful queries.
             >> >>
             >> >> As an outsider, I'm quite surprised that there's still no
             >> >> commonly accepted (I'm avoiding "standard" here) way of
             >> >> doing this. People either hope that something like VoID
             >> >> or LOV vocabularies are being used, or use third-party
             >> >> tools, or write all sorts of ad hoc SPARQL queries
             >> >> themselves, looking for types, object properties,
             >> >> domains/ranges, etc. There are also papers written on
             >> >> this subject.
             >> >>
             >> >> At the same time, the database engines which host
             >> >> datasets often (not always) manage the schema separately
             >> >> from the data. There are good reasons for that. One
             >> >> reason, for example, is to be able to support basic
             >> >> reasoning over the data, or integrity validation. Just
             >> >> because in RDF the schema language and the data language
             >> >> are the same, so schema and data triples can be
             >> >> interleaved, it need not (and often is not) be managed
             >> >> that way.
             >> >>
             >> >> Yet, there's no standard way of requesting the schema
             >> >> from the endpoint, and I don't quite understand why.
             >> >> There's the SPARQL 1.1 Service Description, which could,
             >> >> in theory, cover it, but it doesn't. Servicing such
             >> >> schema extraction requests doesn't have to be mandatory,
             >> >> so the endpoints which don't have their schemas right
             >> >> there don't have to sift through the data. Also, schemas
             >> >> are typically quite small.
             >> >>
             >> >> I guess there's some problem with this which I'm
             >> >> missing...
             >> >>
             >> >> Thanks,
             >> >> Pavel
             >> >>
             >> >> [1]
             >> >> http://answers.semanticweb.com/questions/25696/extract-ontology-schema-for-a-given-sparql-endpoint-data-set
             >> >>
             >> >> On Thu, Jan 22, 2015 at 3:09 PM, Juan Sequeda
             >> >> <[email protected]> wrote:
             >> >> > Assume you are given a URL for a SPARQL endpoint. You
             >> >> > have no idea what data is being exposed.
             >> >> >
             >> >> > What do you do to explore that endpoint? What queries
             >> >> > do you write?
             >> >> >
             >> >> > Juan Sequeda
             >> >> > +1-575-SEQ-UEDA
             >> >> > www.juansequeda.com
             >> >>
             >> >
             >> >
             >> >
             >> >
             >
             >
             > --
             > Bernard Vatant
             > Vocabularies & Data Engineering
             > Tel : + 33 (0)9 71 48 84 59
             > Skype : bernard.vatant
             > http://google.com/+BernardVatant
             > --------------------------------------------------------
             > Mondeca
             > 35 boulevard de Strasbourg 75010 Paris
             > www.mondeca.com
             > Follow us on Twitter : @mondecanews
             > ----------------------------------------------------------





    --

    Michael Uschold
    Senior Ontology Consultant, Semantic Arts
    http://www.semanticarts.com
        LinkedIn: http://tr.im/limfu
        Skype, Twitter: UscholdM






--

Michael Uschold
Senior Ontology Consultant, Semantic Arts
http://www.semanticarts.com
    LinkedIn: http://tr.im/limfu
    Skype, Twitter: UscholdM



