Re: linked open data and PDF

Norman Gray Wed, 21 Jan 2015 09:21:18 -0800

Paul and Rod, hello.

> On 2015 Jan 21, at 16:32, Paul Houle <[email protected]> wrote:
> 
>       I think the world needs a survey of XMP metadata in the field.  Only by 
> inspection of a large set of diverse files can we say how good or bad the 
> situation actually is.


Rod's link at <http://rossmounce.co.uk/2013/01/06/pdf-metadata-using-exiftool/> 
is very interesting, and possibly encouraging.

>       There ought to be a tool that gives XMP-annotated documents a point 
> score for metadata quality;  you ought to get a lot of points for having the 
> simple things that were missing in the document exported from word like the 
> title, author,  copyright,  etc.

Now, that's a _Really_ good idea!  And just to prove how simple it is to do 
something crude:

% ./extract-xmp test-xmp.pdf | rapper -irdfxml -ontriples - test-xmp.pdf | 
python score-rdf.py 
...
11 triples found; metadata-goodness-score=12

This is with the python script included at the bottom.

>       Note it is not just about PDF but many kinds of media files that are 
> tagged with this,  so it really is about XMP,  not just PDF.

Very much so.

(Also, it's not even really about XMP: there are all sorts of ways of scraping 
metadata out of objects and turning it into something which an RDF parser can 
read, and from that point you can start being imaginative.  This is of course 
stupidly obvious to everyone on this list, but it's an aha! that many people 
haven't had yet.)
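
To make the "scrape it and turn it into RDF" step concrete, here is a minimal sketch that maps a dictionary of scraped key/value metadata onto N-Triples lines. The field names, the FIELD_TO_PREDICATE mapping, the example.org subject URI and the sample values are all assumptions made up for illustration; a real scraper would need its own mapping and more careful literal escaping.

```python
# Sketch only: turn scraped key/value metadata into N-Triples lines.
# FIELD_TO_PREDICATE, the subject URI and the sample values are
# hypothetical; they are not taken from any real extraction tool.

FIELD_TO_PREDICATE = {
    'title': 'http://purl.org/dc/elements/1.1/title',
    'author': 'http://purl.org/dc/elements/1.1/creator',
    'license': 'http://creativecommons.org/ns#license',
}


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (value.replace('\\', '\\\\')
                 .replace('"', '\\"')
                 .replace('\n', '\\n')
                 .replace('\r', '\\r'))


def to_ntriples(subject, metadata):
    """Return one N-Triples line per recognised metadata field."""
    lines = []
    for field, value in sorted(metadata.items()):
        pred = FIELD_TO_PREDICATE.get(field)
        if pred is None:
            continue  # no mapping for this field, so skip it
        if value.startswith(('http://', 'https://')):
            obj = '<{}>'.format(value)  # treat URL-like values as IRIs
        else:
            obj = '"{}"'.format(escape_literal(value))
        lines.append('<{}> <{}> {} .'.format(subject, pred, obj))
    return lines


if __name__ == '__main__':
    scraped = {
        'title': 'An example document',
        'author': 'A. N. Author',
        'license': 'http://creativecommons.org/licenses/by/4.0/',
        'pages': '3',  # unmapped field, silently dropped
    }
    for line in to_ntriples('http://example.org/test-xmp.pdf', scraped):
        print(line)
```

Lines in this form can be piped into any N-Triples consumer, including a scoring script that keys on the predicate.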

All the best,

Norman



#! /usr/bin/python

# score RDF for metadata goodness
#
# Usage:
#
#    ./extract-xmp test-xmp.pdf | rapper -irdfxml -ontriples - test-xmp.pdf |
#        python score-rdf.py

import sys, re

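# Crude N-Triples matcher: groups are (subject URI, blank-node subject,
# predicate URI, rest of line).
ntline = re.compile(r'(?:<([^>]*)>|(_:[^ ]*)) *<([^>]*)> *(.*)')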

scores = {'http://purl.org/dc/elements/1.1/creator': 1,
          'http://purl.org/dc/elements/1.1/title': 1,
          'http://purl.org/dc/elements/1.1/created': 1,
          'http://purl.org/dc/elements/1.1/abstract': 2,
          'http://ns.adobe.com/xap/1.0/rights/Marked': 3,
          'http://creativecommons.org/ns#license': 4
          }

no_triples = 0
score = 0

for line in sys.stdin:
    m = ntline.match(line)
    if m:
        bits = m.groups()
        print('{}  /  {}\n\t{}\n\t{}\n'.format(bits[0],
                                               bits[1],
                                               bits[2],
                                               bits[3]))

        no_triples = no_triples + 1
        pred = bits[2]
        if pred in scores:
            score += scores[pred]
    else:
        print("---didn't match {}".format(line.rstrip('\n')))

print('{} triples found; metadata-goodness-score={}'.format(no_triples, score))
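
For trying the scoring out without extract-xmp or rapper to hand, the same matching and scoring can be wrapped in a function and fed a few hand-written N-Triples lines. This is only a sketch: the function name score_lines, the example.org URIs and the sample triples are invented for illustration, and the score table is a subset of the one above.

```python
import re

# Same crude N-Triples pattern as in the script above.
NTLINE = re.compile(r'(?:<([^>]*)>|(_:[^ ]*)) *<([^>]*)> *(.*)')

# Subset of the score table above, for illustration.
SCORES = {
    'http://purl.org/dc/elements/1.1/creator': 1,
    'http://purl.org/dc/elements/1.1/title': 1,
    'http://creativecommons.org/ns#license': 4,
}


def score_lines(lines, scores=SCORES):
    """Return (number of triples matched, total score) for the given lines."""
    no_triples = 0
    score = 0
    for line in lines:
        m = NTLINE.match(line)
        if m:
            no_triples += 1
            # group(3) is the predicate URI; unknown predicates score 0
            score += scores.get(m.group(3), 0)
    return no_triples, score


# Hand-written sample input; the subjects and values are made up.
sample = [
    '<http://example.org/doc> <http://purl.org/dc/elements/1.1/title> "A title" .',
    '_:b0 <http://purl.org/dc/elements/1.1/creator> "A. N. Author" .',
    '<http://example.org/doc> <http://creativecommons.org/ns#license> '
    '<http://creativecommons.org/licenses/by/4.0/> .',
    '<http://example.org/doc> <http://example.org/other> "ignored" .',
    '# not a triple',
]

if __name__ == '__main__':
    print('{} triples found; metadata-goodness-score={}'.format(*score_lines(sample)))
```

Note that, as in the script, a predicate that appears several times is counted every time, so repeated triples raise the score.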

-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK
