Re: [CODE4LIB] Python or Perl script for reading RDF/XML, Turtle, or N-triples Files

Owen Stephens Tue, 30 Sep 2014 10:44:31 -0700

I've not tried using the LCNAF RDF files, and I've not used RDFLib, but a 
couple of things from (a relatively small amount of) experience parsing RDF:


* Don't try to parse the RDF/XML, use n-triples instead.
* As Kyle mentioned, you might want to use command line tools to strip down
  the n-triples to only deal with data you actually want.
* Rapper and the Redland RDF libraries are a good place to start, and have
  bindings to Perl, PHP, Python and Ruby (http://librdf.org/raptor/rapper.html
  and http://librdf.org). This StackOverflow Q&A might help getting started:
  http://stackoverflow.com/questions/5678623/how-to-parse-big-datasets-using-rdflib
* If you want to move between RDF formats an alternative to Rapper is
  http://www.l3s.de/~minack/rdf2rdf/ - this succeeded converting a file of 48
  million triples in ttl to ntriples where Rapper failed with an 'out of
  memory' error (once in ntriples, Rapper can be used for further parsing).
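
As a rough illustration of why n-triples is the easier format to strip down: each statement sits on its own line, so a file can be filtered one line at a time without loading it into memory. The sketch below uses only the Python standard library. The function names, the example.org sample data and the choice of skos:prefLabel as the predicate are all assumptions made for illustration, and the line splitting is deliberately naive rather than a full n-triples parser.

```python
import io

# Predicate chosen only for illustration; substitute whichever one you need.
PREF_LABEL = "<http://www.w3.org/2004/02/skos/core#prefLabel>"


def iter_triples(lines):
    """Yield (subject, predicate, object) strings from N-Triples lines.

    Naive sketch: relies on subject and predicate containing no whitespace,
    and on each statement ending with a '.' on the same line.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)
        if len(parts) != 3 or not parts[2].endswith("."):
            continue  # not a complete statement; ignored in this sketch
        subject, predicate, rest = parts
        yield subject, predicate, rest[:-1].rstrip()


def filter_by_predicate(lines, wanted):
    """Yield (subject, object) pairs for statements using one predicate."""
    for subject, predicate, obj in iter_triples(lines):
        if predicate == wanted:
            yield subject, obj


# Hypothetical sample data standing in for a real (much larger) file.
sample = io.StringIO(
    '# made-up records for illustration\n'
    '<http://example.org/n1> <http://www.w3.org/2004/02/skos/core#prefLabel> "Twain, Mark"@en .\n'
    '<http://example.org/n1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .\n'
    '<http://example.org/n2> <http://www.w3.org/2004/02/skos/core#prefLabel> "Austen, Jane" .\n'
)

labels = list(filter_by_predicate(sample, PREF_LABEL))
for subject, label in labels:
    print(subject, label)
```

With a real download you would pass an open file object (for example open("authorities.nt", encoding="utf-8"), where the file name is hypothetical) in place of the sample, and the same loop would read it line by line.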


Some slightly random advice there, but maybe some of it will be useful!

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 30 Sep 2014, at 15:54, Jeremy Nelson <jeremy.nel...@coloradocollege.edu> 
wrote:

> Hi Jean,
> I've found rdflib (https://github.com/RDFLib/rdflib) on the Python side 
> exceedingly simple to work with and use. For example, to load the current 
> BIBFRAME vocabulary as an RDF graph using a Python shell:
> 
> >>> import rdflib
> >>> bf_vocab = rdflib.Graph().parse('http://bibframe.org/vocab/')
> >>> len(bf_vocab) # Total number of triples
> 1683
> >>> set(bf_vocab.subjects()) # A set of all unique subjects in the graph
> 
> 
> This module offers RDF/XML, Turtle, and N-triples support, along with 
> various options for retrieving and manipulating the graph's subjects, 
> predicates, and objects. I would advise installing the JSON-LD 
> (https://github.com/RDFLib/rdflib-jsonld) extension as well.
> 
> Jeremy Nelson
> Metadata and Systems Librarian
> Colorado College
> 
> -----Original Message-----
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jean 
> Roth
> Sent: Tuesday, September 30, 2014 8:14 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] Python or Perl script for reading RDF/XML, Turtle, or 
> N-triples Files
> 
> Thank you so much for the reply.
> 
> I have not investigated the LCNAF data set thoroughly.  However, my 
> default/ideal is to read in all variables from a dataset.  
> 
> So, I was wondering if anyone had an example Python or Perl script for 
> reading RDF/XML, Turtle, or N-triples files.  A simple/partial example would 
> be fine.
> 
> Thanks,
> 
> Jean
> 
> On Mon, 29 Sep 2014, Kyle Banerjee wrote:
> 
> KB> The best way to handle them depends on what you want to do. You need 
> KB> to actually download the NAF files rather than countries or other 
> KB> small files as different kinds of data will be organized 
> KB> differently. Just don't try to read multigigabyte files in a text 
> KB> editor :)
> KB> 
> KB> If you start with one of the giant XML files, the first thing you'll 
> KB> probably want to do is extract just the elements that are 
> KB> interesting to you. A short string parsing or SAX routine in your 
> KB> language of choice should let you get the information in a format you 
> KB> like.
> KB> 
> KB> If you download the linked data files and you're interested in 
> KB> actual headings (as opposed to traversing relationships), grep and 
> KB> sed in combination with the join utility are handy for extracting 
> KB> the elements you want and flattening the relationships into 
> KB> something more convenient to work with. But there are plenty of other 
> KB> tools that you could also use.
> KB> 
> KB> If you don't already have a convenient environment to work on, I'm a  
> KB> fan of virtualbox. You can drag and drop things into and out of your 
> KB> regular desktop or even access it directly. That way you can 
> KB> view/manipulate files with the linux utilities without having to 
> KB> deal with a bunch of clunky file transfer operations involving 
> KB> another machine. Very handy for when you have to deal with multigigabyte 
> KB> files.
> KB> 
> KB> kyle
> KB> 
> KB> On Mon, Sep 29, 2014 at 11:19 AM, Jean Roth <jr...@nber.org> wrote:
> KB> 
> KB> > Thank you!  It looks like the files are available as  RDF/XML, 
> KB> > Turtle, or N-triples files.
> KB> >
> KB> > Any examples or suggestions for reading any of these formats?
> KB> >
> KB> > The MARC Countries file is small, 31-79 kb.  I assume a script 
> KB> > that would read a small file like that would at least be a start 
> KB> > for the LCNAF
> KB> >
> KB> >
> KB>
