On 16/07/12 18:44, Rob Vesse wrote:
All sounds rather good
I have some thoughts on the lang as a hint notion from my experience of
how I did this in dotNetRDF.
Firstly the common versions of the read() method equivalents are all
method overloads that do not require a lang argument. Whenever these are
called we fully rely on content negotiation, file extension based guessing
or at worst reading in the data and doing some simple string heuristics
(e.g. does it contain rdf:RDF and other XML stuff then it's likely
RDF/XML) depending on whether the source is a URL, file or string. For
this case parser selection is fully up to the library, though what parsers
are mapped to which MIME types and file extensions is configurable which
if I understood your email would also be the case in Jena?
Yes
And I'd hope this (not specifying a lang) became the normal way to read
data but it isn't at the moment.
For the overloads that do take a language we take that as a strong hint:
- For URIs we use the stated language to constrain the Accept header sent
to just the known MIME types for that language and then use the parser
corresponding to the hint language regardless of the returned content type
(assuming the server doesn't return an error response)
- For any other resource we use the parser corresponding to the language
regardless and skip any file extension and other format guessing we would
normally perform
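
A rough sketch of the selection order described above, in Java since that is
what Jena uses. The class name, method name and extension table are made up
for illustration; this is not dotNetRDF or Jena code, only the shape of the
"strong hint" rule: a stated language wins outright, otherwise fall back to
the file extension and then to a crude look at the data.

```java
import java.util.Locale;
import java.util.Map;

final class FormatGuessSketch {
    // Hypothetical extension table; a real library would make this configurable.
    static final Map<String, String> BY_EXTENSION = Map.of(
        "rdf", "RDF/XML",
        "ttl", "TURTLE",
        "nt", "N-TRIPLES");

    /**
     * Strong hint: if the caller names a language, use it and skip all guessing.
     * Returns null when nothing can be determined.
     */
    static String chooseLang(String name, String sample, String hint) {
        if (hint != null) {
            return hint;
        }
        int dot = name.lastIndexOf('.');
        if (dot >= 0) {
            String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
            String lang = BY_EXTENSION.get(ext);
            if (lang != null) {
                return lang;
            }
        }
        // Last resort: a simple string heuristic on the start of the data.
        if (sample != null && sample.contains("rdf:RDF")) {
            return "RDF/XML";
        }
        return null;
    }
}
```

Under this rule a call with name "example.rdf" and hint "TURTLE" selects the
Turtle parser, which is the "doomed to fail" case if the file really is
RDF/XML.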
The current situation is that all calls have a hint language so I think
we're in a different place to dotNetRDF because we start with lang being
used a lot.
There is one general Accept header - it asks for all types in some q
order. This copes with input such as "data.ttl" where old-style code
passes "RDF/XML" and the server says "text/turtle".
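
To make "all types in some q order" concrete, here is a small sketch that
builds such a header. The particular media types and q values are
illustrative assumptions, not the ones the new code actually sends, and the
class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class AcceptHeaderSketch {
    /** Join media types into an Accept header; q=1.0 is the default, so it is omitted. */
    static String build(Map<String, Double> prefs) {
        StringJoiner sj = new StringJoiner(", ");
        for (Map.Entry<String, Double> e : prefs.entrySet()) {
            double q = e.getValue();
            sj.add(q >= 1.0 ? e.getKey() : e.getKey() + ";q=" + q);
        }
        return sj.toString();
    }

    /** One general header, independent of any hint language (values are assumptions). */
    static String defaultHeader() {
        Map<String, Double> prefs = new LinkedHashMap<>();
        prefs.put("text/turtle", 1.0);
        prefs.put("application/rdf+xml", 0.9);
        prefs.put("*/*", 0.1);
        return build(prefs);
    }
}
```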
From where Jena apps start, it seems to me that "weak hint" is more
useful but I'd agree it's a fine balance.
If we were starting from a clean slate, I'd be more in favour of strong
(absolute) hinting.
Yes, this means that it is possible for the user to invoke a read()
operation that is doomed to fail, e.g. model.read("example.rdf", "TTL", ""),
but it lets the user defer to the library's best guess in most cases and
override it only when needed, e.g. in the text/plain case you highlight.
They can do that by opening the stream themselves and then asking to
read from the stream + their choice of hint. More hoops but we can't
have everything and I don't want to get into trying multiple parsers.
That can be done although it's non-trivial - need a locally buffered
input stream to back up over. That would be ideal even if it kills
performance but only for the mismatch of hint and actual.
(The buffering will find it hard to get out of the way after the right
lang is found).
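
For what "a locally buffered input stream to back up over" might look like,
a minimal sketch using java.io.BufferedInputStream mark/reset. The Parser
interface and the startsWith toy parser are hypothetical stand-ins, not RIOT
classes. It also shows the cost: every attempt has to stay within the mark
limit, and the buffering wrapper stays in the path after the right language
is found.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class RetryingReaderSketch {
    /** Hypothetical parser: returns false if the input is not in its language. */
    interface Parser {
        String lang();
        boolean parse(InputStream in) throws IOException;
    }

    /** Toy parser that only checks the leading bytes; a stand-in for a real one. */
    static Parser startsWith(String lang, String prefix) {
        byte[] expected = prefix.getBytes(StandardCharsets.US_ASCII);
        return new Parser() {
            public String lang() { return lang; }
            public boolean parse(InputStream in) throws IOException {
                return Arrays.equals(in.readNBytes(expected.length), expected);
            }
        };
    }

    /** Try each parser in turn, backing up over the buffered bytes after a failure. */
    static String readWithFallback(InputStream raw, List<Parser> parsers, int limit) {
        BufferedInputStream in = new BufferedInputStream(raw, limit);
        try {
            for (Parser p : parsers) {
                in.mark(limit);
                if (p.parse(in)) {
                    return p.lang();
                }
                in.reset(); // throws if the parser read more than `limit` bytes
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return null; // no parser accepted the input
    }
}
```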
As to whether we can easily add new overloads of read() that don't require
an explicit language without breaking existing things I don't know. I
assume model.read() is in one of the core model interfaces?
Yes - overloading would be good .. if it works, which it doesn't easily
transparently ... next message. (Beeping typed languages.)
Andy
Hope this helps,
Rob
On 7/16/12 10:19 AM, "Andy Seaborne" wrote:
I've been working (in my ASF scratch area) on a new I/O subsystem to
replace the current one in Jena.
+ Replaces Turtle, NTriples completely with the RIOT ones, removing the
current jena-core parsers.
+ Adds content negotiation for the syntax when reading from URLs
+ FileManager-like functionality when doing model.read.
It's nearly ready to be merged into jena-core.
This message is an update, and a request for comments and concerns,
particularly any specific things we need to ensure compatibility.
It's nearly ready to be merged into jena-core; then a few last things can
be done (one or two cases can't be checked with the current RIOT
wiring in the setup because a few things are hard-wired into ModelCom).
An attempt to branch then merge should be possible in the next few weeks.
- - - - - - - -
WebReader is a new class of static methods that does everything through
one algorithm. There are lots of table-driven look ups so adding new
languages will be possible - the usual suspects are all added as
"extensions" to an empty base setup.
Internally, it's driven by content type. Any "file open" generates a
typed stream - the type is the content type (file extensions used for
files). This is different from current Jena where the language is
chosen before any attempt to open a connection is made.
1/ All file opening will go via the filemanager including
model.read(url) so it covers HTTP, files (and Java resources if we want
to - it's just how the default filemanager is setup).
2/ model.read(url) does content negotiation over HTTP and looks at file
extension for files. And it looks at the URL extension when it's text/plain
on the basis that dropping files in a directory on a web server means
that they are served as text/plain.
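
A sketch of how point 2 could resolve the content type of a typed stream.
The class, the method names and the two-entry extension table are
assumptions made for illustration; the real lookup is table-driven and
configurable. It also assumes the server type has already been stripped of
parameters such as charset.

```java
import java.util.Locale;
import java.util.Map;

final class TypedStreamSketch {
    // Illustrative table only.
    static final Map<String, String> EXT_TO_TYPE = Map.of(
        "ttl", "text/turtle",
        "rdf", "application/rdf+xml");

    /** Content type implied by the extension of the last path segment, or null. */
    static String typeFromExtension(String uri) {
        int slash = uri.lastIndexOf('/');
        int dot = uri.lastIndexOf('.');
        if (dot <= slash) {
            return null; // no extension in the last segment
        }
        String ext = uri.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXT_TO_TYPE.get(ext);
    }

    /**
     * serverType is the Content-Type from HTTP, or null for a plain file.
     * A specific server type is believed; text/plain (or nothing) falls
     * back to the extension.
     */
    static String resolve(String uri, String serverType) {
        if (serverType != null && !serverType.equals("text/plain")) {
            return serverType;
        }
        String guessed = typeFromExtension(uri);
        return guessed != null ? guessed : serverType;
    }
}
```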
3/ RDFReaderF (the factory part) would be removed.
As I discovered, a lot of stuff is hardwired anyway because there is
static use of RDFReaderFImpl and one model.read operation has file
opening hardwired.
4/ Backwards compatibility for ARP
When asking for an RDF Reader for RDF/XML, a special reader is returned
which wraps the current ARP reader so setting properties for a custom
reader works. But it does fix up file-looking things by adding "file:".
This is used only when RDF/XML is explicitly requested. Otherwise,
conneg and file ext guessing happens. File extension is basic content
type choosing for files.
No other languages have any settable reader features. There is a
universal RDFReader for everything so model.getReader(lang) works.
5/ model.read is a compatibility wrapper.
WebReader.read(...) is the key operation, inverting the idea that models
can have specialised readers - I'm not aware of this being used at all
and in fact think it's not possible because some things are built into
ModelCom.
ModelCom has:
private static final RDFReaderF readerFactory = new RDFReaderFImpl();
so only variation by model.getReader(lang) works
What to do about difference of opinion as to the MIME type ....
The model.read(..."lang"...) operations regard lang as a hint. Into the
mix go the stream content type and the hint. File extension sets the
content type.
But they can disagree so what's the best thing to believe?
If the content type is text/plain, then the hint language is used,
because dropping a file in a directory on a webserver is likely to end up
with it as text/plain. File extension is used for HTTP (!!).
If the content type is not text/plain, at the moment the hint language
is ignored.
I was tempted to say that the hint language overrides anything
discovered; this gets files with one extension which are actually
another right (use case: ntriples that is really turtle).
But that gets the HTTP case wrong: ask for "foo", hint that it is RDF/XML
and get back explicitly application/turtle (the right answer is now
TTL).
So to force the type, open a stream to the thing and then pass the
(untyped) stream and a hint. It's possible to force a particular reader
but you have to do it a certain way.
And assume people don't use .ttl for RDF/XML files very much.
This may need fine tuning in the light of experience.
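
The current "weak hint" rule, as a sketch. The class name, the method and
the type-to-language table are hypothetical, and what to do with an
unrecognised content type is my assumption here (fall back to the hint);
the message above does not settle that case.

```java
import java.util.Map;

final class WeakHintSketch {
    // Illustrative mapping from content type to language name.
    static final Map<String, String> TYPE_TO_LANG = Map.of(
        "text/turtle", "TTL",
        "application/turtle", "TTL",
        "application/rdf+xml", "RDF/XML");

    /**
     * contentType is null for an untyped stream the caller opened themselves.
     * The hint is used only for an untyped or text/plain stream; a specific
     * content type overrides it.
     */
    static String chooseLang(String contentType, String hintLang) {
        if (contentType == null || contentType.equals("text/plain")) {
            return hintLang;
        }
        String lang = TYPE_TO_LANG.get(contentType);
        // Assumption: an unknown content type falls back to the hint.
        return lang != null ? lang : hintLang;
    }
}
```

So asking for "foo" with hint RDF/XML and getting back application/turtle
gives TTL, while passing an untyped stream plus a hint forces that hint.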
Andy