This all sounds rather good.

I have some thoughts on the "lang as a hint" notion, based on how I did
this in dotNetRDF.

Firstly, the common versions of the read() method equivalents are all
method overloads that do not require a lang argument.  Whenever these are
called we rely fully on content negotiation, file-extension-based guessing
or, at worst, reading in the data and applying some simple string
heuristics (e.g. if it contains rdf:RDF and other XML markup then it's
likely RDF/XML), depending on whether the source is a URL, file or string.
In this case parser selection is entirely up to the library, though which
parsers are mapped to which MIME types and file extensions is configurable
which, if I understood your email correctly, would also be the case in Jena?
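
A minimal sketch of the fallback guessing described above for files and
strings, written in Java for Jena readers. The class name, method names,
extension table and the "@prefix" check are all assumptions made for
illustration; this is not the dotNetRDF or Jena API.

```java
import java.util.Locale;

// Hypothetical sketch: the class, method names and tables are assumptions,
// not the dotNetRDF or Jena API.
final class FormatGuess {
    /** Guess a language from a file name's extension, or null if unknown. */
    static String fromExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) return null;
        switch (fileName.substring(dot + 1).toLowerCase(Locale.ROOT)) {
            case "rdf": return "RDF/XML";
            case "ttl": return "TTL";
            case "nt":  return "N-TRIPLES";
            default:    return null;
        }
    }

    /** Last resort: simple string heuristics over the data itself. */
    static String fromContent(String data) {
        if (data.contains("<rdf:RDF")) return "RDF/XML"; // the check mentioned above
        if (data.contains("@prefix")) return "TTL";      // assumed extra heuristic
        return null;
    }

    /** Try the extension first, then fall back to the content heuristics. */
    static String guess(String fileName, String data) {
        String lang = fromExtension(fileName);
        return lang != null ? lang : fromContent(data);
    }
}
```

A null result here would mean that no parser could be selected without an
explicit language from the caller.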

For the overloads that do take a language we take that as a strong hint:
- For URIs we use the stated language to constrain the Accept header sent
to just the known MIME types for that language, and then use the parser
corresponding to the hint language regardless of the returned content type
(assuming the server doesn't return an error response).
- For any other resource we use the parser corresponding to the language
regardless, and skip any file extension and other format guessing we would
normally perform.
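
The two rules above can be sketched roughly as follows, again in Java. This
is only an illustration under stated assumptions: HintedRead, acceptHeader,
chooseLanguage and the MIME type table are hypothetical names and values,
not taken from dotNetRDF or Jena.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from dotNetRDF or Jena.
final class HintedRead {
    // Illustrative language -> MIME type table; a real library would make this configurable.
    static final Map<String, List<String>> MIME_TYPES = new LinkedHashMap<>();
    static {
        MIME_TYPES.put("RDF/XML", List.of("application/rdf+xml"));
        MIME_TYPES.put("TTL", List.of("text/turtle"));
        MIME_TYPES.put("N-TRIPLES", List.of("application/n-triples", "text/plain"));
    }

    /** Accept header for a URI read: every known type without a hint, only the hint's types with one. */
    static String acceptHeader(String hintLang) {
        if (hintLang == null) {
            List<String> all = new ArrayList<>();
            for (List<String> types : MIME_TYPES.values()) all.addAll(types);
            return String.join(", ", all);
        }
        List<String> types = MIME_TYPES.get(hintLang);
        if (types == null) throw new IllegalArgumentException("Unknown language: " + hintLang);
        return String.join(", ", types);
    }

    /** Language whose parser is used: a hint always wins over the returned content type. */
    static String chooseLanguage(String hintLang, String returnedContentType) {
        if (hintLang != null) return hintLang;
        for (Map.Entry<String, List<String>> e : MIME_TYPES.entrySet()) {
            if (e.getValue().contains(returnedContentType)) return e.getKey();
        }
        return null; // caller would fall back to extension or content guessing
    }
}
```

With a hint of "TTL" the Turtle parser is chosen even if the server answers
with application/rdf+xml, which is exactly the case where the read fails if
the hint was wrong.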

Yes, this means that it is possible for the user to invoke a read()
operation that is doomed to fail, e.g. model.read("example.rdf", "TTL", ""),
but it lets the user defer to the library's best guess in most cases and
override it only when needed, as in the text/plain case you highlight.

As to whether we can easily add new overloads of read() that don't require
an explicit language without breaking existing things, I don't know.  I
assume model.read() is in one of the core model interfaces?

Hope this helps,

Rob


On 7/16/12 10:19 AM, "Andy Seaborne" <[email protected]> wrote:

>I've been working (in my ASF scratch area) on a new I/O subsystem to
>replace the current one in Jena.
>
>+ Replaces Turtle, NTriples completely with the RIOT ones, removing the
>current jena-core parsers.
>
>+ Adds content negotiation for the syntax when reading from URLs
>
>+ FileManager-like functionality when doing model.read.
>
>It's nearly ready to be merged into jena-core.
>
>This message is an update, and a request for comments and concerns,
>particularly any specific things we need to ensure compatibility.
>
>It's nearly ready to be merged into jena-core; then a few last things can
>be done (one or two cases can't be checked with the current RIOT
>wiring in setup because a few things are hard wired into ModelCom).
>
>An attempt to branch then merge should be possible in the next few weeks.
>
>- - - - - - - -
>
>WebReader is a new class of static methods that does everything through
>one algorithm.  There are lots of table-driven look ups so adding new
>languages will be possible - the usual suspects are all added as
>"extensions" to an empty base setup.
>
>Internally, it's driven by content type. Any "file open" generates a
>typed stream - the type is the content type (file extensions used for
>files).  This is different from current Jena where the language is
>chosen before any attempt to open a connection is made.
>
>1/ All file opening will go via the filemanager including
>model.read(url) so it covers HTTP, files (and Java resources if we want
>to - it's just how the default filemanager is setup).
>
>2/ model.read(url) does content negotiation over HTTP and looks at file
>extension for files.  And it looks at URL extension when it's text/plain
>on the basis that dropping files in a directory on a web server means
>they are served as text/plain.
>
>3/ RDFReaderF (the factory part) would be removed.
>
>As I discovered, a lot of stuff is hardwired anyway because there is
>static use of RDFReaderFImpl and one model.read operation has file
>opening hardwired.
>
>4/ Backwards compatibility for ARP
>
>When asking for an RDF Reader for RDF/XML, a special reader is returned
>which wraps the current ARP reader so setting properties for a custom
>reader works. But it does fix up file looking things by adding "file:".
>
>This is used only when RDF/XML is explicitly requested.  Otherwise,
>conneg and file ext guessing happens.  File extension is basic content
>type choosing for files.
>
>No other languages have any settable reader features.  There is a
>universal RDFReader for everything so model.getReader(lang) works.
>
>5/ model.read is a compatibility wrapper.
>
>WebReader.read(...) is the key operation, inverting the idea that models
>can have specialised readers - I'm not aware of this being used at all
>and in fact think it's not possible because some things are built into
>ModelCom.
>
>ModelCom has:
>   private static final RDFReaderF readerFactory = new RDFReaderFImpl();
>
>so only variation by model.getReader(lang) works
>
>
>What to do about difference of opinion as to the MIME type ....
>
>The model.read(..."lang"...) operations regard lang as a hint.  Into the
>mix go the stream content type and the hint.  File extension sets the
>content type.
>
>But they can disagree so what's the best thing to believe?
>
>If the content type is text/plain, then the hint language is used,
>because dropping a file in a directory on a webserver is likely to end
>up with it served as text/plain.  File extension is used for HTTP (!!).
>
>If the content type is not text/plain, at the moment the hint language
>is ignored.
>
>I was tempted to say that the hint language overrides anything
>discovered; this gets files with one extension which are actually
>another right (use case: ntriples that is really turtle).
>
>But it goes wrong in the httpd case: ask for "foo", hint that it is
>RDF/XML, and get back explicitly application/turtle (the right answer is
>now TTL).
>
>So to force the type, open a stream to the thing and then pass the
>(untyped) stream and a hint.  It's possible to force a particular reader
>but you have to do it a certain way.
>
>And assume people don't use .ttl for RDF/XML files very much.
>
>This may need fine tuning in the light of experience.
>
>       Andy
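
For comparison with the strong-hint approach, the content type versus hint
rule in the quoted message can be sketched as below. This is a reading of
the rule as stated there, not the WebReader implementation: HintResolution,
resolve and the content type table are hypothetical, and the URL-extension
lookup for text/plain is left out.

```java
import java.util.Map;

// Hypothetical sketch of the rule in the quoted message; not Jena code.
final class HintResolution {
    // Illustrative content type -> language table (assumed values).
    static final Map<String, String> LANG_BY_TYPE = Map.of(
            "application/rdf+xml", "RDF/XML",
            "text/turtle", "TTL");

    /**
     * Decide which language to parse with.
     * contentType is null for an untyped stream; hintLang may be null.
     */
    static String resolve(String contentType, String hintLang) {
        if (contentType == null) {
            return hintLang;                  // untyped stream: the hint forces the reader
        }
        if (contentType.equals("text/plain")) {
            return hintLang;                  // text/plain is not trusted: use the hint
        }
        return LANG_BY_TYPE.get(contentType); // any other type wins; the hint is ignored
    }
}
```

Under this rule a request hinted as RDF/XML that comes back as text/turtle
is parsed as Turtle, whereas the strong-hint approach would still try the
RDF/XML parser.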
