Re: RIOT reader - new reader system for Jena.

Andy Seaborne Fri, 19 Oct 2012 04:29:39 -0700

On 18/10/12 22:21, Rob Vesse wrote:

Hey Andy


Sorry for taking forever to get back to you on this but comments inline:


On 8/17/12 5:54 AM, "Andy Seaborne" <[email protected]> wrote:

I'm at the point of being ready to integrate RIOT and a new reader system
into Jena properly.  This means we can remove the old parsers in
jena-core (not ARP).

There is a "but" however.

RIOT supports both triples and quads readers and model/graphs and
datasets/datasetgraphs ... but classes for all things quad are in ARQ.

I've created a JIRA but I thought I'd surface it here because it has the
potential to be disruptive.

https://issues.apache.org/jira/browse/JENA-300

== Integration

Possibilities:

1/ Put the code in ARQ
1a/ require a call to ARQ to initialize
1b/ make jena-core do a reflection call to ARQ initialization.

2/ Merge jena-arq and jena-core

The obvious issue for (2) is that the result is a big project to work
with.  Whether a larger jena-core really makes a difference in the real
world, I don't know.  Long term, some redivision into separate modules
would be good but it's quite hard to find any breakdown of core concepts
if you want testing by module.  It's hard to do anything much without a
memory graph implementation!

If (2),  it would be good to time this with making an uber jar
"jena-VERSION.jar" so that people switch to that and don't see any
future reorg of the modules unless they take a detailed look.



How about this as a suggestion for the short term:

- Move Quad and the riot sub-system into jena-core
- Replace the jena-core reader machinery with the riot sub-system

This has the advantage of keeping everything query still in its own
module and does not need to break down core.  Ideally it would be nice to
split off the riot sub-system into its own module but then you get into
problems of there being no reader/writer sub-system in core and requiring
users to pull in an extra dependency for one of the most common things
they are going to do.  I assume you plan to integrate this after 2.7.4,
perhaps with a minor version bump, i.e. 2.8.0.

Longer term I tried to think of some ways to nicely separate things out
but was kinda struggling: with the Model interface as it stands (with its
own read()/write() methods) there is no way to cleanly separate the riot
sub-system out from jena-core/jena-arq in the same way that Sesame
separate their IO subsystem into their RIO modules.  They have a
sesame-rio-api module and then specific small modules implementing each
reader.

If we could remove the read()/write() methods from Model then we can start
to get a better separation of concerns:

- Interfaces for reading/writing form a jena-riot-api module
- Implementations form another module jena-riot-std module

In the place of a read()/write() method directly on a Model we can provide
a static ModelIO class with read() and write() methods. Wiring up of
readers and writers for use by this could perhaps be done automagically
through some combination of package scanning and Java annotations?
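
A minimal sketch of the static ModelIO idea, under stated assumptions.
LangReader, LineReader and the register() method are hypothetical names
made up for illustration, not existing Jena classes, and a plain list of
strings stands in for a Model so the example stays self-contained.
Readers are registered explicitly here; package scanning or annotations
could fill the same registry.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical reading interface, as a jena-riot-api style module might define it.
interface LangReader {
    // Parse the input and append each statement (here just a line of text) to the sink.
    void read(Reader in, List<String> sink) throws IOException;
}

// Hypothetical static facade standing in for Model.read(); not an existing Jena class.
final class ModelIO {
    private static final Map<String, LangReader> READERS = new ConcurrentHashMap<>();

    private ModelIO() {}

    // Implementations register themselves under a case-insensitive language name.
    static void register(String lang, LangReader reader) {
        READERS.put(lang.toUpperCase(Locale.ROOT), reader);
    }

    // Look up the reader for the language and collect what it parses.
    static List<String> read(Reader in, String lang) {
        LangReader reader = READERS.get(lang.toUpperCase(Locale.ROOT));
        if (reader == null) {
            throw new IllegalArgumentException("No reader registered for " + lang);
        }
        List<String> sink = new ArrayList<>();
        try {
            reader.read(in, sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink;
    }
}

// Toy implementation: one statement per non-blank line, in the spirit of N-Triples.
final class LineReader implements LangReader {
    @Override
    public void read(Reader in, List<String> sink) throws IOException {
        BufferedReader br = new BufferedReader(in);
        for (String line = br.readLine(); line != null; line = br.readLine()) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                sink.add(trimmed);
            }
        }
    }
}
```

With this shape the facade only depends on the interface, so the
implementations could live in a separate module and register at
initialization time.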

Hope these thoughts help

Rob

Moving just RIOT out of ARQ is the way to go. It's not just Quad - it's
Dataset as well which is public API. Quad informally is as well so it
needs to be coupled with a significant version change. While not in the
API, extensions and deep working with ARQ does tend to arrive at Quad.

After that, it's the effect of pulling the thread that yanks more stuff.
The jena-*-api module idea would work although maybe some testing
might need to be put into a testing module to get ordering right. Hard
to test APIs without an implementation to hand.

RIOT would be its own subsystem - it does not need all of jena-core (it
should not need the client API for example, or OntAPI, it does need
datatypes).

I'm not convinced about one module per parser because they (this is "not
RDF/XML") share so much but one module for RIOT would be ideal. I
confess I don't like it when the internal need for a module structure
ends up dictating the public API design - sometimes a public API with a
mix of things is easier to use but the mix may be across internal design.

All .read calls do become legacy ways of getting to a library - it is
inverting the structure. What it bites is WebReader2 - the class of
static functions that reads things. It has both Model and Dataset calls.

Having ModelIO for all Model calls and DatasetIO for all dataset calls
is a good thought. I'll give that a go and see if the dependencies work out.

But it is a nice example of where internal divisions force public API.
What if you read a web location, not knowing if it's triples or quads?
Be nicer to get the right thing back, not have to decide before the call
whether it's triples or whether it's quads.

All .read calls do become legacy ways of getting to the new reader
structure. riot-reader rewires existing Jena to route the .read calls
to one piece of code. It was one single reader but two places require
the language to be known in advance so it is one very thin reader per
language to add in the default value. The existing code does
newInstance after deciding the language. The new code delays the
language decision until after conneg.
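
To illustrate delaying the language decision until after conneg, here is
a rough sketch, not the riot-reader code itself. The class LangResolver,
its method names and the language labels are assumptions for
illustration; the idea is that the Content-Type returned by the server
wins, and the per-language thin reader only contributes a default.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: choose the syntax only once the response headers are known.
final class LangResolver {
    private static final Map<String, String> BY_CONTENT_TYPE = new ConcurrentHashMap<>();
    static {
        // Illustrative starting set; the language labels are arbitrary strings here.
        register("text/turtle", "TURTLE");
        register("application/rdf+xml", "RDF/XML");
        register("application/n-triples", "N-TRIPLES");
    }

    private LangResolver() {}

    // New content types can be connected to a language at any time.
    static void register(String mediaType, String lang) {
        BY_CONTENT_TYPE.put(mediaType.toLowerCase(Locale.ROOT), lang);
    }

    // contentTypeHeader: value from the HTTP response, may be null (e.g. local file).
    // defaultLang: what the calling thin reader would have assumed, may be null.
    static String resolve(String contentTypeHeader, String defaultLang) {
        if (contentTypeHeader != null) {
            // Drop parameters such as "; charset=utf-8" before the lookup.
            int semi = contentTypeHeader.indexOf(';');
            String mediaType = (semi >= 0 ? contentTypeHeader.substring(0, semi) : contentTypeHeader)
                    .trim().toLowerCase(Locale.ROOT);
            String lang = BY_CONTENT_TYPE.get(mediaType);
            if (lang != null) {
                return lang;
            }
        }
        if (defaultLang != null) {
            return defaultLang;
        }
        throw new IllegalArgumentException(
                "Cannot determine syntax for content type: " + contentTypeHeader);
    }
}
```

A caller that asked for RDF/XML but was served text/turtle would get the
Turtle parser; an unrecognised content type falls back to the caller's
default.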

If we have a single "one jar", then the trick of jena-core making a
reflection call on RIOT to initialize with RIOT reader will work. The
user will see new code without us beforehand having to undergo a deep
restructure but eventually the restructure should happen, on a timescale
that is relaxed, not forced by release cycles.
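
The reflection trick could look roughly like the following. This is a
sketch under assumptions: OptionalInit and tryInit are made-up names,
and which class and method jena-core would actually call is left open.
The point is only that jena-core has no compile-time dependency on the
module it initializes.

```java
import java.lang.reflect.Method;

// Hypothetical helper for calling an initializer that may not be on the classpath.
final class OptionalInit {
    private OptionalInit() {}

    // Invoke a public static no-argument method by name.
    // Returns true if it was found and invoked, false if the class or method is absent.
    static boolean tryInit(String className, String methodName) {
        try {
            Class<?> cls = Class.forName(className);
            Method init = cls.getMethod(methodName);
            init.invoke(null);   // null receiver: the method is expected to be static
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            // Optional module not present: carry on with the built-in readers.
            return false;
        } catch (ReflectiveOperationException e) {
            // The initializer exists but failed; surface that rather than hide it.
            throw new IllegalStateException("Initialization of " + className + " failed", e);
        }
    }
}
```

With a single jar the class is always found, so the fallback branch only
matters if jena-core is used on its own.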


        Andy


== Outline of the reader

There is a single class "WebReader2" that captures the process of
opening a connection to a resource/file/thing, deciding the syntax and
then calling the right parser.  This adds full http content negotiation
over what Jena currently does.

You can add new content-types and connect to the appropriate parser code.

It includes going through FileManager and if/when that is connected to
model.read, all the conneg, redirection and location mapping is made
fundamental. You can even make all URLs of a pattern
   http://myhost/data/turtle/file{n}
be Turtle files despite being served as text/plain.
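
As a sketch of that last point, a URL pattern can override whatever
syntax the served content type suggests. SyntaxOverrides is a
hypothetical class, not part of riot-reader, and reading "{n}" as "one
or more digits" is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical table of URL patterns that force a syntax regardless of content type.
final class SyntaxOverrides {
    // One rule: URLs fully matching the pattern are read as the given language.
    private static final class Rule {
        final Pattern pattern;
        final String lang;
        Rule(Pattern pattern, String lang) {
            this.pattern = pattern;
            this.lang = lang;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    void add(String urlRegex, String lang) {
        rules.add(new Rule(Pattern.compile(urlRegex), lang));
    }

    // Returns the forced language for a matching URL (first rule wins),
    // otherwise the language already chosen by content negotiation.
    String apply(String url, String negotiatedLang) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).matches()) {
                return rule.lang;
            }
        }
        return negotiatedLang;
    }
}
```

For example, after add("http://myhost/data/turtle/file\\d+", "TURTLE"),
a URL such as http://myhost/data/turtle/file7 is treated as Turtle even
when the server labels it text/plain.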

== Code

In an "Experimental" project:

https://svn.apache.org/repos/asf/jena/Experimental/riot-reader/

Code browse:
https://svn.apache.org/viewvc/jena/Experimental/riot-reader/src/main/java/riot_reader/

The package layout isn't right for integration.
