OK, thank you for making this explicit. I suppose my curiosity here
revolved around where we (as an Any23 community) could/want to get
involved in making Any23 a better framework and potentially a
dependency within the semantic web projects within the ASF.

  However, I can't help but think that there are areas where we
(Any23 and Jena) can find commonality.

It would be good.  Add Stanbol and the-project-née-Linda.

together with a new I/O architecture:


accepted 100%

which is now ready for migrating into the codebase (after a pause due to
RDF-WG work and non-Apache time).

Now done ...


accepted 110%


In particular, the parser pipeline has been heavily tuned to get good load
performance for TDB.  (Long story to do with how Java I/O has hidden costs.)
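One classic hidden cost in Java I/O is that each read() on a raw
FileInputStream is a system call. A minimal, self-contained sketch of the
effect (not Jena code; the class and file names here are illustrative):

```java
import java.io.*;
import java.nio.file.*;

public class BufferedReadDemo {
    // Count bytes by single-byte reads. On a raw FileInputStream every
    // call crosses into the OS; BufferedInputStream batches reads into an
    // in-memory buffer, so the same loop makes far fewer system calls.
    static long countBytes(InputStream in) throws IOException {
        long n = 0;
        while (in.read() != -1) n++;
        return n;
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("demo", ".nt");
        Files.write(f, "<s> <p> <o> .\n".repeat(100_000).getBytes());

        long raw, buffered;
        try (InputStream in = new FileInputStream(f.toFile())) {
            raw = countBytes(in);            // one syscall per byte
        }
        try (InputStream in = new BufferedInputStream(new FileInputStream(f.toFile()))) {
            buffered = countBytes(in);       // syscalls amortised over a buffer
        }
        System.out.println(raw == buffered); // same bytes, very different cost
        Files.delete(f);
    }
}
```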


Jena framework specific?

Yes and no.

"Yes" -- the parsers use Jena classes but very few.

"no" -- but only as carriers for triples and terms. Output is to a Sink<Triple>, so output can go directly to a graph, a print stream, direct to storage (TDB), a stream filter, whatever.
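To make the Sink<Triple> idea concrete, here is a minimal, self-contained
sketch of the pattern -- the interface and class names below are
illustrative stand-ins, not Jena's actual classes: the parser pushes each
item to a sink, and the sink decides whether to collect, print, store, or
discard it.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal mirror of the sink idea: the producer never knows the destination.
interface Sink<T> {
    void send(T item);
    void flush();
    void close();
}

// Collects items into a list -- stands in for "directly to a graph".
class CollectorSink<T> implements Sink<T> {
    final List<T> items = new ArrayList<>();
    public void send(T item) { items.add(item); }
    public void flush() {}
    public void close() {}
}

// Counts and drops everything -- the kind of sink used when benchmarking a parser.
class DiscardSink<T> implements Sink<T> {
    long count = 0;
    public void send(T item) { count++; }
    public void flush() {}
    public void close() {}
}

public class SinkDemo {
    // A toy "parser" that emits one string per line into whatever sink it is given.
    static void parse(String data, Sink<String> sink) {
        for (String line : data.split("\n")) sink.send(line);
        sink.flush();
        sink.close();
    }

    public static void main(String[] args) {
        CollectorSink<String> collect = new CollectorSink<>();
        parse("<s> <p> <o> .\n<s> <p> <o2> .", collect);
        System.out.println(collect.items.size()); // 2

        DiscardSink<String> discard = new DiscardSink<>();
        parse("<s> <p> <o> .\n<s> <p> <o2> .", discard);
        System.out.println(discard.count); // 2
    }
}
```

The point of the design is that the parser is written once and the
destination is swapped by handing it a different sink.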

The carrier objects are from Jena's SPI - AKA the graph API, which is just Graph/Triple/Node/DatasetGraph/Quad (+datatypes).

ARP (the RDF/XML parser) does have its own abstraction of nodes to isolate it from the rest of Jena. Once upon a time it did run separately (it still can, but it's packaged with Jena now). All the RIOT parsers are doing is using a zero-copy approach to the same thing. Churning objects during N-Triples parsing is a measurable cost. The RIOT N-Triples parser does about 200K+ triples/s in ideal conditions [2].
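The object-churn point can be illustrated with a toy scan: String.split()
allocates a fresh String per token, while an index-based scan walks the
same characters and only materialises a term when the caller actually
needs one. A simplified sketch (whitespace-separated terms only, no
handling of spaces inside literals -- real N-Triples tokenising is more
involved):

```java
public class ZeroCopyTokens {
    // Count the terms on an N-Triples-style line without allocating a
    // String per token: track [start, end) indices into the original
    // buffer instead of cutting substrings.
    static int countTerms(String line) {
        int terms = 0;
        int i = 0, n = line.length();
        while (i < n) {
            while (i < n && line.charAt(i) == ' ') i++;   // skip spaces
            if (i >= n) break;
            int start = i;
            while (i < n && line.charAt(i) != ' ') i++;   // end of token
            // Compare in place; don't count the trailing '.' terminator.
            if (!line.regionMatches(start, ".", 0, i - start))
                terms++;
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(countTerms("<s> <p> \"o\" .")); // 3
    }
}
```

At hundreds of thousands of lines per second, skipping those per-token
allocations is exactly the kind of saving that shows up in a profile.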


The Jena API is built on the SPI - the API is much bigger than the SPI, which is really quite small and could be smaller.

        Andy

[1] http://mail-archives.apache.org/mod_mbox/jena-dev/201207.mbox/%3C5009735B.5020908%40apache.org%3E

[2] ideal: server or workstation class PC not doing anything else at the time. No other disk activity, no CPU activity. Materialise triples but send to a Sink that throws everything away.

gzip vs raw expanded file makes a small difference - raw is faster. But very large NT files are often written all in one go, so they are laid out well on disk for the disk interface to stream, and SSDs are not that much faster if the I/O is not random (I see < x2 faster for > x10 the cost mentioned; presumably the x10 is dropping).
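Reading gzipped N-Triples just means adding one decompression layer over
the byte stream; the triples that come out are identical. A
self-contained, in-memory sketch of the round trip (illustrative class
names, standard java.util.zip only):

```java
import java.io.*;
import java.util.zip.*;

public class GzipVsRaw {
    // Compress a payload in memory with gzip.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) { gz.write(raw); }
        return bos.toByteArray();
    }

    // Drain a stream and count the bytes delivered to the consumer.
    static long countBytes(InputStream in) throws IOException {
        long n = 0;
        byte[] buf = new byte[8192];
        int k;
        while ((k = in.read(buf)) != -1) n += k;
        return n;
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "<s> <p> <o> .\n".repeat(50_000).getBytes();
        byte[] packed = gzip(raw);

        long fromRaw = countBytes(new ByteArrayInputStream(raw));
        long fromGz  = countBytes(new GZIPInputStream(new ByteArrayInputStream(packed)));

        System.out.println(fromRaw == fromGz);          // same data either way
        System.out.println(packed.length < raw.length); // gzip much smaller on disk
    }
}
```

The trade-off above is that gzip shrinks what the disk must deliver at
the price of a CPU pass, so the gap between the two narrows when the raw
file already streams well sequentially.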



PS the Turtle parser is compliant with the latest RDF 1.1 spec and the draft
RDF 1.1 Turtle test suite.

Do we have these implementations over at Any23?

So I suppose the underlying question/conversation/discussion I was
putting forward concerns where, how and if both projects can benefit?
We both (communities) have tried to have this before... however, now
that the Scottish national football team is non-existent, I really have
nothing to do...

I know this is not a trivial issue... however I hope we are moving in
the right direction.

Yes.

The negative side


  Lewis

