Re: Support for Non Blocking Parsers

Andy Seaborne Mon, 30 Jan 2012 05:24:06 -0800

Yes, but :-) that's without writing any kind of adaptor code.

I was looking for a way to reuse the existing parser code. If you want to start from scratch then it's a different ball game.


There are two cases:

RDF/XML: (Yuk) Jena uses an XML parser - the first point is finding a suitable XML parser - the SAX interface means it might be possible to adapt it to a pipeline-based process.

The rest: Turtle parsers are quite easy to write. In fact, the actual parser isn't really the bulk of the work.

The purest actor-style implementation needs to split out the parsing phases: bytes to chars, chars to tokens, tokens to triples. Each of those steps is a small state machine, and it looks a whole lot easier to write them as separate FSMs. Even the first step needs state: a UTF-8 char can be split across byte buffer boundaries.
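
To illustrate the bytes-to-chars phase, here is a minimal sketch of a decoder that copes with a UTF-8 character split across two buffers. It uses only the JDK's CharsetDecoder; the class name ChunkedUtf8Decoder and its feed method are made up for this example and are not part of Jena.

```scala
import java.nio.{ByteBuffer, CharBuffer}
import java.nio.charset.StandardCharsets

// Hypothetical helper (not a Jena class): decodes UTF-8 from byte chunks
// that may be cut at arbitrary positions, including mid-character.
final class ChunkedUtf8Decoder {
  private val decoder = StandardCharsets.UTF_8.newDecoder()
  // Bytes of an incomplete character carried over from the previous chunk.
  private var pending: Array[Byte] = Array.emptyByteArray

  /** Feed one chunk; returns the characters that are complete so far. */
  def feed(chunk: Array[Byte]): String = {
    val in = ByteBuffer.wrap(pending ++ chunk)
    // UTF-8 never produces more chars than input bytes.
    val out = CharBuffer.allocate(in.remaining())
    // endOfInput = false: a trailing partial character is left in `in`.
    val result = decoder.decode(in, out, false)
    if (result.isError) result.throwException()
    // Keep the undecoded tail for the next call.
    pending = new Array[Byte](in.remaining())
    in.get(pending)
    out.flip()
    out.toString
  }
}

object ChunkedUtf8DecoderDemo {
  def main(args: Array[String]): Unit = {
    val d = new ChunkedUtf8Decoder
    // "aé" in UTF-8 is 0x61 0xC3 0xA9; the chunks split the two-byte "é".
    val first = d.feed(Array[Byte](0x61, 0xC3.toByte))
    val second = d.feed(Array[Byte](0xA9.toByte))
    println(first + second)
  }
}
```

The chars-to-tokens and tokens-to-triples phases would need the same kind of carried-over state, which is why separate small state machines look attractive.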


Practical points:

1/ For all the small documents (say, less than 50K) it might be simpler to gather the bytes together and parse whole documents. Then devote a thread to each large document - this assumes you get Content-Length. This isn't as ideal as a complete rewrite but it's less work. Isn't thread stack size the key determinant of space used?

2/ Have X threads (where X ~ # cores) and use an executor to pool requests together. The far end will start sending and the data will be buffered in the low levels. There aren't any extra CPU cycles to go round, so while it's batch-y it isn't going to go any faster with more active parsers.
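
The two practical points above could be sketched roughly as follows, using only java.util.concurrent. The object name ParsePool, the bufferFirst helper and the choice to treat a missing Content-Length as "large" are assumptions made for the example; the 50K figure is the rough number from point 1, and `parse` stands in for whatever blocking parser call is actually used.

```scala
import java.util.concurrent.{Callable, ExecutorService, Executors, Future}

// Hypothetical sketch, not Jena API.
object ParsePool {
  // Assumed cut-off, taken from the "less than 50K" figure in point 1.
  val SmallDocumentLimit: Long = 50L * 1024

  // Point 2: roughly one worker thread per core.
  val pool: ExecutorService =
    Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())

  /** Point 1: buffer the whole body first only when Content-Length says it is small.
    * A missing Content-Length is treated as "large" here (an assumption). */
  def bufferFirst(contentLength: Option[Long]): Boolean =
    contentLength.exists(_ < SmallDocumentLimit)

  /** Queue a blocking parse on the pool; `parse` is a placeholder for the real call. */
  def submit[A](parse: () => A): Future[A] =
    pool.submit(new Callable[A] { def call(): A = parse() })
}
```

The pool threads are not daemon threads, so an application using something like this would call ParsePool.pool.shutdown() when it is done.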

I am still interested in the question of rendezvous, by the way - how does the app want to be notified that parsing has finished, and does it not touch the model during this time?
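
One possible shape for that rendezvous, purely as a sketch: the parse fills a private result on a worker thread and the application only receives it through a Future once parsing is complete, so it cannot touch a half-built model. ParsedDoc is a made-up stand-in for a model, not a Jena type, and parseAsync is a hypothetical name.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical sketch, not Jena API.
object Rendezvous {
  // Stand-in for a parsed model; triples are plain strings here.
  final case class ParsedDoc(triples: Vector[String])

  /** Runs the blocking `parse` on the given execution context.
    * The caller only sees the result after the whole parse has finished. */
  def parseAsync(parse: () => ParsedDoc)(implicit ec: ExecutionContext): Future[ParsedDoc] =
    Future(parse())

  def main(args: Array[String]): Unit = {
    import ExecutionContext.Implicits.global
    val done = parseAsync(() => ParsedDoc(Vector("s p o")))
    // Notification by callback; Await is used only to keep the demo alive.
    done.foreach(doc => println("parsed " + doc.triples.size + " triple(s)"))
    Await.ready(done, 5.seconds)
  }
}
```

Whether the app prefers a callback, a Future or a message to an actor is exactly the open question; this only shows the hand-over-on-completion variant.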


        Andy

On 30/01/12 12:57, Henry Story wrote:

So I wrote a gist that shows how one should be able to use the Jena parsers.
It is here:

    https://gist.github.com/1704255

But I get the exception

ERROR (WebFetcher.scala:59) : org.xml.sax.SAXParseException; systemId: http://bblfish.net/people/henry/card.rdf; lineNumber: 134; columnNumber: 44; XML document structures must start and end within the same entity.
com.hp.hpl.jena.shared.JenaException: org.xml.sax.SAXParseException; systemId: http://bblfish.net/people/henry/card.rdf; lineNumber: 134; columnNumber: 44; XML document structures must start and end within the same entity.
        at com.hp.hpl.jena.rdf.model.impl.RDFDefaultErrorHandler.fatalError(RDFDefaultErrorHandler.java:60)
        at com.hp.hpl.jena.rdf.arp.impl.ARPSaxErrorHandler.fatalError(ARPSaxErrorHandler.java:51)
        at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.warning(XMLHandler.java:211)
        at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.fatalError(XMLHandler.java:241)
        at o

As expected, because one cannot pass partial documents to the reader.

Henry


On 29 Jan 2012, at 23:52, Henry Story wrote:


On 29 Jan 2012, at 23:28, Henry Story wrote:


On 29 Jan 2012, at 23:04, Andy Seaborne wrote:

Hi Henry,

On 29/01/12 21:40, Henry Story wrote:

[ I just opened a bug report for this, but it was suggested that a wider
discussion on how to do it would be useful on this list. ]


The thread of interest is:

http://www.mail-archive.com/[email protected]/msg02451.html

Unless I am mistaken the only way to parse some content is using methods that
use an InputStream such as this:

    val m = ModelFactory.createDefaultModel()
    m.getReader(lang.jenaLang).read(m, in, base.toString)


As already commented on the thread, passing the reader to an actor allows async 
reading.  Readers are configurable - you can have anything you like.  No reason 
why the RDFReader can't be using async NIO.


Mhh, can I call at time t1

  reader.read( model, inputStream, base);

with an inputStream that only contains a chunk of the data? And then call it
again with another chunk of the data later with a newly filled input stream
that contains the next segment of the data?

  reader.read( model, inputStream2, base);

It says nothing about that in the documentation, so I just assumed it does not work...


Well I did look at the code (but perhaps not deeply enough, and only the
released version of Jena). From that I got the feeling that one has to send
one whole RDF document down an input stream at a time.

If one cannot send chunks to the reader then essentially the thread that calls
the read(...) method above will block until the whole document is read in. Even
if an actor calls that method, the actor will then block the thread that it is
executing in until it is finished. So actors don't help (unless there is some
magic I don't know about). Now if the server serving the document is serving it
really slowly, say at 56 baud, then one thread could be tied up even though it
is doing very little work.

If on the other hand I could send partial pieces of XML documents down different
input streams at different times, then the NIO thread could call the reader
every time it received some data. For example, in the code I was writing here
using the http-async-client: https://gist.github.com/1701141

The method I have now on lines 39-42

  def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
    bodyPart.writeTo(out)
    STATE.CONTINUE
  }


  could be changed to

  def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
    reader.read(model, new ByteArrayInputStream(bodyPart.getBodyPartBytes()), base)
    STATE.CONTINUE
  }

  and so the body part would be consumed by the read in chunks.


There is also RIOT - have you looked at passing the read request to a parser in an actor, then catching the Sink<Triple> interface for the return? That works in an actor style.

The key question is what Jena can enable, so that possibilities can be built on top. I don't think Jena is a good level at which to pick one approach over another, as it is in danger of clashing with other choices in the application. Your Akka is a good example of one possible choice.

I did open the issue-203 so that when we agree on a solution we could send in
some patches.


Look forward to seeing this,

        Andy


Social Web Architect
http://bblfish.net/


