On 30 Jan 2012, at 14:23, Andy Seaborne wrote:

> Yes, but :-) that's without writing any kind of adaptor code.
> 
> I was looking for a way to reuse the existing parser code.  If you want to 
> start from scratch then it's a different ball game.
> 
> There are two cases:
> 
> RDF/XML: (Yuk) Jena uses an XML parser - the first point is finding a 
> suitable XML parser - the SAX interface means it might be possible to adapt 
> it into a pipeline-based process.

Well yes, the good thing about RDF/XML is that I think nobody cares about it 
anymore. :-)
I am told this Apache-licensed parser is very good:

  https://github.com/FasterXML/aalto-xml

How difficult would it be to use that?

> 
> The rest: Turtle parsers are quite easy to write.  In fact, the actual 
> parser isn't really the bulk of the work.
> 
> The purest actor-style implementation needs to split out the parsing phases: 
> bytes to chars, chars to tokens, tokens to triples.  Each of those steps is a 
> small state machine, and it looks a whole lot easier to write them as separate 
> FSMs.  Even UTF-8 chars can be split across byte-buffer boundaries.

I'm doing some research there.
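For instance, the bytes-to-chars stage you mention - the one where a UTF-8 char 
can be split across buffer boundaries - could, I imagine, look something like 
this. A sketch of my own using java.nio's CharsetDecoder, not Jena code; all 
names are mine:

```scala
import java.nio.{ByteBuffer, CharBuffer}
import java.nio.charset.StandardCharsets

// Incremental bytes-to-chars stage: keeps any undecoded trailing bytes
// (e.g. half of a multi-byte UTF-8 sequence) until the next chunk arrives.
class Utf8Chunker {
  private val decoder = StandardCharsets.UTF_8.newDecoder()
  private var pending = ByteBuffer.allocate(0)

  def feed(chunk: Array[Byte]): String = {
    // Prepend whatever bytes were left over from the previous chunk
    val in = ByteBuffer.allocate(pending.remaining() + chunk.length)
    in.put(pending).put(chunk)
    in.flip()
    // UTF-8 never yields more chars than bytes, so this is large enough
    val out = CharBuffer.allocate(in.remaining() + 1)
    decoder.decode(in, out, false) // false: more input may follow
    pending = in.slice()           // stash the incomplete trailing bytes
    out.flip()
    out.toString
  }
}
```

The same shape (state object plus a feed method) would repeat for the 
chars-to-tokens and tokens-to-triples FSMs.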

> 
> Practical points:
> 
> 1/ for all the small documents (say, less than 50K) it might be simpler to 
> gather the bytes together and parse whole documents.  Then devote a thread to 
> large documents - assumes you get Content-Length.  This isn't as ideal as a 
> complete rewrite but it's less work.  Isn't thread stack size the key 
> determinant of space used?

Yes, that's an ugly band-aid, but I'll use it in the meantime, as I would 
like to get more familiar with actor-based programming.
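The band-aid dispatch you describe might look roughly like this - a sketch of 
my own, with a made-up threshold name and signature:

```scala
import java.util.concurrent.Executors

// Band-aid dispatch: small (or unknown-size) documents get buffered and
// parsed whole on a shared pool; big documents get a dedicated thread.
val SmallDoc = 50 * 1024 // bytes, per the suggestion above

val pool = Executors.newFixedThreadPool(
  Runtime.getRuntime.availableProcessors)

def dispatch(contentLength: Option[Long])(parse: () => Unit): Unit =
  contentLength match {
    case Some(len) if len > SmallDoc =>
      new Thread(() => parse()).start() // dedicated thread for a big doc
    case _ =>
      pool.submit(new Runnable { def run(): Unit = parse() })
  }
```

As you say, the cost is one thread stack per in-flight large document, which is 
exactly what I'd like to avoid eventually.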

> 2/ Have X threads (where X ~ # cores), use an executor pool to batch 
> requests together.  The far end will start sending and it will be buffered 
> at the low levels.  There aren't any extra CPU cycles to go round, so while 
> it's batch-y it isn't going to go faster with more active parsers.

I think that without making the parsers non-blocking, everything else is just 
going to be ugly and inefficient. Making the parsers non-blocking will make 
everything else clean and seamless.

For example, one could easily create a proxy that relays 1 GB of RDF files 
while using only a few kilobytes of memory, by simply reading in triples and 
emitting them in another format on the other end, before even the first 
document has finished parsing.
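The proxy's output side is almost trivial once triples stream in one at a time. 
A toy sketch (Triple and the class name are my own stand-ins, not Jena's API):

```scala
import java.io.Writer

// Toy streaming re-serializer: each parsed triple is written out the moment
// it arrives, so memory use is bounded by one triple, not the document size.
case class Triple(s: String, p: String, o: String)

class NTriplesRelay(out: Writer) {
  def emit(t: Triple): Unit = {
    out.write(s"<${t.s}> <${t.p}> <${t.o}> .\n")
    out.flush() // push downstream before the source document has finished
  }
}
```

Wired up as the sink of a non-blocking parser, this is the whole proxy: no 
model is ever materialised.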



> 
> I am interested in the question on rendezvous still, by the way - how does 
> the app want to be notified that parsing has finished, and does it avoid 
> touching the model during this time?
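One answer I'd suggest: the parsing actor completes a Promise when it sees 
end-of-input, and the app only ever gets at the model through the Future, so 
it never observes a half-built model. A sketch with my own names:

```scala
import scala.concurrent.{Future, Promise}

// Rendezvous sketch: the parser owns the model until it calls finish();
// the app holds only the Future and blocks or chains callbacks on it.
class ParseJob[M](model: M) {
  private val done = Promise[M]()
  def finish(): Unit = done.success(model)       // called by the parser actor
  def fail(e: Throwable): Unit = done.failure(e) // propagate parse errors
  def result: Future[M] = done.future            // handed to the application
}
```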
> 
>       Andy
> 
> On 30/01/12 12:57, Henry Story wrote:
>> So I wrote out a gist that shows how one should be able to use Jena Parsers
>> It is here:
>> 
>>    https://gist.github.com/1704255
>> 
>> But I get the exception
>> 
>> ERROR (WebFetcher.scala:59) : org.xml.sax.SAXParseException; systemId: 
>> http://bblfish.net/people/henry/card.rdf; lineNumber: 134; columnNumber: 44; 
>> XML document structures must start and end within the same entity.
>> com.hp.hpl.jena.shared.JenaException: org.xml.sax.SAXParseException; 
>> systemId: http://bblfish.net/people/henry/card.rdf; lineNumber: 134; 
>> columnNumber: 44; XML document structures must start and end within the same 
>> entity.
>>      at 
>> com.hp.hpl.jena.rdf.model.impl.RDFDefaultErrorHandler.fatalError(RDFDefaultErrorHandler.java:60)
>>      at 
>> com.hp.hpl.jena.rdf.arp.impl.ARPSaxErrorHandler.fatalError(ARPSaxErrorHandler.java:51)
>>      at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.warning(XMLHandler.java:211)
>>      at 
>> com.hp.hpl.jena.rdf.arp.impl.XMLHandler.fatalError(XMLHandler.java:241)
>>      at o
>> 
>> As expected, because one cannot pass partial documents to the reader.
>> 
>> Henry
>> 
>> 
>> On 29 Jan 2012, at 23:52, Henry Story wrote:
>> 
>>> 
>>> On 29 Jan 2012, at 23:28, Henry Story wrote:
>>> 
>>>> 
>>>> On 29 Jan 2012, at 23:04, Andy Seaborne wrote:
>>>> 
>>>>> Hi Henry,
>>>>> 
>>>>> On 29/01/12 21:40, Henry Story wrote:
>>>>>> [ I just opened a bug report for this, but it was suggested that a wider
>>>>>> discussion on how to do it would be useful on this list. ]
>>>>> 
>>>>> The thread of interest is:
>>>>> 
>>>>> http://www.mail-archive.com/[email protected]/msg02451.html
>>>>> 
>>>>>> Unless I am mistaken the only way to parse some content is using methods 
>>>>>> that use an
>>>>>> InputStream such as this:
>>>>>> 
>>>>>>    val m = ModelFactory.createDefaultModel()
>>>>>>    m.getReader(lang.jenaLang).read(m, in, base.toString)
>>>>> 
>>>>> As already commented on the thread, passing the reader to an actor allows 
>>>>> async reading.  Readers are configurable - you can have anything you 
>>>>> like.  No reason why the RDFReader can't be using async NIO.
>>>> 
>>>> Mhh, can I call at time t1
>>>> 
>>>>  reader.read( model, inputStream, base);
>>>> 
>>>> with an inputStream that only contains a chunk of the data? And then call 
>>>> it again with
>>>> another chunk of the data later with a newly filled input stream that 
>>>> contains the next segment
>>>> of the data?
>>>> 
>>>>  reader.read( model, inputStream2, base);
>>>> 
>>>> It says nothing about that in the documentation, so I just assumed it does 
>>>> not work...
>>> 
>>> Well I did look at the code (but perhaps not deeply enough, and only the 
>>> released
>>> version of Jena). From that I got the feeling that one has to send one 
>>> whole RDF
>>> document down an input stream at a time.
>>> 
>>> If one cannot send chunks to the reader, then essentially the thread that 
>>> calls the read(...) method above will block until the whole document is 
>>> read in. Even if an actor calls that method, the actor will then block the 
>>> thread it is executing in until it is finished. So actors don't help 
>>> (unless there is some magic I don't know about). Now if the server serving 
>>> the document is serving it at 56 baud, really slowly, then one thread 
>>> could be used up even though it is doing very little work.
>>> 
>>> If on the other hand I could send partial pieces of XML documents down 
>>> different input streams at different times, then the NIO thread could call 
>>> the reader every time it received some data - for example, in the code I 
>>> was writing here using the http-async-client: https://gist.github.com/1701141
>>> 
>>> The method I have now on lines 39-42
>>> 
>>>  def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
>>>    bodyPart.writeTo(out)
>>>    STATE.CONTINUE
>>>  }
>>> 
>>> 
>>>  could be changed to
>>> 
>>>  def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
>>>    reader.read(model, 
>>>      new ByteArrayInputStream(bodyPart.getBodyPartBytes()), base)
>>>    STATE.CONTINUE
>>>  }
>>> 
>>>  and so the body parts would be consumed by the reader in chunks.
>>> 
>>>> 
>>>>> 
>>>>> There is also RIOT - have you looked at passing the read request to a 
>>>>> parser in an actor, then catching the Sink<Triple>  interface for the 
>>>>> return?  That works in an actor style.
>>>>> 
>>>>> The key question is what Jena can enable, so that possibilities can be 
>>>>> built on top.  I don't think Jena is the right level at which to pick 
>>>>> one approach over another, as it is in danger of clashing with other 
>>>>> choices in the application.  Your Akka code is a good example of one 
>>>>> possible choice.
>>>>> 
>>>>>> I did open the issue-203 so that when we agree on a solution we could 
>>>>>> send in
>>>>>> some patches.
>>>>> 
>>>>> Look forward to seeing this,
>>>>> 
>>>>>   Andy
>>>> 
>>>> Social Web Architect
>>>> http://bblfish.net/
>>>> 
>>> 
> 

Social Web Architect
http://bblfish.net/
