OK, I got the asynchronous parser to work for RDF/XML. Details are in the bug
report:

  https://issues.apache.org/jira/browse/JENA-203

Henry

PS. I wonder why I can't find this thread in the archive
  
http://mail-archives.apache.org/mod_mbox/incubator-jena-users/201201.mbox/browser


On 30 Jan 2012, at 22:36, Henry Story wrote:

> 
> On 30 Jan 2012, at 14:23, Andy Seaborne wrote:
> 
>> Yes, but :-) that's without writing any kind of adaptor code.
>> 
>> I was looking for a way to reuse the existing parser code.  If you want to 
>> start from scratch then it's a different ball game.
>> 
>> There are two cases:
>> 
>> RDF/XML: (Yuk) Jena uses an XML parser - the first point is finding a 
>> suitable XML parser - the SAX interface means it might be possible to adapt 
>> to being a pipeline based process.
> 
> Well yes, the good thing about RDF/XML is that I think now nobody cares 
> anymore. :-)
> I am told this Apache Licenced parser is very good.
> 
>  https://github.com/FasterXML/aalto-xml
> 
> How difficult would it be to use that?
> 
>> 
>> The rest: Turtle parsers are quite easy to write.  In fact, the actual 
>> parser isn't really the bulk of the work.
>> 
>> The purest actor-style implementation needs to split out the parsing phases: 
>> bytes to chars, chars to tokens, tokens to triples.  Each of those steps is 
>> a small state machine but it looks a whole lot easier to write as separate 
>> FSMs.  Even UTF-8 chars can be split across byte buffer boundaries.
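A minimal sketch of the bytes-to-chars phase mentioned above, assuming the
input is UTF-8 and arrives as arbitrary byte chunks. ChunkDecoder and feed are
illustrative names, not part of Jena; only the JDK's CharsetDecoder is used.
It shows how a multi-byte character split across two chunks can be carried
over to the next call.

```scala
import java.nio.{ByteBuffer, CharBuffer}
import java.nio.charset.StandardCharsets

// Hypothetical helper: decodes UTF-8 chunk by chunk and keeps any incomplete
// trailing byte sequence until the next chunk arrives.
final class ChunkDecoder {
  private val decoder = StandardCharsets.UTF_8.newDecoder()
  private var leftover = Array.emptyByteArray

  def feed(chunk: Array[Byte]): String = {
    val in  = ByteBuffer.wrap(leftover ++ chunk)
    // UTF-8 never yields more chars than it has bytes, so this is large enough.
    val out = CharBuffer.allocate(in.remaining())
    // endOfInput = false: an incomplete sequence at the end is left unread.
    val result = decoder.decode(in, out, false)
    if (result.isError) result.throwException() // malformed input
    leftover = new Array[Byte](in.remaining())
    in.get(leftover)
    out.flip()
    out.toString
  }
}

object ChunkDecoderDemo {
  def main(args: Array[String]): Unit = {
    // Two characters: a 2-byte one followed by a 3-byte one.
    val bytes = "\u00e9\u20ac".getBytes(StandardCharsets.UTF_8)
    val d = new ChunkDecoder
    // Both chunk boundaries fall in the middle of a character.
    val parts = Seq(bytes.take(1), bytes.slice(1, 3), bytes.drop(3)).map(d.feed)
    println(parts.mkString("|"))
  }
}
```

The tokenizer and triple-building phases would need the same kind of
carry-over state, which is why writing them as separate state machines looks
attractive.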
> 
> I am doing some research there.
> 
>> 
>> Practical points:
>> 
>> 1/ for all the small documents (say, less than 50K) it might be simpler to 
>> gather the bytes together and parse whole documents.  Then devote a thread 
>> to large documents - assumes you get Content-Length.  This isn't as ideal as 
>> a complete rewrite but it's less work.  Isn't thread stack size the key 
>> determinant of space used?
> 
> Yes, that's an ugly band-aid, but I'll use that in the meantime, as I would 
> like to get more familiar with actor-based programming.
> 
>> 2/ Have X threads (where X ~ # cores), use an executor to pool requests 
>> together.  The far end will start sending and it will be buffered in the low 
>> levels.  There aren't any extra CPU cycles to go round so while it's batch-y 
>> it isn't going to go faster with more active parsers.
> 
> I think without getting the parsers to be non-blocking, everything else is 
> just going to be ugly and inefficient. Getting the parsers to be non-blocking 
> will make everything else just clean and seamless.
> 
> For example one could easily create a proxy that could proxy 1 GB of RDF 
> files and only use up a few kbytes of memory, by simply reading in triples 
> and spitting them out in another format on the other end, before even the 
> first document had finished parsing.
> 
> 
> 
>> 
>> I am interested in the question on rendezvous still by the way - how does 
>> the app want to be notified parsing has finished and does it not touch the 
>> model during this time?
>> 
>>      Andy
>> 
>> On 30/01/12 12:57, Henry Story wrote:
>>> So I wrote a gist that shows how one should be able to use the Jena parsers.
>>> It is here:
>>> 
>>>   https://gist.github.com/1704255
>>> 
>>> But I get the exception
>>> 
>>> ERROR (WebFetcher.scala:59) : org.xml.sax.SAXParseException; systemId: 
>>> http://bblfish.net/people/henry/card.rdf; lineNumber: 134; columnNumber: 
>>> 44; XML document structures must start and end within the same entity.
>>> com.hp.hpl.jena.shared.JenaException: org.xml.sax.SAXParseException; 
>>> systemId: http://bblfish.net/people/henry/card.rdf; lineNumber: 134; 
>>> columnNumber: 44; XML document structures must start and end within the 
>>> same entity.
>>>     at com.hp.hpl.jena.rdf.model.impl.RDFDefaultErrorHandler.fatalError(RDFDefaultErrorHandler.java:60)
>>>     at com.hp.hpl.jena.rdf.arp.impl.ARPSaxErrorHandler.fatalError(ARPSaxErrorHandler.java:51)
>>>     at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.warning(XMLHandler.java:211)
>>>     at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.fatalError(XMLHandler.java:241)
>>>     at o
>>> 
>>> As expected, because one cannot pass partial documents to the reader.
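The same kind of failure can be reproduced without Jena. The sketch below
assumes the JDK's built-in SAX parser as a stand-in for Jena's RDF/XML reader;
PartialXmlDemo, parse and the sample document are illustrative only, and the
exact error message depends on the XML parser in use. It parses one complete
document and then only the first half of the same bytes.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.SAXParseException
import org.xml.sax.helpers.DefaultHandler

object PartialXmlDemo {
  // A small, complete RDF/XML document (made up for this example).
  val doc: String =
    "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">" +
    "<rdf:Description rdf:about=\"http://example.org/a\"/>" +
    "</rdf:RDF>"

  // Treats the bytes as one whole XML document, as a blocking read(...) would.
  // Returns the parse error, if there was one.
  def parse(bytes: Array[Byte]): Option[SAXParseException] = {
    val parser = SAXParserFactory.newInstance().newSAXParser()
    try {
      // DefaultHandler rethrows fatal errors as SAXParseException.
      parser.parse(new ByteArrayInputStream(bytes), new DefaultHandler())
      None
    } catch {
      case e: SAXParseException => Some(e)
    }
  }

  def main(args: Array[String]): Unit = {
    val bytes = doc.getBytes("UTF-8")
    println(parse(bytes))                                          // whole document
    println(parse(bytes.take(bytes.length / 2)).map(_.getMessage)) // first half only
  }
}
```

Each call is treated as a complete document, so a chunk that ends before the
root element is closed is reported as a fatal error.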
>>> 
>>> Henry
>>> 
>>> 
>>> On 29 Jan 2012, at 23:52, Henry Story wrote:
>>> 
>>>> 
>>>> On 29 Jan 2012, at 23:28, Henry Story wrote:
>>>> 
>>>>> 
>>>>> On 29 Jan 2012, at 23:04, Andy Seaborne wrote:
>>>>> 
>>>>>> Hi Henry,
>>>>>> 
>>>>>> On 29/01/12 21:40, Henry Story wrote:
>>>>>>> [ I just opened a bug report for this, but it was suggested that a wider
>>>>>>> discussion on how to do it would be useful on this list. ]
>>>>>> 
>>>>>> The thread of interest is:
>>>>>> 
>>>>>> http://www.mail-archive.com/[email protected]/msg02451.html
>>>>>> 
>>>>>>> Unless I am mistaken the only way to parse some content is using methods
>>>>>>> that use an InputStream such as this:
>>>>>>> 
>>>>>>>   val m = ModelFactory.createDefaultModel()
>>>>>>>   m.getReader(lang.jenaLang).read(m, in, base.toString)
>>>>>> 
>>>>>> As already commented on the thread, passing the reader to an actor 
>>>>>> allows async reading.  Readers are configurable - you can have anything 
>>>>>> you like.  No reason why the RDFReader can't be using async NIO.
>>>>> 
>>>>> Mhh, can I call at time t1
>>>>> 
>>>>> reader.read( model, inputStream, base);
>>>>> 
>>>>> with an inputStream that only contains a chunk of the data? And then call
>>>>> it again with another chunk of the data later with a newly filled input
>>>>> stream that contains the next segment of the data?
>>>>> 
>>>>> reader.read( model, inputStream2, base);
>>>>> 
>>>>> It says nothing about that in the documentation, so I just assumed it 
>>>>> does not work...
>>>> 
>>>> Well I did look at the code (but perhaps not deeply enough, and only the
>>>> released version of Jena). From that I got the feeling that one has to send
>>>> one whole RDF document down an input stream at a time.
>>>> 
>>>> If one cannot send chunks to the reader then essentially the thread that
>>>> calls the read(...) method above will block until the whole document is
>>>> read in. Even if an actor calls that method, the actor will then block the
>>>> thread that it is executing in until it is finished. So actors don't help
>>>> (unless there is some magic I don't know about). Now if the server serving
>>>> the document is serving it really slowly, at 56 baud, then one thread could
>>>> be used up even though it is doing very, very little work.
>>>> 
>>>> If on the other hand I could send partial pieces of XML documents down
>>>> different input streams at different times, then the NIO thread could call
>>>> the reader every time it received some data. For example in the code I was
>>>> writing here using the http-async-client:
>>>> https://gist.github.com/1701141
>>>> 
>>>> The method I have now on lines 39-42
>>>> 
>>>> def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
>>>>   bodyPart.writeTo(out)
>>>>   STATE.CONTINUE
>>>> }
>>>> 
>>>> 
>>>> could be changed to
>>>> 
>>>> def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
>>>>   reader.read(model,
>>>>     new ByteArrayInputStream(bodyPart.getBodyPartBytes()), base)
>>>>   STATE.CONTINUE
>>>> }
>>>> 
>>>> and so the body part would be consumed by the reader in chunks.
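Since each read(...) call expects a whole document, one way to keep a blocking
parser while still receiving data in chunks is to join the chunks through a
pipe and let a dedicated thread run a single parse. This is only a sketch under
assumptions: the JDK SAX parser stands in for the Jena reader,
PipedParseDemo and parseChunks are made-up names, and the loop over chunks
stands in for repeated onBodyPartReceived calls. It still ties up one thread
per document, so it is not a non-blocking parser.

```scala
import java.io.{PipedInputStream, PipedOutputStream}
import java.util.concurrent.atomic.AtomicInteger
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler

object PipedParseDemo {
  // Feeds the chunks into ONE blocking parse and returns the number of
  // XML elements the parser saw.
  def parseChunks(chunks: Seq[Array[Byte]]): Int = {
    val out = new PipedOutputStream()
    val in  = new PipedInputStream(out) // the parser reads what `out` receives
    val elements = new AtomicInteger(0)

    val handler = new DefaultHandler {
      override def startElement(uri: String, localName: String,
                                qName: String, atts: Attributes): Unit =
        elements.incrementAndGet()
    }

    // The blocking parse runs on its own thread and waits for more bytes.
    val parserThread = new Thread(new Runnable {
      def run(): Unit =
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler)
    })
    parserThread.start()

    // Stand-in for the network callback: write each chunk as it arrives.
    // Note that write() itself blocks when the pipe's buffer is full.
    try chunks.foreach(c => out.write(c))
    finally out.close() // signals end of document to the parser

    parserThread.join()
    elements.get()
  }

  def main(args: Array[String]): Unit = {
    val doc = "<a><b/><c/></a>".getBytes("UTF-8")
    // Split the document into 4-byte chunks, cutting through tags.
    println(parseChunks(doc.grouped(4).toSeq))
  }
}
```

The parser sees one continuous stream, so chunk boundaries in the middle of a
tag no longer matter; the cost is the parked thread that non-blocking parsing
would avoid.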
>>>> 
>>>>> 
>>>>>> 
>>>>>> There is also RIOT - have you looked at passing the read request to a 
>>>>>> parser in an actor, then catching the Sink<Triple> interface for the 
>>>>>> return -- that works in an actor style.
>>>>>> 
>>>>>> The key question is what Jena can enable, so that possibilities can be 
>>>>>> built on top.  I don't think Jena is a good level at which to pick one 
>>>>>> approach over another, as it is in danger of clashing with other choices 
>>>>>> in the application.  Your Akka is a good example of one possible choice.
>>>>>> 
>>>>>>> I did open the issue-203 so that when we agree on a solution we could
>>>>>>> send in some patches.
>>>>>> 
>>>>>> Look forward to seeing this,
>>>>>> 
>>>>>>  Andy
>>>>> 
>>>>> Social Web Architect
>>>>> http://bblfish.net/