OK, I got the asynchronous parser to work for RDF/XML. Details are in the bug report:
https://issues.apache.org/jira/browse/JENA-203

Henry

PS. I wonder why I can't find this thread in the archive
http://mail-archives.apache.org/mod_mbox/incubator-jena-users/201201.mbox/browser

On 30 Jan 2012, at 22:36, Henry Story wrote:
>
> On 30 Jan 2012, at 14:23, Andy Seaborne wrote:
>
>> Yes, but :-) that's without writing any kind of adaptor code.
>>
>> I was looking for a way to reuse the existing parser code. If you want to
>> start from scratch then it's a different ball game.
>>
>> There are two cases:
>>
>> RDF/XML: (Yuk) Jena uses an XML parser - the first point is finding a
>> suitable XML parser - the SAX interface means it might be possible to
>> adapt it to being a pipeline-based process.
>
> Well yes, the good thing about RDF/XML is that I think nobody cares about
> it anymore. :-)
> I am told this Apache-licensed parser is very good:
>
> https://github.com/FasterXML/aalto-xml
>
> How difficult would it be to use that?
>
>>
>> The rest: Turtle parsers are quite easy to write. In fact, the actual
>> parser isn't really the bulk of the work.
>>
>> The purest actor-style implementation needs to split out the parsing
>> phases: bytes to chars, chars to tokens, tokens to triples. Each of those
>> steps is a small state machine, but it looks a whole lot easier to write
>> them as separate FSMs. Even UTF-8 chars can be split across byte buffer
>> boundaries.
>
> I'm doing some research there.
>
>>
>> Practical points:
>>
>> 1/ For all the small documents (say, less than 50K) it might be simpler
>> to gather the bytes together and parse whole documents. Then devote a
>> thread to large documents - this assumes you get Content-Length. This
>> isn't as ideal as a complete rewrite but it's less work. Isn't thread
>> stack size the key determinant of space used?
>
> Yes, that's an ugly band-aid, but I'll use that in the meantime, as I
> would like to get more familiar with actor-based programming.
>
>> 2/ Have X threads (where X ~ # cores), and use an executor pool to batch
>> requests together.
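The bytes-to-chars stage mentioned above is subtler than it looks, precisely because a multi-byte UTF-8 character can arrive half in one buffer and half in the next. A minimal sketch of a chunk-fed decoder using the JDK's `CharsetDecoder` (the class name `ChunkedUtf8Decoder` and its `feed` method are illustrative, not Jena API):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: decoding UTF-8 that arrives in arbitrary chunks. A multi-byte
// character may straddle a chunk boundary; undecoded tail bytes are kept
// pending until the next chunk arrives.
public class ChunkedUtf8Decoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private ByteBuffer pending = ByteBuffer.allocate(0);

    /** Feed one chunk of bytes; returns the characters decodable so far. */
    public String feed(byte[] chunk) {
        // Prepend any bytes left over from the previous chunk.
        ByteBuffer in = ByteBuffer.allocate(pending.remaining() + chunk.length);
        in.put(pending).put(chunk);
        in.flip();
        CharBuffer out = CharBuffer.allocate(in.remaining() + 1);
        decoder.decode(in, out, false); // false: more input may follow
        // Save the undecoded tail (at most a few bytes of a split character).
        pending = ByteBuffer.allocate(in.remaining());
        pending.put(in);
        pending.flip();
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // "é" is 0xC3 0xA9 in UTF-8; split it across two chunks.
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        ChunkedUtf8Decoder d = new ChunkedUtf8Decoder();
        String first = d.feed(Arrays.copyOfRange(bytes, 0, 4)); // ends mid-character
        String second = d.feed(Arrays.copyOfRange(bytes, 4, bytes.length));
        System.out.println(first + "|" + second); // prints "caf|é"
    }
}
```

The same carry-over trick applies at the chars-to-tokens boundary, which is why each stage is easier to reason about as its own resumable state machine.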
>> The far end will start sending and it will be buffered in the low
>> levels. There aren't any extra CPU cycles to go round, so while it's
>> batch-y, it isn't going to go any faster with more active parsers.
>
> I think that without getting the parsers to be non-blocking, everything
> else is just going to be ugly and inefficient. Getting the parsers to be
> non-blocking will make everything else clean and seamless.
>
> For example, one could easily create a proxy for 1 GB of RDF files that
> uses only a few kilobytes of memory, by simply reading in triples and
> spitting them out in another format on the other end, before even the
> first document had finished parsing.
>
>>
>> I am still interested in the question of rendezvous, by the way - how
>> does the app want to be notified that parsing has finished, and does it
>> not touch the model during this time?
>>
>> Andy
>>
>> On 30/01/12 12:57, Henry Story wrote:
>>> So I wrote a gist that shows how one should be able to use the Jena
>>> parsers. It is here:
>>>
>>> https://gist.github.com/1704255
>>>
>>> But I get the exception:
>>>
>>> ERROR (WebFetcher.scala:59) : org.xml.sax.SAXParseException; systemId:
>>> http://bblfish.net/people/henry/card.rdf; lineNumber: 134; columnNumber:
>>> 44; XML document structures must start and end within the same entity.
>>> com.hp.hpl.jena.shared.JenaException: org.xml.sax.SAXParseException;
>>> systemId: http://bblfish.net/people/henry/card.rdf; lineNumber: 134;
>>> columnNumber: 44; XML document structures must start and end within the
>>> same entity.
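The low-memory proxy idea described above only requires the parser to emit each triple as soon as its closing delimiter arrives, holding nothing but the current partial statement. As a rough illustration (not Jena code; one triple per line, N-Triples style, and all names are made up for the sketch), a push-based splitter:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the streaming-proxy idea: a push-based splitter that emits each
// complete N-Triples line (one triple per line) as soon as its terminating
// newline arrives, keeping only the current partial line in memory.
public class TripleLineFeeder {
    private final StringBuilder partial = new StringBuilder();
    private final Consumer<String> sink;

    public TripleLineFeeder(Consumer<String> sink) { this.sink = sink; }

    /** Feed one chunk; complete lines are pushed to the sink immediately. */
    public void feed(String chunk) {
        for (int i = 0; i < chunk.length(); i++) {
            char c = chunk.charAt(i);
            if (c == '\n') {
                String line = partial.toString().trim();
                partial.setLength(0);
                if (!line.isEmpty()) sink.accept(line);
            } else {
                partial.append(c);
            }
        }
    }

    public static void main(String[] args) {
        List<String> triples = new ArrayList<>();
        TripleLineFeeder f = new TripleLineFeeder(triples::add);
        // The second triple is split across two network chunks.
        f.feed("<s> <p> <o> .\n<s> <p> ");
        f.feed("<o2> .\n");
        System.out.println(triples.size()); // prints 2
    }
}
```

Memory use is bounded by the longest single triple, not the document size, which is what makes the 1 GB proxy feasible in a few kilobytes.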
>>>   at com.hp.hpl.jena.rdf.model.impl.RDFDefaultErrorHandler.fatalError(RDFDefaultErrorHandler.java:60)
>>>   at com.hp.hpl.jena.rdf.arp.impl.ARPSaxErrorHandler.fatalError(ARPSaxErrorHandler.java:51)
>>>   at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.warning(XMLHandler.java:211)
>>>   at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.fatalError(XMLHandler.java:241)
>>>   at o
>>>
>>> As expected, because one cannot pass partial documents to the reader.
>>>
>>> Henry
>>>
>>>
>>> On 29 Jan 2012, at 23:52, Henry Story wrote:
>>>
>>>>
>>>> On 29 Jan 2012, at 23:28, Henry Story wrote:
>>>>
>>>>>
>>>>> On 29 Jan 2012, at 23:04, Andy Seaborne wrote:
>>>>>
>>>>>> Hi Henry,
>>>>>>
>>>>>> On 29/01/12 21:40, Henry Story wrote:
>>>>>>> [ I just opened a bug report for this, but it was suggested that a
>>>>>>> wider discussion on how to do it would be useful on this list. ]
>>>>>>
>>>>>> The thread of interest is:
>>>>>>
>>>>>> http://www.mail-archive.com/[email protected]/msg02451.html
>>>>>>
>>>>>>> Unless I am mistaken, the only way to parse some content is using
>>>>>>> methods that take an InputStream, such as this:
>>>>>>>
>>>>>>> val m = ModelFactory.createDefaultModel()
>>>>>>> m.getReader(lang.jenaLang).read(m, in, base.toString)
>>>>>>
>>>>>> As already commented on the thread, passing the reader to an actor
>>>>>> allows async reading. Readers are configurable - you can have
>>>>>> anything you like. No reason why the RDFReader can't be using async
>>>>>> NIO.
>>>>>
>>>>> Mhh, can I call at time t1
>>>>>
>>>>> reader.read(model, inputStream, base)
>>>>>
>>>>> with an inputStream that only contains a chunk of the data? And then
>>>>> call it again later with a newly filled input stream that contains
>>>>> the next segment of the data?
>>>>>
>>>>> reader.read(model, inputStream2, base)
>>>>>
>>>>> The documentation says nothing about that, so I just assumed it does
>>>>> not work...
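Absent a chunk-accepting `read()`, the band-aid discussed earlier in the thread is to devote a thread to the blocking reader and push network chunks to it through a pipe. A rough sketch using `java.io` piped streams, with a stream-slurping stand-in for the real RDF reader (all names here are illustrative, not Jena or async-client API):

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the "devote a thread" work-around: a blocking, pull-based parser
// runs on its own thread reading from a PipedInputStream, while the network
// callback pushes chunks into the connected PipedOutputStream. The parser
// blocks cheaply between chunks; the cost is one thread per active parse.
public class PipedParseAdapter {
    public static void main(String[] args) throws Exception {
        PipedOutputStream push = new PipedOutputStream();
        PipedInputStream pull = new PipedInputStream(push, 8192);

        AtomicReference<String> result = new AtomicReference<>();
        Thread parser = new Thread(() -> {
            try {
                // Stand-in for reader.read(model, pull, base): consume the
                // whole stream, blocking until the writer closes it.
                result.set(new String(pull.readAllBytes(), StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
        parser.start();

        // Simulate network callbacks delivering body parts.
        push.write("<s> <p> ".getBytes(StandardCharsets.UTF_8));
        push.write("<o> .\n".getBytes(StandardCharsets.UTF_8));
        push.close(); // end of document: unblocks the parser thread
        parser.join();
        System.out.println(result.get().trim());
    }
}
```

This is exactly the "ugly band-aid" trade-off: correctness without rewriting the parser, at the price of a blocked thread per slow connection.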
>>>>
>>>> Well, I did look at the code (but perhaps not deeply enough, and only
>>>> the released version of Jena). From that I got the feeling that one
>>>> has to send one whole RDF document down an input stream at a time.
>>>>
>>>> If one cannot send chunks to the reader, then essentially the thread
>>>> that calls the read(...) method above will block until the whole
>>>> document is read in. Even if an actor calls that method, the actor
>>>> will then block the thread it is executing in until it is finished. So
>>>> actors don't help (unless there is some magic I don't know about). Now
>>>> if the server serving the document is serving it at 56 baud, really
>>>> slowly, then one thread would be used up even though it is doing very,
>>>> very little work.
>>>>
>>>> If on the other hand I could send partial pieces of XML documents down
>>>> different input streams at different times, then the NIO thread could
>>>> call the reader every time it received some data. For example, in the
>>>> code I was writing here using async-http-client
>>>> https://gist.github.com/1701141
>>>>
>>>> the method I have now on lines 39-42
>>>>
>>>> def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
>>>>   bodyPart.writeTo(out)
>>>>   STATE.CONTINUE
>>>> }
>>>>
>>>> could be changed to
>>>>
>>>> def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
>>>>   reader.read(model, new ByteArrayInputStream(bodyPart.getBodyPartBytes()), base)
>>>>   STATE.CONTINUE
>>>> }
>>>>
>>>> and so the body parts would be consumed by the reader in chunks.
>>>>
>>>>>
>>>>>>
>>>>>> There is also RIOT - have you looked at passing the read request to
>>>>>> a parser in an actor, then catching the Sink<Triple> interface for
>>>>>> the return? That works in an actor style.
>>>>>>
>>>>>> The key question is what Jena can enable, so that possibilities can
>>>>>> be built on top.
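The Sink-style approach mentioned above also answers the rendezvous question raised earlier in the thread: the parser pushes each triple to the sink as it is produced, and closing the sink signals that parsing has finished. A minimal analogue in plain Java (the interface shape here only mirrors RIOT's Sink; the code is illustrative, not Jena API):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a Sink-style rendezvous: the parser calls send() once per
// parsed triple and close() when it is done, so the consumer learns of
// completion without polling or blocking on the model.
public class SinkDemo {
    interface Sink<T> {
        void send(T item); // called once per parsed item
        void close();      // called when the parser has finished
    }

    static class CountingSink<T> implements Sink<T> {
        final AtomicLong count = new AtomicLong();
        volatile boolean closed = false;
        public void send(T item) { count.incrementAndGet(); }
        public void close() { closed = true; }
    }

    public static void main(String[] args) {
        CountingSink<String> sink = new CountingSink<>();
        // Simulate a parser emitting triples, then finishing.
        sink.send("<s> <p> <o> .");
        sink.send("<s> <p> <o2> .");
        sink.close();
        System.out.println(sink.count.get() + " " + sink.closed); // prints "2 true"
    }
}
```

Because the consumer never touches a shared model during parsing, the "does the app touch the model during this time?" concern goes away: all results flow through the sink until close().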
>>>>>> I don't think Jena is a good level at which to pick one approach
>>>>>> over another, as it is in danger of clashing with other choices in
>>>>>> the application. Your Akka-based approach is a good example of one
>>>>>> possible choice.
>>>>>>
>>>>>>> I did open issue JENA-203 so that when we agree on a solution we
>>>>>>> could send in some patches.
>>>>>>
>>>>>> Look forward to seeing this,
>>>>>>
>>>>>> Andy
>>>>>
>>>>> Social Web Architect
>>>>> http://bblfish.net/
>>>>
>>>
>>
>

Social Web Architect
http://bblfish.net/
