Well, I gave it a try, and reviewing what Vysper actually did makes it seem a lot more manageable. There really are only a handful of cases: some are simply ignored (comment, PI, doctype) and others are just text handling (CDATA, text). The more complicated one is the element tag, which has several sub-states (element name, attribute name, attribute value). But Vysper ignores the element-tag sub-states and simply waits until the whole tag has arrived before parsing the name and attributes (<el attr="attr"> or <el attr="attr"/>), which is a good first implementation and can easily be enhanced later.
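
To make the "wait for the whole tag" idea concrete, here is a minimal sketch in Java. It is not Vysper's code: the class name `ChunkedTagScanner`, the string-encoded events, and the attribute regex are all made up for illustration. It also assumes that `>` never appears inside an attribute value and that a doctype has no internal subset, and it does no entity decoding.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: buffers input until a whole token is available, then handles it. */
final class ChunkedTagScanner {
    // Constructs with their own terminators; anything else starting with '<' ends at '>'.
    private static final String[][] DELIMS = {
        {"<!--", "-->"}, {"<![CDATA[", "]]>"}, {"<?", "?>"}
    };
    private static final Pattern ATTR =
        Pattern.compile("([^\\s=]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    private final StringBuilder pending = new StringBuilder();

    /** Feed one chunk (for example one network read); returns the events it completed. */
    List<String> feed(String chunk) {
        pending.append(chunk);
        List<String> events = new ArrayList<>();
        while (pending.length() > 0) {
            String buf = pending.toString();
            if (buf.charAt(0) != '<') {
                // Text may be reported in pieces, which SAX-style listeners must accept anyway.
                int lt = buf.indexOf('<');
                String text = lt < 0 ? buf : buf.substring(0, lt);
                events.add("text:" + text);
                pending.delete(0, text.length());
                continue;
            }
            String open = "<";
            String close = ">";
            boolean undecided = false;
            for (String[] d : DELIMS) {
                if (buf.startsWith(d[0])) { open = d[0]; close = d[1]; undecided = false; break; }
                if (d[0].startsWith(buf)) undecided = true; // e.g. "<!" could still become "<!--"
            }
            if (undecided) break;                  // too short to classify: wait for more data
            int end = buf.indexOf(close, open.length());
            if (end < 0) break;                    // token incomplete: keep it buffered
            String token = buf.substring(0, end + close.length());
            pending.delete(0, token.length());
            if (open.equals("<![CDATA[")) {
                events.add("text:" + token.substring(open.length(), end));
            } else if (open.equals("<") && !token.startsWith("<!")) {
                events.add(parseTag(token));
            }
            // Comments, processing instructions and doctype are dropped.
        }
        return events;
    }

    /** Parses a complete start, end or empty-element tag into a string event. */
    private static String parseTag(String token) {
        String body = token.substring(1, token.length() - 1).trim();
        if (body.startsWith("/")) return "end:" + body.substring(1).trim();
        boolean selfClosing = body.endsWith("/");
        if (selfClosing) body = body.substring(0, body.length() - 1).trim();
        int ws = 0;
        while (ws < body.length() && !Character.isWhitespace(body.charAt(ws))) ws++;
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(body.substring(ws));
        while (m.find()) attrs.put(m.group(1), m.group(2) != null ? m.group(2) : m.group(3));
        return (selfClosing ? "empty:" : "start:") + body.substring(0, ws) + attrs;
    }
}
```

Feeding `<el at` returns nothing, and the start event for `el` only appears once a later chunk delivers the closing `>`. Tracking the element-tag sub-states would remove the need to rescan the buffered tag, at the cost of a larger state machine.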

I wrote up a draft version of a SAX parser for Mina last week, which I think is not a bad representation of what I'm thinking, since a SAX parser is free to call back to a listener as it sees fit. I was then thinking we could create another codec/processor with various options for converting the SAX event stream into a DOM event stream, since some applications want a full document (only using it for NIO parsing), while other applications want an unbounded stream of DOM elements (like Vysper/XMPP).
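
A rough sketch of what such a SAX-to-DOM processor could look like, using only the standard `org.xml.sax` and `org.w3c.dom` APIs. This is not the draft mentioned above: the class name `StanzaBuilder` and the `emitDepth` option are invented for illustration. With `emitDepth` 0 it hands over the whole document element at the end; with 1 it hands over each child of the root as soon as that child closes, which is the XMPP-style unbounded stream.

```java
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: turns SAX callbacks into DOM elements emitted at a chosen depth. */
final class StanzaBuilder extends DefaultHandler {
    private final Document doc;          // factory for nodes only; nothing is attached to it
    private final int emitDepth;         // 0 = whole document, 1 = each child of the root
    private final Consumer<Element> sink;
    private Node current;                // element being built, or null outside a subtree
    private int depth;

    StanzaBuilder(int emitDepth, Consumer<Element> sink) throws ParserConfigurationException {
        this.doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        this.emitDepth = emitDepth;
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth >= emitDepth) {
            Element e = doc.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                e.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            if (current != null) current.appendChild(e);
            current = e;
        }
        depth++;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
        if (current == null) return;     // closing an element above the emit depth
        if (depth == emitDepth) {
            sink.accept((Element) current);   // subtree complete: hand it to the listener
            current = null;
        } else {
            current = current.getParentNode();
        }
    }
}
```

The handler does not care who produces the events, so it can sit behind a stock `SAXParser` for testing and behind a non-blocking SAX-style parser later. The sketch assumes a non-namespace-aware event source, so it relies on `qName` being populated.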

Not sure where to put up the code to get comments... maybe I should learn GitHub. :)



On 10/19/09 10:39 PM, Ashish wrote:
Actually, this very problem was the one I discussed with Bernd last spring
during ApacheCon, as I was looking for an XML parser supporting stops in the
middle of an XML tag. We need an XML parser that supports this kind of
partial data, and can recover from it. Not simple ...


Mine was partially working, though I didn't test it for all the
use cases.
I had tried both approaches. The first was to extend an external
parser to support this. It worked for simple cases.
The second was a bit of a dumb solution, but worked fine: manually
look for the start and end (root elements) of the XML.
Once a complete XML document is received, slice the buffer and pass it
to a full-blown parser to do the actual XML parsing. It kept life
really simple.
However, the problem was less than solved, as I was unable to handle
misbehaving clients, such as ones that never send the end element and
start a new XML document. Though rare, the implementation has to be
robust enough to deal with them.

I will see if I still have the code :-(

A straight out-of-the-box solution won't work, as a TCP packet can
contain the end of one XML document and the start of the next one :-)
This was the reason why I opted for the dumb approach. Otherwise we
have to make our parser slice off the complete XML and leave the
unfinished data in the buffer. This is where the real challenge lies.
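
A minimal Java sketch of that slicing, for illustration only and not the code
mentioned above: the class name XmlDocumentFramer and the maxBuffered limit
are invented here. It counts element depth, cuts the buffer whenever depth
returns to zero, and leaves any trailing partial data buffered for the next
packet. It assumes '<' and '>' do not appear inside comments, CDATA or
attribute values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: slices complete root elements out of a stream of chunks. */
final class XmlDocumentFramer {
    private final StringBuilder buffer = new StringBuilder();
    private final int maxBuffered;   // assumed guard against clients that never close the root
    private int scanned;             // buffer index up to which characters were examined
    private int depth;               // number of currently open elements
    private int tagStart = -1;       // index of the '<' of the tag being read, or -1

    XmlDocumentFramer(int maxBuffered) {
        this.maxBuffered = maxBuffered;
    }

    /** Feed one packet; returns every document that became complete. */
    List<String> feed(String chunk) {
        buffer.append(chunk);
        List<String> docs = new ArrayList<>();
        while (scanned < buffer.length()) {
            char c = buffer.charAt(scanned++);
            if (c == '<') {
                tagStart = scanned - 1;
            } else if (c == '>' && tagStart >= 0) {
                char first = buffer.charAt(tagStart + 1);
                char last = buffer.charAt(scanned - 2);
                tagStart = -1;
                if (first == '?' || first == '!') continue; // declaration, PI, comment, doctype
                if (first == '/') depth--;
                else if (last != '/') depth++;
                if (depth < 0) throw new IllegalStateException("end tag without start tag");
                if (depth == 0) {
                    // Root element closed: slice it off and keep the remainder buffered.
                    docs.add(buffer.substring(0, scanned).trim());
                    buffer.delete(0, scanned);
                    scanned = 0;
                }
            }
        }
        if (buffer.length() > maxBuffered) {
            throw new IllegalStateException("unfinished document exceeds " + maxBuffered + " chars");
        }
        return docs;
    }
}
```

Each returned slice can then be handed to a full-blown parser. The size limit
only bounds the damage from a client that never sends the end element; it
cannot tell a slow client from a misbehaving one, so a timeout would probably
be needed as well.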

What I was thinking was to reduce the two passes to one. Modify the
XML parser to work on packets or on a pure stream. The packet approach
would be more challenging. Parse the packet, keep the XML tree, and
when the tree is complete, return it. Or pass the packets on to the
parser and let it parse. Catch the incomplete XML/data exception and
store the data in memory or on the file system. Once the XML is
complete, get the XML and slice the stream.

Have to stop here else it shall become an essay :-)

Good Luck

