Polk, John R. wrote:
Aleksander,
I found a post similar to a problem I am having with your name
associated with it. I was wondering if you found a clean solution to
your issue. I am writing a client application that is trying to parse
multiple XML responses over the same socket connection. I have a
problem parsing the second response because it starts with
"<?xml...". I was hoping to be able to cleanly reset the parser
between response messages, but I have not succeeded. Do you have any
suggestions?
hi John,
(i CCed [email protected] as the same question was raised - [1]
below)
it looks to me that the key thing to notice is that your input has two
layers: first there are high-level markers (<?xml version="1.0"
encoding="UTF-8"?>) that separate the actual XML documents. this
higher-level layer is *not* XML, so it should be processed without an
XML parser. (it is good that <?xml is reserved and not used except
inside <![CDATA[ sections; to be 100% correct you would need to scan
for <![CDATA[ too, as CDATA can contain anything:
http://www.w3.org/TR/REC-xml/#sec-cdata-sect)
if you have control over the format (it seems you do not?) i would
suggest using something similar to HTTP chunked encoding: you write the
size of every output chunk before writing it and mark one chunk as the
last in the chain (size=-1), so you know when a document is finished and
so on. this would lead to much more efficient streaming (and allows you
to send metadata about files if you put it beside the chunk sizes), as
no string pattern matching is needed and there is no worry about CDATA
because the layering is completely transparent (using <?xml ...?> as a
separator is not very good without XML-independent markers ...)
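
To make the chunked framing idea concrete, here is a minimal Java sketch. It only applies if you control both ends of the connection. The class name ChunkedFraming, the method names and the 4-byte big-endian size prefix are assumptions made for illustration, not part of any existing protocol or library; only the size=-1 end marker comes from the description above.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical framing helper: each chunk is preceded by its size as a
// 4-byte int, and a size of -1 marks the end of one document.
final class ChunkedFraming {
    static final int END_OF_DOCUMENT = -1;

    // Writes one document as a chain of size-prefixed chunks.
    static void writeDocument(DataOutputStream out, byte[] doc, int chunkSize)
            throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        for (int off = 0; off < doc.length; off += chunkSize) {
            int len = Math.min(chunkSize, doc.length - off);
            out.writeInt(len);
            out.write(doc, off, len);
        }
        out.writeInt(END_OF_DOCUMENT);
        out.flush();
    }

    // Reads one document; returns null if the stream ended before a new
    // document started.
    static byte[] readDocument(DataInputStream in) throws IOException {
        int len;
        try {
            len = in.readInt();
        } catch (EOFException e) {
            return null;
        }
        ByteArrayOutputStream doc = new ByteArrayOutputStream();
        while (len != END_OF_DOCUMENT) {
            if (len < 0) {
                throw new IOException("bad chunk size: " + len);
            }
            byte[] chunk = new byte[len];
            in.readFully(chunk); // blocks until the whole chunk has arrived
            doc.write(chunk, 0, len);
            len = in.readInt();
        }
        return doc.toByteArray();
    }
}
```

The bytes returned by readDocument can be wrapped in a ByteArrayInputStream and handed to the parser. Because the reader never looks inside the payload, a <?xml ...?> string inside CDATA is harmless here.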
for the particular enveloping scheme you have (using <?xml ... as
marker): a simple solution i would use is to create a composite reader
(or input stream) that is buffered. the first thing it does is set a
mark (start internal buffering); then it scans its input for the next
'<?xml version="1.0" encoding="UTF-8"?>\n<root' (as you know it is the
marker for the 2nd doc). it then creates a new reader that only allows
reading the input up to the beginning of the 2nd document or EOF (so it
is the content of the 1st doc) and passes that reader to the xml parser.
then the process is repeated: a new mark is set and the input is scanned
for the next <?xml...?>, which is the beginning of the 3rd document (or
EOF).
that is it - it should be fairly efficient, as the key to IO performance
is to read data in chunks and avoid copying. here not much is done, but a
more advanced version could actually work as a streaming pipeline and
hook into MultiXmlCompositeReader.read(...) to scan for the
end-of-document marker (or EOF). then there is no memory overhead, as
only chunks need to be buffered and not the whole document (possibly
several read()s are needed to discover the marker, as one read() may
only get part of it). one buffering is done in MultiXmlCompositeReader
and another in the xml parser, but that is hard to avoid.
in pseudo code:

    InputStream in = ...; // e.g. the socket's input stream
    MultiXmlCompositeReader mr = new MultiXmlCompositeReader(
        new InputStreamReader(in, "UTF-8"));
    Reader r;
    // each reader ends where the next document (or EOF) begins
    while ((r = mr.nextDocumentReader()) != null) {
        xmlparser.setInput(r);
        xmlparser.parse() ...
    }
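
One possible way to fill in the MultiXmlCompositeReader used in the pseudo code, as a rough sketch only: it buffers each document in memory and splits on the string "<?xml " rather than on the full '<?xml ...?>\n<root' marker. Both simplifications are assumptions of this example. It does not skip CDATA sections or comments, so a "<?xml " inside one of those would be treated as a document boundary.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Sketch of the composite reader: splits one character stream into
// per-document readers at each "<?xml " marker.
class MultiXmlCompositeReader {
    // Assumption: every document starts with an XML declaration and the
    // declaration uses a plain space after "<?xml".
    private static final String MARKER = "<?xml ";

    private final Reader in;
    private final StringBuilder buf = new StringBuilder();
    private final char[] chunk = new char[4096];
    private boolean eof = false;

    MultiXmlCompositeReader(Reader in) {
        this.in = in;
    }

    // Returns a reader over the next document, or null when the input is
    // exhausted. Blocks until the next marker (or EOF) has been seen.
    Reader nextDocumentReader() throws IOException {
        while (true) {
            // Search from index 1 so the marker that opens the current
            // document is not mistaken for the start of the next one.
            int idx = buf.indexOf(MARKER, 1);
            if (idx >= 0) {
                String doc = buf.substring(0, idx);
                buf.delete(0, idx);
                if (!doc.trim().isEmpty()) {
                    return new StringReader(doc);
                }
                continue; // only whitespace before the first marker
            }
            if (eof) {
                String doc = buf.toString();
                buf.setLength(0);
                return doc.trim().isEmpty() ? null : new StringReader(doc);
            }
            int n = in.read(chunk); // read in chunks, not char by char
            if (n < 0) {
                eof = true;
            } else {
                buf.append(chunk, 0, n);
            }
        }
    }
}
```

Note the consequence for a socket: a document is only handed to the parser once the start of the following document (or EOF) has arrived.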
you should still add CDATA scanning to make it completely correct.
however, if you are concerned about correctness and want to handle
everything in one xml parser stream, i would instead write
MultiXmlCompositeReader to actually transform the stream as follows:
ORIG:
<?xml version="1.0" encoding="UTF-8"?>
<root/>
<?xml version="1.0" encoding="UTF-8"?>
<root2/>
TRANSFORMED:
<super-root>
<root/>
<root2/>
</super-root>
i.e. add a wrapper XML element (<super-root>) and remove every <?xml...?>
when you see '<?xml...?>\n<root'.
this can also be done as a streaming reader/filter with careful coding
(especially if the XML content is signed and you want to make sure that
CDATA content with <?xml ...?> is not modified ....)
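
A deliberately naive, non-streaming sketch of that transformation in Java follows. The class name SuperRootWrapper and the regular expression are assumptions of this example. It works on text that is already fully in memory and does *not* protect CDATA sections or comments, so it shows the idea only, not the careful streaming filter described above.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips XML declarations from a concatenation of
// documents and wraps the remainder in a single <super-root> element.
final class SuperRootWrapper {
    // Matches "<?xml" followed by whitespace up to the first "?>", so
    // processing instructions such as <?xml-stylesheet ...?> are left alone.
    private static final Pattern XML_DECL =
        Pattern.compile("<\\?xml\\s.*?\\?>", Pattern.DOTALL);

    static String wrap(String concatenated) {
        String body = XML_DECL.matcher(concatenated).replaceAll("");
        return "<super-root>" + body + "</super-root>";
    }
}
```

The result can be parsed as one document, and each child of <super-root> then corresponds to one of the original documents.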
HTH
alek
[1] Massimo Valla wrote:
Hi Michael.
Thank you for your reply. I definitely agree on your point. The
protocol is awful. But, unfortunately I cannot change the server side
nor the protocol. I could assume that each document ends when the root
tag is closed. So your example could be parsed and received as two
documents:
1st doc:
<?xml version="1.0" encoding="UTF-8"?>
<root/>
2nd doc
<?xml version="1.0" encoding="UTF-8"?>
<root2/>
leaving out the comment as not belonging to either of the two docs.
The problem is that with Xerces as soon as I receive the first end tag
SAX notification, the parser has already buffered part of the other
XML message, so starting another parse command on the inputstream will
not work.
How can I set up a solution similar to FAQ-11 (of Xerces1) in Xerces2?
More generally, how can I write a client with Xerces that is able to
parse multiple XML documents coming from the socket?
(I have also tried other parsers: they allow char-by-char parsing and
they would not close the inputstream after a parse error, so I would
be fine using them. But I would very much prefer to stay with Xerces
as it is the parser used in Java 1.5...)
Thanks a lot,
Massimo
On 2/12/06, Michael Glavassevich <[EMAIL PROTECTED]> wrote:
> Hi Massimo,
>
> The KeepSocketOpen sample works because the server socket tells the
> client how many bytes there are in the document. If the server has no
> protocol for communicating the boundaries between XML documents, how
> can you tell where one begins and another ends?
>
> Consider if your client receives this from the socket:
>
> <root/>
> <!-- comment -->
> <root2/>
>
> How would you know whether you've received two documents or one not
> well-formed document containing multiple root elements? And if this is
> processed as two documents does the comment belong to the first or the
> second? Only the sender could know that.
>
> Thanks.
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: [EMAIL PROTECTED]
>
> Massimo Valla <[EMAIL PROTECTED]> wrote on 02/04/2006 09:54:25 PM:
>
> > Hi,
> > I am trying to read multiple XML files from a socket using JAXP 1.3
> > / Xerces-J 2.7.1.
> >
> > Unfortunately the KeepSocketOpen example in the Xerces2 Socket Sample
> > (http://xerces.apache.org/xerces2-j/samples-socket.html) does not
> > work for me, because I have no control over the other side of the
> > socket.
> >
> > Also FAQ-11 of Xerces1
> > (http://xerces.apache.org/xerces-j/faq-write.html#faq-11) does not
> > help anymore, because the StreamingCharFactory class used there to
> > prevent buffering cannot be used in Xerces2 (cannot compile the
> > class).
> >
> > I have been trying to find a solution to this for a while now, but I
> > have not been able to find one.
> >
> > Can anybody provide a simple example on how to read multiple XML
> > docs from a socket InputStream?
> >
> > Thanks a lot,
> > Massimo
>
--
The best way to predict the future is to invent it - Alan Kay