Re: How to read multiple XML from socket: cannot change the protocol (Re: How to handle continuous stream of XML)

Aleksander Slominski Tue, 21 Feb 2006 09:38:19 -0800

Polk, John R. wrote:

Aleksander,
I found a post similar to a problem I am having with your name associated with it. I was wondering if you found a clean solution to your issue. I am writing a client application that is trying to parse multiple XML responses over the same socket connection. I have a problem parsing the second response because it starts with "<?xml...". I was hoping to be able to cleanly reset the parser between response messages, but I have not succeeded. Do you have any suggestions?


hi John,

(i CCed [email protected] as the same question was raised there - see [1] below)

it looks to me that the key thing to notice is that your input has two layers. first there are high-order markers (<?xml version="1.0" encoding="UTF-8"?>) that separate the actual XML documents. this higher-level layer is *not* XML, so it should be processed without an XML parser. it is good that <?xml is reserved and not used except inside <![CDATA[ sections; to be 100% correct you would need to scan for <![CDATA[ too, as CDATA can contain anything (http://www.w3.org/TR/REC-xml/#sec-cdata-sect).
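
The marker scan described above could look roughly like the following sketch. The class and method names (XmlBoundaryScanner, findDocumentStarts) are hypothetical and exist only for illustration, and the sketch works on a String already held in memory rather than on a live socket. It reports every "<?xml" declaration that is not inside a CDATA section; other places where "<?xml" might legally appear, such as comments, are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds where documents start in a concatenated XML stream. */
final class XmlBoundaryScanner {
    private static final String DECL = "<?xml";
    private static final String CDATA_START = "<![CDATA[";
    private static final String CDATA_END = "]]>";

    /** Returns the offsets of XML declarations that lie outside any CDATA section. */
    static List<Integer> findDocumentStarts(String text) {
        List<Integer> starts = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith(CDATA_START, i)) {
                // Skip the whole CDATA section: its content is opaque.
                int end = text.indexOf(CDATA_END, i + CDATA_START.length());
                if (end < 0) {
                    break; // unterminated CDATA: nothing more can be trusted
                }
                i = end + CDATA_END.length();
            } else if (isDeclarationAt(text, i)) {
                starts.add(i);
                i += DECL.length();
            } else {
                i++;
            }
        }
        return starts;
    }

    /** True for "<?xml" followed by whitespace, so "<?xml-stylesheet" is not matched. */
    private static boolean isDeclarationAt(String text, int i) {
        int after = i + DECL.length();
        return text.startsWith(DECL, i)
                && after < text.length()
                && Character.isWhitespace(text.charAt(after));
    }
}
```

Each returned offset is the start of one document; the text between two consecutive offsets can then be handed to the XML parser on its own.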

if you have control over the format (it seems you do not?) i would suggest using something similar to HTTP chunked encoding: write the size of every output chunk before writing it, and mark one chunk as the last in the chain (size=-1) so you know when a document is finished, and so on. this would lead to much more efficient streaming (and allows you to send metadata about files if you put it beside the chunk sizes), as no string pattern matching is needed and there is no worry about CDATA because the layering is completely transparent (using <?xml ...?> as a separator is not very good without XML-independent markers ...)
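
A minimal sketch of such size-prefixed framing, assuming both ends of the connection can be changed. The class ChunkedFraming, its methods, and the wire layout (a 4-byte big-endian size before each chunk, -1 as end-of-document marker) are assumptions made for this illustration; this is not HTTP chunked encoding itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical framing layer: each chunk is preceded by its size, -1 ends a document. */
final class ChunkedFraming {
    static final int LAST = -1;

    /** Writes one document as a single sized chunk followed by the end marker. */
    static void writeDocument(OutputStream out, String xml) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        data.writeInt(bytes.length);
        data.write(bytes);
        data.writeInt(LAST);
        data.flush();
    }

    /** Reads chunks up to the end marker; returns null if the stream ended cleanly. */
    static String readDocument(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int size;
        try {
            size = data.readInt();
        } catch (EOFException e) {
            return null; // no more documents
        }
        ByteArrayOutputStream doc = new ByteArrayOutputStream();
        while (size != LAST) {
            if (size < 0) {
                throw new IOException("invalid chunk size: " + size);
            }
            byte[] chunk = new byte[size];
            data.readFully(chunk); // blocks until the whole chunk has arrived
            doc.write(chunk, 0, chunk.length);
            size = data.readInt();
        }
        return new String(doc.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Because the reader never looks inside the payload, a document may contain "<?xml" inside CDATA or anywhere else without confusing the framing.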

for the particular enveloping scheme you have (using <?xml ... as marker), a simple solution i would use is to create a composite reader (or input stream) that is buffered. the first thing it does is set a mark (start internal buffering); then it scans its input for the next '<?xml version="1.0" encoding="UTF-8"?>\n<root' (as you know, it is the marker for the 2nd doc). it then creates a new reader that only allows reading the input up to the beginning of the 2nd document or EOF (so it holds the content of the 1st doc) and passes that reader to the xml parser. then the process is repeated: a new mark is set and the input is scanned for the next <?xml...?>, which is the beginning of the 3rd document (or EOF).

that is it - it should be fairly efficient, as the key to IO performance is to read data in chunks and avoid copying. here not much is done, but a more advanced version could work as a streaming pipeline and hook into MultiXmlCompositeReader.read(...) to scan for the end-of-document marker (or EOF). then there is no memory overhead, as only chunks need to be buffered and not the whole document (several read()s may be needed to discover the marker, as a single read() may return only part of it). one level of buffering is done in MultiXmlCompositeReader and another in the xml parser, but that is hard to avoid.



in pseudo code
   InputStream in = ...
   MultiXmlCompositeReader mr =
       new MultiXmlCompositeReader(new InputStreamReader(in, "UTF8"));

   Reader r;
   while ((r = mr.nextDocumentReader()) != null) {
     xmlparser.setInput(r);
     xmlparser.parse() ...
   }

you should still add CDATA scanning to make it completely correct.
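
One possible shape for the composite reader, as a sketch only. The names MultiXmlCompositeReader and nextDocumentReader() come from the pseudo code above; everything inside is an assumption for illustration. This is the simple variant: it buffers one whole document at a time instead of streaming chunks, it uses the shortened marker "<?xml " rather than the full declaration, and it does not yet skip CDATA sections.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Sketch: splits a character stream into documents at each "<?xml " marker. */
final class MultiXmlCompositeReader {
    private static final String MARKER = "<?xml "; // simplified marker (assumption)
    private final Reader in;
    private final StringBuilder pending = new StringBuilder(); // read but not yet returned
    private final char[] buf = new char[4096];
    private boolean eof;

    MultiXmlCompositeReader(Reader in) {
        this.in = in;
    }

    /** Returns a reader over the next document, or null when the input is exhausted. */
    Reader nextDocumentReader() throws IOException {
        int from = 1; // start at 1 so this document's own declaration is not matched
        while (true) {
            int next = pending.indexOf(MARKER, from);
            if (next >= 0) {
                return take(next); // everything before the next marker is one document
            }
            if (eof) {
                break;
            }
            // Re-scan only the tail, in case a marker is split across two reads.
            from = Math.max(1, pending.length() - MARKER.length() + 1);
            int n = in.read(buf);
            if (n < 0) {
                eof = true;
            } else {
                pending.append(buf, 0, n);
            }
        }
        if (pending.toString().trim().isEmpty()) {
            pending.setLength(0);
            return null; // only whitespace left
        }
        return take(pending.length()); // last document, terminated by EOF
    }

    private Reader take(int end) {
        String doc = pending.substring(0, end);
        pending.delete(0, end);
        return new StringReader(doc);
    }
}
```

The loop in the pseudo code can drive this class unchanged. Since the reader returned for each document ends at the marker, the parser sees a normal end of input and cannot read ahead into the next document.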

however if you are concerned about correctness and if you want to handle it all in one xml parser stream, i would instead write MultiXmlCompositeReader to actually transform the stream as follows:


ORIG:
  <?xml version="1.0" encoding="UTF-8"?>
  <root/>
  <?xml version="1.0" encoding="UTF-8"?>
  <root2/>

TRANSFORMED:
  <super-root>
  <root/>
  <root2/>
  </super-root>

i.e. add a wrapper XML element (<super-root>) and remove every <?xml...?> when you see '<?xml...?>\n<root'

this can also be done as a streaming reader/filter with careful coding (especially if the XML content is signed and you want to make sure that CDATA content with <?xml ...?> is not modified ...)
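
To make the transformation concrete, here is a sketch that works on text already held in a String. The class name SuperRootWrapper and the method wrap are hypothetical. A real streaming filter would additionally have to cope with markers split across reads and would write the closing tag only when the connection ends. CDATA sections are copied through unmodified, so a "<?xml ...?>" inside them is preserved.

```java
/** Hypothetical helper: wraps concatenated documents in one <super-root> element. */
final class SuperRootWrapper {
    private static final String CDATA_START = "<![CDATA[";
    private static final String CDATA_END = "]]>";

    /** Drops XML declarations outside CDATA and encloses the rest in <super-root>. */
    static String wrap(String stream) {
        StringBuilder out = new StringBuilder("<super-root>\n");
        int i = 0;
        while (i < stream.length()) {
            if (stream.startsWith(CDATA_START, i)) {
                // Copy the CDATA section verbatim, including any "<?xml" inside it.
                int end = stream.indexOf(CDATA_END, i);
                int stop = end < 0 ? stream.length() : end + CDATA_END.length();
                out.append(stream, i, stop);
                i = stop;
            } else if (stream.startsWith("<?xml ", i)) {
                // Skip the declaration up to and including its closing "?>".
                int end = stream.indexOf("?>", i);
                i = end < 0 ? stream.length() : end + 2;
            } else {
                out.append(stream.charAt(i++));
            }
        }
        return out.append("\n</super-root>").toString();
    }
}
```

The result is a single well-formed document as long as each input document was well-formed, so one parser run can process all of them and each child of <super-root> corresponds to one original document.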


HTH

alek

[1] Massimo Valla wrote:

Hi Michael.
Thank you for your reply. I definitely agree on your point. The protocol is awful. But, unfortunately I cannot change the server side nor the protocol. I could assume that each document ends when the root tag is closed. So your example could be parsed and received as two documents:
1st doc:
   <?xml version="1.0" encoding="UTF-8"?>
   <root/>
2nd doc:
   <?xml version="1.0" encoding="UTF-8"?>
   <root2/>
leaving out the comment as not belonging to either of the two docs.

The problem is that with Xerces, as soon as I receive the first end tag SAX notification, the parser has already buffered part of the next XML message, so starting another parse command on the input stream will not work.

How can I set up a solution similar to FAQ-11 (of Xerces1) in Xerces2? More generally, how can I write a client with Xerces that is able to parse multiple XML documents coming from the socket?

(I have also tried other parsers: they allow char-by-char parsing and they would not close the input stream after a parse error, so I would be fine using them. But I would very much prefer to stay with Xerces as it is the parser used in Java 1.5...)

Thanks a lot,
Massimo
On 2/12/06, Michael Glavassevich <[EMAIL PROTECTED]> wrote:
> Hi Massimo,
>
> The KeepSocketOpen sample works because the server socket tells the client
> how many bytes there are in the document. If the server has no protocol
> for communicating the boundaries between XML documents, how can you tell
> where one begins and another ends?
>
> Consider if your client receives this from the socket:
>
> <root/>
> 
> <root2/>
>
> How would you know whether you've received two documents or one not
> well-formed document containing multiple root elements? And if this is
> processed as two documents does the comment belong to the first or the
> second? Only the sender could know that.
>
> Thanks.
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: [EMAIL PROTECTED]
>
> Massimo Valla <[EMAIL PROTECTED]> wrote on 02/04/2006 09:54:25 PM:
>
> > Hi,
> > I am trying to read multiple XML files from a socket using JAXP 1.3
> > / Xerces-J 2.7.1.
> >
> > Unfortunately the KeepSocketOpen example in Xerces2 Socket Sample
> > (http://xerces.apache.org/xerces2-j/samples-socket.html) does not
> > work for me, because I have no control over the other side of the
> > socket.
> >
> > Also FAQ-11 of Xerces1 (http://xerces.apache.org/xerces-j/faq-write.html#faq-11)
> > does not help anymore, because the
> > StreamingCharFactory class used there to prevent buffering cannot be
> > used in Xerces2 (cannot compile the class).
> >
> > I have been trying to find a solution to this for a while now, but I
> > could not come to an end.
> >
> > Can anybody provide a simple example on how to read multiple XML
> > docs from a socket InputStream?
> >
> > Thanks a lot,
> > Massimo
>




--
The best way to predict the future is to invent it - Alan Kay




