Re: still unsolved: parsing multiple XML from socket

Aleksander Slominski Sun, 05 Mar 2006 22:09:53 -0800

Peter Hendry wrote:
>
>
> Aleksander Slominski wrote:
>> Peter Hendry wrote:
>>   
>>> Aleksander Slominski wrote:
>>>     
>>>> Peter Hendry wrote:
>>>>   
>>>>       
>>>>> More efficient still would be to keep track of the last '<' and
>>>>> whether it had a '/' after it - then allow returning past '>' if the
>>>>> last '<' didn't have a '/' (confused? :-) ).
>>>>>     
>>>>>         
>>>> not sure if it is going to work with CDATA section (they can contain
>>>> "unbalanced" XML)
>>>>   
>>>>       
>>> Why not? It is just an optimization. You can't have CDATA in an end
>>> tag so you should not go past the '>' of the end tag whether there is
>>> CDATA or not.
>>>     
>> CDATA section can have unbalanced XML so beyond tracking of < and /> you
>> need to track CDATA sections as well AFAICS.
>>
>>   
>
> This is not true. From previous mails I would assume that a SAX parser
> is being used. 
where did this assumption come from (i must have missed it in the long
thread :) )? if you look at the top post, it talks about tracking '<'
and '</'. there is no need to track them if you use an XML parser (such
as the SAX API): then you just need to track indentationLevel
(startElement: ++indentationLevel; endElement: --indentationLevel;
if (indentationLevel == 0) { parser is on the last end element })
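
A minimal sketch of the depth counter described above, assuming Python's
xml.sax as a stand-in for the SAX API mentioned in the thread. The names
DepthHandler and feed_until_document_end are hypothetical, and the chunk
boundaries are only an illustration of data arriving from a socket.

```python
import xml.sax


class DepthHandler(xml.sax.ContentHandler):
    """Tracks element nesting; depth returns to 0 at the root end tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0              # plays the role of indentationLevel
        self.document_done = False

    def startElement(self, name, attrs):
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1
        if self.depth == 0:
            # The parser is on the last end element of this document.
            self.document_done = True


def feed_until_document_end(chunks):
    """Feed chunks to an incremental parser until the root element closes.

    Returns (number of chunks consumed, whether the document completed).
    """
    handler = DepthHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    used = 0
    for chunk in chunks:
        parser.feed(chunk)
        used += 1
        if handler.document_done:
            break                   # do not feed bytes of the next document
    if handler.document_done:
        parser.close()
    return used, handler.document_done


# The last chunk stands for the start of a second document on the socket.
chunks = [b"<root>", b"<x>", b"<![CDATA[<x<t/>", b"///>", b">", b">",
          b"/>", b"</x>", b"]]>", b"</x>", b"</root>", b"<root>"]
used, done = feed_until_document_end(chunks)
print(used, done)
```

The CDATA content never reaches startElement or endElement, so the counter
is unaffected by the unbalanced markup inside it.
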
> In that you will receive startElement and endElement events. It is
> those that are being matched up. What is in the CDATA doesn't matter.
> Until you get the endElement that matches the first startElement you
> will continue to honor read() requests but always return up to the
> next '>' (optionally optimizing with '</' checking as well). If a
> CDATA contains a '</' or any unmatched '<' or '>' it doesn't matter as
> you will read past them on the next read. Within a real end tag
> (outside cdata) you cannot get any other '>' or '<' so there is no
> problem.
>
> The optimisation would also have to account for empty tags '<x/>'.
>
> An example,
>
>   <root><x><![CDATA[<x<t/>///>>>/></x>]]></x></root>
>
> Without optimization, the reads would return the following
>
>     <root>
>     <x>
>     <![CDATA[<x<t/>
>     ///>
>     >
>     >
>     />
>     </x>
>     ]]>
>     </x>
>     </root>
>
> at which point the endElement event would return the depth back to 0
> and so it is known it is the end of the document.
>
> With some optimization - return if '>' and '/' have been seen since
> last '<'
>
>     <root><x><![CDATA[<x<t/>
>     ///>
>     >>/>
>     </x>
>     ]]></x>
>     </root>
>
> and again at this point endElement is called and returns the depth to
> 0 so the end of the document has been reached.
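
The two read sequences quoted above can be reproduced with a small
splitter. This is only a sketch: split_reads and require_slash are
hypothetical names, the sample string uses the standard <![CDATA[ opener,
and resetting the '/' flag at the start of every read is an assumption
inferred from the quoted output (it is what yields the ">>/>" read).

```python
def split_reads(data, require_slash=False):
    """Split data into the pieces successive read() calls would return.

    Every read ends at a '>'. With require_slash=True a read ends at a '>'
    only if a '/' was seen since the last '<' (or since the read began).
    """
    reads = []
    start = 0
    slash_seen = False
    for i, ch in enumerate(data):
        if ch == "<":
            slash_seen = False
        elif ch == "/":
            slash_seen = True
        elif ch == ">" and (slash_seen or not require_slash):
            reads.append(data[start:i + 1])
            start = i + 1
            slash_seen = False      # assumption: flag resets per read
    if start < len(data):
        reads.append(data[start:])  # trailing bytes without a closing '>'
    return reads


doc = "<root><x><![CDATA[<x<t/>///>>>/></x>]]></x></root>"
for piece in split_reads(doc):
    print(piece)
print("--")
for piece in split_reads(doc, require_slash=True):
    print(piece)
```

Neither variant knows where the document ends; both rely on the parser
behind the stream to report the end element that brings the depth to 0.
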
>
> I still don't see the need to track CDATA in this?
if you use an XML parser that is true, as it deals with the actual
_parsing_. tracking < and </ can be much faster though in some
situations, when you pre-parse or scan the input.
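
A rough sketch of such a pre-scan, with no XML parser involved. The name
find_document_end and the track_cdata switch are hypothetical, and the
scanner assumes no '>' inside attribute values, processing instructions or
a DOCTYPE. With track_cdata=False it shows why a pure scanner has to know
about CDATA sections.

```python
def find_document_end(data, track_cdata=True):
    """Return the index just past the root end tag, or -1 if incomplete."""
    depth = 0
    i = 0
    while True:
        lt = data.find("<", i)
        if lt == -1:
            return -1
        if track_cdata and data.startswith("<![CDATA[", lt):
            close = data.find("]]>", lt + 9)
            if close == -1:
                return -1
            i = close + 3           # skip the whole section unexamined
            continue
        if data.startswith("<!--", lt):
            close = data.find("-->", lt + 4)
            if close == -1:
                return -1
            i = close + 3
            continue
        gt = data.find(">", lt + 1)
        if gt == -1:
            return -1
        if data.startswith("</", lt):
            depth -= 1              # end tag
            if depth == 0:
                return gt + 1
        elif data.startswith("<?", lt) or data.startswith("<!", lt):
            pass                    # declaration or PI: depth unchanged
        elif data[gt - 1] == "/":
            if depth == 0:          # empty root element such as <x/>
                return gt + 1
        else:
            depth += 1              # start tag
        i = gt + 1


doc = "<root><x><![CDATA[<x<t/>///>>>/></x>]]></x></root>"
stream = doc + "<root>second</root>"
print(stream[:find_document_end(stream)])
print(stream[:find_document_end(stream, track_cdata=False)])
```

With CDATA tracking the scan stops exactly after the first </root>; without
it, the </x> inside the CDATA section is counted as a real end tag and the
scan stops one element too early.
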


best,

alek

-- 
The best way to predict the future is to invent it - Alan Kay

