On Nov 2, 2012, at 7:24 AM, Michael Van Canneyt <mich...@freepascal.org> wrote:

> 
> 
> On Thu, 1 Nov 2012, Andrew Brunner wrote:
> 
>> I'm having a problem getting the XML parser to read.
>> 
>> Is there any way I can get the attached program to work by changing a 
>> parsing option to a less strict one?  My XML documents exceed 1-2 GB since 
>> they represent files, so having to convert/scan each byte is unacceptable.
> 
> I suggest you switch to something other than XML, if that's an option.
> XML is notoriously slow to load.
> 

I don't know that I can switch at this point; it isn't practical.  I could 
just take the FPC XML components and derive something outside the scope of 
the FPC project.


>> 
>> Is there another XML parser component that can establish a DOM?  Or is this 
>> a bug in the fpc XML component?
> 
> This is not a bug, it is prescribed behaviour.

The function AnsiToUtf8 is supposed to convert data to UTF-8, so the string in 
the sample should have the proper UTF-8 encoding, and the parser should be 
able to read it.

In the past, I was able to parse ANSI strings, but only after converting them 
to UTF-8.  The attached program, however, fails 100% of the time.
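
One way to confirm whether converted data really is valid UTF-8 is to look 
for the first offending byte.  A minimal sketch in Python (not FPC code; the 
sample document and the helper name first_invalid_utf8 are hypothetical and 
merely stand in for the attached program, which is not reproduced here):

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) for the first invalid UTF-8
    sequence in data, or None if the whole buffer decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start:err.end]
    return None


# Hypothetical sample: the header claims UTF-8 in both cases.
header = b'<?xml version="1.0" encoding="UTF-8"?>'
body = '<file name="caf\u00e9"/>'

good = header + body.encode("utf-8")    # properly converted
bad = header + body.encode("latin-1")   # single-byte data left unconverted

print(first_invalid_utf8(good))
print(first_invalid_utf8(bad))
```

If a check like this reports an offset for the converted string, the 
conversion step left single-byte data in place and the parser is right to 
reject it; if it reports nothing, the problem lies elsewhere.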

> 
> The XML components must work on any XML document that exists out there.
> As a consequence, the codepage in the XML must be checked and converted if 
> need be.
> 
The input data in the attached example has already been converted.  


> Imagine you have an XML file encoded in UTF-16, and we assume it's UTF-8. 
> The resulting DOM tree would be unusable.
> 

True. 
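
To illustrate that failure mode, here is a small Python sketch (again not FPC 
code; the document text and the helper name decode_as are made up for 
illustration).  It encodes a document as UTF-16 and then tries to read the 
bytes under both assumptions:

```python
def decode_as(data: bytes, encoding: str):
    """Decode data with the given encoding; return None if it is invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Hypothetical document, stored as UTF-16 (byte-order mark + 2 bytes/char).
doc = '<?xml version="1.0" encoding="UTF-16"?><root>caf\u00e9</root>'
raw = doc.encode("utf-16")

wrong = decode_as(raw, "utf-8")    # assuming UTF-8 without checking
right = decode_as(raw, "utf-16")   # honouring the actual encoding

print(wrong)
print(right == doc)
```

The byte-order mark alone is not valid UTF-8, so the wrong assumption fails 
on the very first byte here.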


>> Any help or feedback is entirely welcome and needed.  This data is currently 
>> in at least one stream and is failing my cloud desktop sync application.
> 
> You'll have to write your own XML handling routines which work only with the 
> codepage the XML is in. And be prepared that they will fail as soon as the 
> encoding of the XML changes.
> 

Right.  But converting the data to, say, UTF-8 should have worked.  I have 
explicitly set the encoding to UTF-8 in the header.  

>> 
>> I would really love an option to disable byte-for-byte XML checking during 
>> parsing.

I think it would be a good solution and would even prove faster in controlled 
environments.  Besides, all data is stored as WideStrings in the DOM anyway. 

My first question: if I wrote a patch adding such an option, would it be 
accepted? 

My next question: why did the UTF-8 conversion routine leave the offending 
byte sequence in my sample data intact instead of converting it?


_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel