Re: [fpc-devel] XML Components
I just added this Prepare method to my database API.

  class function XML.IsInvalid(var Value:Byte):boolean;
  begin
    // control bytes below 32, except TAB (9), LF (10) and CR (13)
    Result:=(Value<9) or (Value=11) or (Value=12) or ((Value>13) and (Value<32));
  end;

  class function XML.Prepare(var sInput:string; Refactor:TStream):string;
  var
    bChar:byte;
    iLcv:Int64;
    iLen:Int64;
    sReplace:string;
  begin
    Refactor.Size:=0;
    for iLcv:=1 to System.Length(sInput) do begin
      bChar:=Byte(sInput[iLcv]);
      if IsInvalid(bChar) then begin
        // replace the byte with a numeric character reference
        sReplace:=Concat('&#',IntToStr(bChar),';');
        iLen:=System.Length(sReplace);
        Refactor.Write(sReplace[1],iLen);
      end else
        Refactor.Write(bChar,1);
    end;
    System.SetLength(Result,Refactor.Size);
    if Refactor.Size>0 then begin
      Refactor.Position:=0; // rewind before reading the result back
      Refactor.Read(Result[1],Refactor.Size);
    end;
    Refactor.Size:=0;
  end;

The question is, what is going to happen when the encoding is UTF-8 or UTF-16? Will this code allow bytes to go by without messing them all up?

--
Andrew Brunner
Aurawin LLC 512.574.6298 http://aurawin.com/
Aurawin is a great new way to store, share, and enjoy your photos, videos, music and more.
___
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
Re: [fpc-devel] XML Components
Andrew Brunner wrote:

> I just added this Prepare method to my database API.
>
>   class function XML.IsInvalid(var Value:Byte):boolean;
>   begin
>     Result:=(Value<9) or (Value=11) or (Value=12) or ((Value>13) and (Value<32));
>   end;
> [...]
> The question is, what is going to happen when the encoding is UTF-8 or UTF-16? Will this code allow bytes to go by without messing them all up?

The control characters, and in fact the full ASCII character set, have the same character values in *every* Unicode encoding. In UTF-8 the multi-byte (escape) sequences consist only of bytes above 127, so no confusion with control characters is possible. In UTF-16 you'll have problems when converting a WideString or UnicodeString into an AnsiString.

I'd suggest that you use the predefined DOM string and char types, which IMO currently are WideString/WideChar.

DoDi
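The point about UTF-8 can be checked with a small sketch. It is not code from the thread: the names IsForbiddenControl and CountForbidden and the sample bytes are made up for illustration, and the control-byte test mirrors the IsInvalid check quoted above.

```pascal
program Utf8ControlBytes;
{$mode objfpc}{$H+}

// Hypothetical helper: True for bytes below 32 other than TAB, LF and CR.
function IsForbiddenControl(B: Byte): Boolean;
begin
  Result := (B < 32) and not (B in [9, 10, 13]);
end;

// Counts forbidden control bytes in a buffer of UTF-8 encoded bytes.
function CountForbidden(const Data: array of Byte): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := Low(Data) to High(Data) do
    if IsForbiddenControl(Data[I]) then
      Inc(Result);
end;

const
  // 'caf', then U+00E9 as the UTF-8 pair $C3 $A9, then a literal ESC, then 'x'.
  Sample: array[0..6] of Byte =
    (Ord('c'), Ord('a'), Ord('f'), $C3, $A9, 27, Ord('x'));
begin
  // Only the ESC byte is counted; both bytes of the multi-byte sequence are above 127.
  WriteLn(CountForbidden(Sample)); // prints 1
end.
```

The same byte-wise scan is not safe on UTF-16 data: a code unit such as U+011B contains the byte $1B, so the check would have to run on 16-bit code units instead.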
Re: [fpc-devel] XML Components
On Thu, 1 Nov 2012, Andrew Brunner wrote:

> I'm having a problem getting the XML parser to read. Is there any way I can get the attached program to work by changing a parsing option to one less strict? My XML documents get over 1-2 GBs since they represent files. So having to convert/scan each byte is unacceptable.

I suggest you switch to something other than XML, if that's an option. XML is notoriously slow to load.

> Is there another XML parser component that can establish a DOM? Or is this a bug in the fpc XML component?

This is not a bug, it is prescribed behaviour.

The XML components must work on any XML document that exists out there. As a consequence, the codepage in the XML must be checked and converted if need be.

Imagine you have an XML file encoded in UTF-16, and we assume it's UTF-8. The resulting DOM tree would be unusable.

> Any help or feedback is entirely welcome and needed. This data is currently in at least 1 stream and failing my cloud desktop sync application.

You'll have to write your own XML handling routines which work only with the codepage the XML is in. And be prepared that they will fail as soon as the encoding of the XML changes.

> I would really love an option to disable XML byte for byte checking during parsing.

That's not likely to happen.

Michael.
Re: [fpc-devel] XML Components
On Nov 2, 2012, at 7:24 AM, Michael Van Canneyt mich...@freepascal.org wrote:

> On Thu, 1 Nov 2012, Andrew Brunner wrote:
>> I'm having a problem getting the XML parser to read. Is there any way I can get the attached program to work by changing a parsing option to one less strict? My XML documents get over 1-2 GBs since they represent files. So having to convert/scan each byte is unacceptable.
> I suggest you switch to something other than XML, if that's an option. XML is notoriously slow to load.

I don't know if at this point I am able to switch. It's not practical. I could just grab PFC XML components and derive something outside FPC project scope.

>> Is there another XML parser component that can establish a DOM? Or is this a bug in the fpc XML component?
> This is not a bug, it is prescribed behaviour.

The function AnsiToUtf8 is supposed to convert data to UTF-8. So the string in the sample should have the proper UTF-8 encoding, and the parser should be able to read it. In the past, I was able to parse ANSI strings, but only after converting to UTF-8. But the attached program fails 100% of the time.

> The XML components must work on any XML document that exists out there. As a consequence, the codepage in the XML must be checked and converted if need be.

The input data in the example attached is converted.

> Imagine you have an XML file encoded in UTF-16, and we assume it's UTF-8. The resulting DOM tree would be unusable.

True.

>> Any help or feedback is entirely welcome and needed. This data is currently in at least 1 stream and failing my cloud desktop sync application.
> You'll have to write your own XML handling routines which work only with the codepage the XML is in. And be prepared that they will fail as soon as the encoding of the XML changes.

Right. But converting the data to, say, UTF-8 should have worked. I have explicitly set the encoding to UTF-8 in the header.

>> I would really love an option to disable XML byte for byte checking during parsing.

I think it would be a good solution and even prove faster in controlled environments. Plus all data is stored as widestrings in the DOM. The first question I have is: if there were such an option, would the patch be accepted?

The next question is: what is the problem with the UTF-8 routine that it left the offending byte sequence intact without converting the bytes in my sample data?
Re: [fpc-devel] XML Components
On Fri, 2 Nov 2012, Andrew Brunner wrote:

>> As a consequence, the codepage in the XML must be checked and converted if need be.
> The input data in the example attached is converted.

There is no attachment to your mail.

>> Imagine you have an XML file encoded in UTF-16, and we assume it's UTF-8. The resulting DOM tree would be unusable.
> True.
>>> Any help or feedback is entirely welcome and needed. This data is currently in at least 1 stream and failing my cloud desktop sync application.
>> You'll have to write your own XML handling routines which work only with the codepage the XML is in. And be prepared that they will fail as soon as the encoding of the XML changes.
> Right. But converting the data to, say, UTF-8 should have worked. I have explicitly set the encoding to UTF-8 in the header.

Without looking at the data and the errors you get, it's impossible to say anything useful.

>>> I would really love an option to disable XML byte for byte checking during parsing.
> I think it would be a good solution and even prove faster in controlled environments. Plus all data is stored as widestrings in the DOM. The first question I have is: if there were such an option, would the patch be accepted?

I don't see how you can fix the problem. If the input is UTF-8, and the result must be converted to a widestring for the DOM, then a conversion MUST take place, there is no way to avoid it. And a conversion means scanning the input byte for byte.

In any case, the input must be scanned byte for byte anyway, to detect all the tags. That's what makes XML slow and unusable for large amounts of data.

> The next question is: what is the problem with the UTF-8 routine that it left the offending byte sequence intact without converting the bytes in my sample data?

Without an error message, it is impossible to tell.

Michael.
Re: [fpc-devel] XML Components
On 11/02/2012 08:08 AM, Michael Van Canneyt wrote:

> There is no attachment to your mail.

The attachment was in my first posting. But just in case, I've attached it again. Please feel free to check it out. The example is stripped of most of the XML that was successfully parsed. This was from about a 5 MB stream. I grabbed the XML surrounding the error position.

Thanks Michael.

  program unknown;

  {$mode objfpc}{$H+}

  uses
    {$IFDEF UNIX}{$IFDEF UseCThreads}
    cthreads,
    {$ENDIF}{$ENDIF}
    Classes, DOM, XMLRead;

  procedure Test();
  var
    FXMLParser   : TDOMParser;
    FXMLDocument : TXMLDocument;
    FXMLSource   : TXMLInputSource;
    sData        : String;
  begin
    sData:=Concat(
      '<?xml version="1.0" encoding='+
      '"UTF-8"', // native platform is LATIN1
      '?>',
      '<value>',
      '<![CDATA[$B0u testt]]>',
      '</value>'
    );
    sData:=System.AnsiToUtf8(sData);
    FXMLParser:=TDOMParser.Create();
    Try
      FXMLSource:=TXMLInputSource.Create(sData);
      Try
        FXMLParser.Parse(FXMLSource,FXMLDocument);
        Try
        Finally
          FXMLDocument.Free();
        end;
      Finally
        FXMLSource.Free();
      end;
    Finally
      FXMLParser.Free();
    end;
  end;

  begin
    Test();
  end.
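For comparison, here is a minimal sketch of the same experiment with the error surfaced instead of left unhandled. It assumes the fcl-xml units DOM and XMLRead, the ReadXMLFile overload that takes a TStream, and the EXMLReadError exception class; TryParse, Head and Tail are hypothetical names, and the literal ESC (#27) in the first document is an assumption based on the diagnosis given later in the thread.

```pascal
program ParseWithEsc;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, XMLRead;

// Hypothetical helper: parses Xml and reports whether the parser accepted it.
function TryParse(const Xml: string): Boolean;
var
  Doc: TXMLDocument;
  Src: TStringStream;
begin
  Result := False;
  Doc := nil;
  Src := TStringStream.Create(Xml);
  try
    try
      ReadXMLFile(Doc, Src);
      Result := True;
    except
      on E: EXMLReadError do
        WriteLn('parse error: ', E.Message);
    end;
  finally
    Doc.Free; // safe when Doc is nil
    Src.Free;
  end;
end;

const
  Head = '<?xml version="1.0" encoding="UTF-8"?><value><![CDATA[';
  Tail = ']]></value>';
begin
  // First document carries a literal ESC inside the CDATA section (assumed).
  WriteLn(TryParse(Head + #27 + '$B0u test' + Tail));
  // Second document is identical except that the ESC byte is left out.
  WriteLn(TryParse(Head + '$B0u test' + Tail));
end.
```

Printing E.Message gives the error text and position that the replies in this thread ask for. Whether the first call is rejected depends on the fcl-xml version in use.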
Re: [fpc-devel] XML Components
Andrew Brunner atbrun...@aurawin.com wrote on 2 November 2012 at 13:59:

> On Nov 2, 2012, at 7:24 AM, Michael Van Canneyt mich...@freepascal.org wrote:
>> On Thu, 1 Nov 2012, Andrew Brunner wrote:
>>> I'm having a problem getting the XML parser to read. Is there any way I can get the attached program to work by changing a parsing option to one less strict? My XML documents get over 1-2 GBs since they represent files. So having to convert/scan each byte is unacceptable.
>> I suggest you switch to something other than XML, if that's an option. XML is notoriously slow to load.
> I don't know if at this point I am able to switch. It's not practical. I could just grab PFC XML components and derive something outside FPC project scope.

Have you tried the XML units in Lazarus? They are 99% the fpc units, just ported to use UTF-8 strings instead of widestrings.

Mattias
Re: [fpc-devel] XML Components
02.11.2012 17:08, Michael Van Canneyt wrote:

> On Fri, 2 Nov 2012, Andrew Brunner wrote:
>> I think it would be a good solution and even prove faster in controlled environments. Plus all data is stored as widestrings in the DOM. The first question I have is: if there were such an option, would the patch be accepted?
> I don't see how you can fix the problem. If the input is UTF-8, and the result must be converted to a widestring for the DOM, then a conversion MUST take place, there is no way to avoid it. And a conversion means scanning the input byte for byte.
> In any case, the input must be scanned byte for byte anyway, to detect all the tags. That's what makes XML slow and unusable for large amounts of data.
>> The next question is: what is the problem with the UTF-8 routine that it left the offending byte sequence intact without converting the bytes in my sample data?
> Without an error message, it is impossible to tell.

In this case, the issue is not encoding, but a literal ESC (#27) code used in the data. The XML specification does not allow codepoints below 32, except TAB, CR and LF, to appear in data, both in literal and escaped forms.

In other words, XML is the wrong technology to work with binary data, unless it is encoded into textual form (Base64 or alike).

Regards,
Sergei
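The Base64 suggestion can be sketched with the base64 unit that ships with FPC (EncodeStringBase64 and DecodeStringBase64). The payload and the element name value are placeholders, not data from the thread.

```pascal
program Base64Payload;
{$mode objfpc}{$H+}

uses
  base64;

var
  Raw, Encoded: string;
begin
  // Placeholder payload containing a literal ESC, which XML 1.0 rejects as-is.
  Raw := 'test' + #27 + 'data';

  // Base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=',
  // all of which are legal XML characters.
  Encoded := EncodeStringBase64(Raw);
  WriteLn('<value>', Encoded, '</value>');

  // The receiver decodes the element content to recover the original bytes.
  if DecodeStringBase64(Encoded) = Raw then
    WriteLn('round trip ok');
end.
```

The cost is a larger document and an extra encode/decode pass over the payload.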
Re: [fpc-devel] XML Components
On Nov 2, 2012, at 8:32 AM, Sergei Gorelkin sergei_gorel...@mail.ru wrote:

> In this case, the issue is not encoding, but a literal ESC (#27) code used in the data. The XML specification does not allow codepoints below 32, except TAB, CR and LF, to appear in data, both in literal and escaped forms.
> In other words, XML is the wrong technology to work with binary data, unless it is encoded into textual form (Base64 or alike).

Ok. The data comes from a summary function that grabs a few pieces of an email message in this case: the subject and the top 2 lines of the message. Email is text, so memory corruption would be the most likely source of any low-order bytes like 0. Actual file data, when streamed, is sent via MIME.
Re: [fpc-devel] XML Components
Sergei Gorelkin sergei_gorel...@mail.ru wrote on 2 November 2012 at 14:32:

> 02.11.2012 17:08, Michael Van Canneyt wrote:
>> On Fri, 2 Nov 2012, Andrew Brunner wrote:
>>> I think it would be a good solution and even prove faster in controlled environments. Plus all data is stored as widestrings in the DOM. The first question I have is: if there were such an option, would the patch be accepted?
>> I don't see how you can fix the problem. If the input is UTF-8, and the result must be converted to a widestring for the DOM, then a conversion MUST take place, there is no way to avoid it. And a conversion means scanning the input byte for byte.
>> In any case, the input must be scanned byte for byte anyway, to detect all the tags. That's what makes XML slow and unusable for large amounts of data.
>>> The next question is: what is the problem with the UTF-8 routine that it left the offending byte sequence intact without converting the bytes in my sample data?
>> Without an error message, it is impossible to tell.
> In this case, the issue is not encoding, but a literal ESC (#27) code used in the data. The XML specification does not allow codepoints below 32, except TAB, CR and LF, to appear in data, both in literal and escaped forms.

Actually the specification only defines legal characters and that processors must accept them. It does not say what to do with the other characters.

> In other words, XML is the wrong technology to work with binary data, unless it is encoded into textual form (Base64 or alike).

True.

Mattias
Re: [fpc-devel] XML Components
02.11.2012 17:44, Mattias Gaertner wrote:

> Sergei Gorelkin sergei_gorel...@mail.ru wrote on 2 November 2012 at 14:32:
>> In this case, the issue is not encoding, but a literal ESC (#27) code used in the data. The XML specification does not allow codepoints below 32, except TAB, CR and LF, to appear in data, both in literal and escaped forms.
> Actually the specification only defines legal characters and that processors must accept them. It does not say what to do with the other characters.

Besides the specification, there is a test suite containing lots of tests with illegal characters and expecting them all to fail.

Regards,
Sergei
Re: [fpc-devel] XML Components
So where in the specs does it say that parsers must reject certain byte sequences between CDATA tags, other than XML tags? If this is supported by the specs, it would help shape a viable solution.

On Nov 2, 2012, at 9:01 AM, Sergei Gorelkin sergei_gorel...@mail.ru wrote:

> 02.11.2012 17:44, Mattias Gaertner wrote:
>> Sergei Gorelkin sergei_gorel...@mail.ru wrote on 2 November 2012 at 14:32:
>>> In this case, the issue is not encoding, but a literal ESC (#27) code used in the data. The XML specification does not allow codepoints below 32, except TAB, CR and LF, to appear in data, both in literal and escaped forms.
>> Actually the specification only defines legal characters and that processors must accept them. It does not say what to do with the other characters.
> Besides the specification, there is a test suite containing lots of tests with illegal characters and expecting them all to fail.
>
> Regards,
> Sergei
Re: [fpc-devel] XML Components
On Fri, 2 Nov 2012, Andrew Brunner wrote:

> So where in the specs does it say that parsers must reject certain byte sequences between CDATA tags, other than XML tags? If this is supported by the specs, it would help shape a viable solution.

Where did you get that it is supported?

The specs list the allowed characters. Section 2.2:

  [Definition: A character is an atomic unit of text as specified by ISO/IEC 10646:2000 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.

Therefore, any character not in the list is not a legal character and should be rejected.

Speaking from painful experience: relaxing this and silently ignoring these illegal characters will at some point lead to problems when you encounter a system that enforces the rules more strictly. You will then wonder how it can be that the XML is considered invalid when your own XML code handles it so nicely.

Whereas now, you already know. To show that this is not just idle talk: I ran your XML through the MS-XML parser, and it complained just as well about an illegal character in the input.

Forewarned is forearmed.

Michael.
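The character list from section 2.2 corresponds to the Char production of XML 1.0: #x9, #xA, #xD, [#x20-#xD7FF], [#xE000-#xFFFD] and [#x10000-#x10FFFF]. A minimal sketch of that check follows; the function name IsXml10Char is made up for illustration and is not part of fcl-xml.

```pascal
program Xml10Char;
{$mode objfpc}{$H+}

// Hypothetical helper: True when CodePoint matches the XML 1.0 Char production.
function IsXml10Char(CodePoint: Cardinal): Boolean;
begin
  Result := (CodePoint = $9) or (CodePoint = $A) or (CodePoint = $D)
    or ((CodePoint >= $20) and (CodePoint <= $D7FF))
    or ((CodePoint >= $E000) and (CodePoint <= $FFFD))
    or ((CodePoint >= $10000) and (CodePoint <= $10FFFF));
end;

begin
  WriteLn(IsXml10Char(9));   // TAB: prints TRUE
  WriteLn(IsXml10Char(27));  // ESC: prints FALSE
  WriteLn(IsXml10Char($E9)); // U+00E9: prints TRUE
end.
```

The check works on decoded code points, not on raw bytes, which is why the parser has to decode the input before it can validate it.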
Re: [fpc-devel] XML Components
On 02-11-2012 14:32, Sergei Gorelkin wrote:

> 02.11.2012 17:08, Michael Van Canneyt wrote:
>> On Fri, 2 Nov 2012, Andrew Brunner wrote:
>>> I think it would be a good solution and even prove faster in controlled environments. Plus all data is stored as widestrings in the DOM. The first question I have is: if there were such an option, would the patch be accepted?
>> I don't see how you can fix the problem. If the input is UTF-8, and the result must be converted to a widestring for the DOM, then a conversion MUST take place, there is no way to avoid it. And a conversion means scanning the input byte for byte.
>> In any case, the input must be scanned byte for byte anyway, to detect all the tags. That's what makes XML slow and unusable for large amounts of data.
>>> The next question is: what is the problem with the UTF-8 routine that it left the offending byte sequence intact without converting the bytes in my sample data?
>> Without an error message, it is impossible to tell.
> In this case, the issue is not encoding, but a literal ESC (#27) code used in the data. The XML specification does not allow codepoints below 32, except TAB, CR and LF, to appear in data, both in literal and escaped forms.
> In other words, XML is the wrong technology to work with binary data, unless it is encoded into textual form (Base64 or alike).

XML 1.1 allows anything down to #1, but the current parser doesn't seem to allow that. I guess that should solve most of the problems here.

Specifically, TXMLDecodingSource.SkipUntil doesn't allow #1..#31 if FXML11Rules is true, which I think it should.
Re: [fpc-devel] XML Components
On Fri, 2 Nov 2012, Jeppe Græsdal Johansen wrote:

>> and LF, to appear in data, both in literal and escaped forms. In other words, XML is the wrong technology to work with binary data, unless it is encoded into textual form (Base64 or alike).
> XML 1.1 allows anything down to #1, but the current parser doesn't seem to allow that. I guess that should solve most of the problems here.
> Specifically, TXMLDecodingSource.SkipUntil doesn't allow #1..#31 if FXML11Rules is true, which I think it should.

But the document prolog specified XML version 1.0, so these characters are not allowed. If of course Andrew creates the XML himself, he can specify 1.1 as the XML version, then that may well be a solution.

Note that the specs of 1.1 still say:

  The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
  [#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
  [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF],
  [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF],
  [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
  [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].

So it would be wise to replace the characters with encoded data.

Michael.
Re: [fpc-devel] XML Components
On 02-11-2012 18:04, Michael Van Canneyt wrote:

> On Fri, 2 Nov 2012, Jeppe Græsdal Johansen wrote:
>>> and LF, to appear in data, both in literal and escaped forms. In other words, XML is the wrong technology to work with binary data, unless it is encoded into textual form (Base64 or alike).
>> XML 1.1 allows anything down to #1, but the current parser doesn't seem to allow that. I guess that should solve most of the problems here.
>> Specifically, TXMLDecodingSource.SkipUntil doesn't allow #1..#31 if FXML11Rules is true, which I think it should.
> But the document prolog specified XML version 1.0, so these characters are not allowed. If of course Andrew creates the XML himself, he can specify 1.1 as the XML version, then that may well be a solution.

Yes, but changing the version still generates the same error, even though it shouldn't.

> Note that the specs of 1.1 still say:
>   The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
>   [#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
>   [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF],
>   [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF],
>   [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
>   [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].
> So it would be wise to replace the characters with encoded data.

It's true that it's not a good idea, but it should work at least :)
Re: [fpc-devel] XML Components
On Fri, 2 Nov 2012, Jeppe Græsdal Johansen wrote:

> On 02-11-2012 18:04, Michael Van Canneyt wrote:
>> On Fri, 2 Nov 2012, Jeppe Græsdal Johansen wrote:
>>>> and LF, to appear in data, both in literal and escaped forms. In other words, XML is the wrong technology to work with binary data, unless it is encoded into textual form (Base64 or alike).
>>> XML 1.1 allows anything down to #1, but the current parser doesn't seem to allow that. I guess that should solve most of the problems here.
>>> Specifically, TXMLDecodingSource.SkipUntil doesn't allow #1..#31 if FXML11Rules is true, which I think it should.
>> But the document prolog specified XML version 1.0, so these characters are not allowed. If of course Andrew creates the XML himself, he can specify 1.1 as the XML version, then that may well be a solution.
> Yes, but changing the version still generates the same error, even though it shouldn't.

That should be fixed of course, no argument there :-)

Michael.
Re: [fpc-devel] XML Components
02.11.2012 21:06, Jeppe Græsdal Johansen wrote:

> On 02-11-2012 18:04, Michael Van Canneyt wrote:
>> On Fri, 2 Nov 2012, Jeppe Græsdal Johansen wrote:
>>>> and LF, to appear in data, both in literal and escaped forms. In other words, XML is the wrong technology to work with binary data, unless it is encoded into textual form (Base64 or alike).
>>> XML 1.1 allows anything down to #1, but the current parser doesn't seem to allow that. I guess that should solve most of the problems here.
>>> Specifically, TXMLDecodingSource.SkipUntil doesn't allow #1..#31 if FXML11Rules is true, which I think it should.
>> But the document prolog specified XML version 1.0, so these characters are not allowed. If of course Andrew creates the XML himself, he can specify 1.1 as the XML version, then that may well be a solution.
> Yes, but changing the version still generates the same error, even though it shouldn't.
>> Note that the specs of 1.1 still say:
>>   The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
>>   [#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
>>   [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF],
>>   [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF],
>>   [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
>>   [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].
>> So it would be wise to replace the characters with encoded data.
> It's true that it's not a good idea, but it should work at least :)

To my knowledge, XML 1.1 supports codes below 32 only in escaped forms. &#27; is valid in XML 1.1 but not in XML 1.0.

Regards,
Sergei
Re: [fpc-devel] XML Components
02.11.2012 19:57, Andrew Brunner wrote:

> So where in the specs does it say that parsers must reject certain byte sequences between CDATA tags, other than XML tags? If this is supported by the specs, it would help shape a viable solution.

This is not supported. Encoding processing, line-feed normalization and invalid character rejection happen before any attempt to detect markup. This is necessary to support encodings like ISO-2022-JP, which uses ESC sequences to switch the meaning of subsequent characters.

CDATA is, in fact, even more restricted than text content. Outside of CDATA, a character unrepresentable in the target encoding can be ampersand-escaped. Within CDATA this cannot be done. CDATA is intended only to handle '<' and '&' as plaintext, nothing more.

Regards,
Sergei
Re: [fpc-devel] XML Components
On 02-11-2012 18:19, Sergei Gorelkin wrote:

> 02.11.2012 21:06, Jeppe Græsdal Johansen wrote:
>> On 02-11-2012 18:04, Michael Van Canneyt wrote:
>>> On Fri, 2 Nov 2012, Jeppe Græsdal Johansen wrote:
>>>>> and LF, to appear in data, both in literal and escaped forms. In other words, XML is the wrong technology to work with binary data, unless it is encoded into textual form (Base64 or alike).
>>>> XML 1.1 allows anything down to #1, but the current parser doesn't seem to allow that. I guess that should solve most of the problems here.
>>>> Specifically, TXMLDecodingSource.SkipUntil doesn't allow #1..#31 if FXML11Rules is true, which I think it should.
>>> But the document prolog specified XML version 1.0, so these characters are not allowed. If of course Andrew creates the XML himself, he can specify 1.1 as the XML version, then that may well be a solution.
>> Yes, but changing the version still generates the same error, even though it shouldn't.
>>> Note that the specs of 1.1 still say:
>>>   The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
>>>   [#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
>>>   [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF],
>>>   [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF],
>>>   [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
>>>   [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].
>>> So it would be wise to replace the characters with encoded data.
>> It's true that it's not a good idea, but it should work at least :)
> To my knowledge, XML 1.1 supports codes below 32 only in escaped forms. &#27; is valid in XML 1.1 but not in XML 1.0.

According to the spec, CDATA should allow them unescaped.

  [2]  Char  ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
  [20] CData ::= (Char* - (Char* ']]>' Char*))
Re: [fpc-devel] XML Components
02.11.2012 21:22, Jeppe Græsdal Johansen wrote:

> On 02-11-2012 18:19, Sergei Gorelkin wrote:
>> 02.11.2012 21:06, Jeppe Græsdal Johansen wrote:
>>> On 02-11-2012 18:04, Michael Van Canneyt wrote:
>>>> On Fri, 2 Nov 2012, Jeppe Græsdal Johansen wrote:
>>>>>> and LF, to appear in data, both in literal and escaped forms. In other words, XML is the wrong technology to work with binary data, unless it is encoded into textual form (Base64 or alike).
>>>>> XML 1.1 allows anything down to #1, but the current parser doesn't seem to allow that. I guess that should solve most of the problems here.
>>>>> Specifically, TXMLDecodingSource.SkipUntil doesn't allow #1..#31 if FXML11Rules is true, which I think it should.
>>>> But the document prolog specified XML version 1.0, so these characters are not allowed. If of course Andrew creates the XML himself, he can specify 1.1 as the XML version, then that may well be a solution.
>>> Yes, but changing the version still generates the same error, even though it shouldn't.
>>>> Note that the specs of 1.1 still say:
>>>>   The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
>>>>   [#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
>>>>   [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF],
>>>>   [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF],
>>>>   [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
>>>>   [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].
>>>> So it would be wise to replace the characters with encoded data.
>>> It's true that it's not a good idea, but it should work at least :)
>> To my knowledge, XML 1.1 supports codes below 32 only in escaped forms. &#27; is valid in XML 1.1 but not in XML 1.0.
> According to the spec, CDATA should allow them unescaped.
>
>   [2]  Char  ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
>   [20] CData ::= (Char* - (Char* ']]>' Char*))

I see now, they were allowed in the second edition of the specs, while the first edition did not allow them. The current fcl-xml implementation corresponds to the first edition. So it's indeed time to update.

Regards,
Sergei
Re: [fpc-devel] XML Components
On 11/2/2012 09:32, Sergei Gorelkin wrote:

> In other words, XML is the wrong technology to work with binary data, unless it is encoded into textual form (Base64 or alike).

encoding into textual form increases the size of the stream by at least 1/3rd... a 3M file will be a 4M stream when encoded...
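The one-third figure follows from Base64 emitting 4 output characters for every started group of 3 input bytes. A small sketch of that arithmetic, assuming padded Base64 without line breaks (Base64Length is a hypothetical name):

```pascal
program Base64Size;
{$mode objfpc}{$H+}

// Length of padded Base64 output for RawLen input bytes:
// 4 characters per started group of 3 bytes.
function Base64Length(RawLen: Int64): Int64;
begin
  Result := ((RawLen + 2) div 3) * 4;
end;

begin
  WriteLn(Base64Length(3000000)); // prints 4000000
  WriteLn(Base64Length(1));       // prints 4
end.
```

MIME-style output that wraps lines adds a little more on top, since every line break costs extra bytes.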
Re: [fpc-devel] XML Components
On Nov 2, 2012, at 6:39 PM, waldo kitty wkitt...@windstream.net wrote:

> On 11/2/2012 09:32, Sergei Gorelkin wrote:
>> In other words, XML is the wrong technology to work with binary data, unless it is encoded into textual form (Base64 or alike).
> encoding into textual form increases the size of the stream by at least 1/3rd... a 3M file will be a 4M stream

That's all true, but binary transfer is not doable on some mail servers. My server uses deflate stream compression. I have multicore servers. My larger problem is the datagram values containing strings that fail. So even with the binary data encoded (and inflated in size), I would still have to parse each byte. To me that is the problem.

Andrew Brunner
[fpc-devel] XML Components
I'm having a problem getting the XML parser to read. Is there any way I can get the attached program to work by changing a parsing option to one less strict? My XML documents get over 1-2 GBs since they represent files. So having to convert/scan each byte is unacceptable.

Is there another XML parser component that can establish a DOM? Or is this a bug in the fpc XML component?

Any help or feedback is entirely welcome and needed. This data is currently in at least 1 stream and failing my cloud desktop sync application.

I would really love an option to disable XML byte for byte checking during parsing.

--
Andrew Brunner

  program invalid;

  {$mode objfpc}{$H+}

  uses
    Classes, DOM, XMLRead;

  var
    FXMLParser   : TDOMParser;
    FXMLDocument : TXMLDocument;
    FXMLSource   : TXMLInputSource;
    sData        : String;
  begin
    sData:=Concat(
      '<?xml version="1.0" encoding='+
      '"UTF-8"', // native platform is LATIN1
      '?>',
      '<value>',
      '<![CDATA[$B0u testt]]>',
      '</value>'
    );
    sData:=System.AnsiToUtf8(sData);
    FXMLParser:=TDOMParser.Create();
    Try
      FXMLSource:=TXMLInputSource.Create(sData);
      Try
        FXMLParser.Parse(FXMLSource,FXMLDocument);
        Try
        Finally
          FXMLDocument.Free();
        end;
      Finally
        FXMLSource.Free();
      end;
    Finally
      FXMLParser.Free();
    end;
  end.