Re: [fpc-devel] XML Components

2012-11-03 Thread Andrew Brunner

I just added this Prepare method to my database API.

class function XML.IsInvalid(var Value:Byte):boolean;
begin
  Result:=(Value<9) or (Value=11) or (Value=12) or ((Value>13) and (Value<32));
end;

class function  XML.Prepare(var sInput:string; Refactor:TStream):string;
var
  bChar:byte;
  iLcv:Int64;
  iLen:Int64;
  sReplace:string;
begin
  Refactor.Size:=0;
  for iLcv:=1 to System.Length(sInput) do begin
    bChar:=Byte(sInput[iLcv]);
    if IsInvalid(bChar) then begin
      // replace the illegal control byte with a numeric character reference
      sReplace:=Concat('&#',IntToStr(bChar),';');
      iLen:=System.Length(sReplace);
      Refactor.Write(sReplace[1],iLen);
    end else
      Refactor.Write(bChar,1);
  end;
  System.SetLength(Result,Refactor.Size);
  if Refactor.Size>0 then begin
    Refactor.Position:=0; // rewind first, otherwise Read starts at the end of the stream
    Refactor.Read(Result[1],Refactor.Size);
  end;
  Refactor.Size:=0;
end;

The question is, what is going to happen when the encoding is UTF8 or 
UTF16?  Will this code allow bytes to go by without messing them all up?



--
Andrew Brunner

Aurawin LLC
512.574.6298
http://aurawin.com/

Aurawin is a great new way to store, share, and enjoy your
photos, videos, music and more.

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] XML Components

2012-11-03 Thread Hans-Peter Diettrich

Andrew Brunner wrote:

> I just added this Prepare method to my database API.
>
> class function XML.IsInvalid(var Value:Byte):boolean;
> begin
>   Result:=(Value<9) or (Value=11) or (Value=12) or ((Value>13) and (Value<32));
> end;
>
> [...]
>
> The question is, what is going to happen when the encoding is UTF8 or
> UTF16?  Will this code allow bytes to go by without messing them all up?


The control characters, and in fact the full ASCII character set, have the
same character values in *every* Unicode encoding. In UTF-8 the multi-byte
sequences consist only of bytes above 127, so no confusion with
control characters is possible.
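
A minimal sketch of the point above, assuming the text is held as raw UTF-8 bytes in a plain string. The name CountControlBytes and the sample text are illustrative and not from the thread. It shows that the bytes of multi-byte UTF-8 sequences never fall below 128, so a byte-wise scan for control codes only ever hits real control characters:

```pascal
program Utf8ControlBytes;

{$mode objfpc}{$H+}

// Counts the bytes below 32 in S, looking at raw bytes only.
function CountControlBytes(const S: string): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if Ord(S[i]) < 32 then
      Inc(Result);
end;

const
  // Two non-ASCII characters written out as raw UTF-8 bytes:
  // C3 A9 (e with acute accent) and E2 82 AC (euro sign).
  Utf8Text = #$C3#$A9#$E2#$82#$AC;

begin
  // Every byte of the multi-byte sequences is 128 or above.
  WriteLn(CountControlBytes(Utf8Text));
  // A literal ESC appended to the same text is still detected.
  WriteLn(CountControlBytes(Utf8Text + #27));
end.
```

This only covers UTF-8. For UTF-16 the scan would have to work on 16-bit code units rather than bytes.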


In UTF-16 encoding you'll have problems when converting a WideString or
UnicodeString into an AnsiString. I'd suggest that you use the
predefined DOM string and char types, which I believe are currently
WideString/WideChar.


DoDi



Re: [fpc-devel] XML Components

2012-11-02 Thread Michael Van Canneyt



On Thu, 1 Nov 2012, Andrew Brunner wrote:

> I'm having a problem getting the XML parser to read.
>
> Is there any way I can get the attached program to work by changing a parsing
> option to one less strict.  My XML documents get over 1-2 GBs since they
> represent files.  So having to convert /scan each byte is unacceptable.

I suggest you switch to something other than XML, if that's an option.
XML is notoriously slow to load.

> Is there another XML parser component that can establish a DOM?  Or is this a
> bug in the fpc XML component?

This is not a bug, it is prescribed behaviour.

The XML components must work on any XML document that exists out there.
As a consequence, the codepage in the XML must be checked and converted if need
be.

Imagine you have an XML file encoded in UTF-16, and we assume it's UTF-8.
The resulting DOM tree would be unusable.

> Any help or feedback is entirely welcome and needed.  This data in currently
> in at least 1 stream and failing my cloud desktop sync application.

You'll have to write your own XML handling routines which work only with
the codepage the XML is in. And be prepared that they will fail as soon
as the encoding of the XML changes.

> I would really love an option to disable XML byte for byte checking during
> parsing.

That's not likely to happen.

Michael.


Re: [fpc-devel] XML Components

2012-11-02 Thread Andrew Brunner


On Nov 2, 2012, at 7:24 AM, Michael Van Canneyt mich...@freepascal.org wrote:

> On Thu, 1 Nov 2012, Andrew Brunner wrote:
>
>> I'm having a problem getting the XML parser to read.
>>
>> Is there any way I can get the attached program to work by changing a
>> parsing option to one less strict.  My XML documents get over 1-2 GBs since
>> they represent files.  So having to convert /scan each byte is unacceptable.
>
> I suggest you revert to something else than XML, if that's an option.
> XML is notoriously slow to load.

I don't know if at this point I am able to switch.  It's not practical. I could
just grab PFC XML components and derive something outside FPC project scope.

>> Is there another XML parser component that can establish a DOM?  Or is this
>> a bug in the fpc XML component?
>
> This is not a bug, it is prescribed behaviour.

The function AnsiToUtf8 is supposed to convert data to UTF-8.  So the string in
the sample should have the proper UTF-8 encoding.  And the parser should be able
to read it.

In the past, I was able to parse ANSI strings, but only after converting to
UTF-8.  But the attached program fails 100% of the time.

> The XML components must work on any XML document that exists out there.
> As a consequence, the codepage in the XML must be checked and converted if
> need be.

The input data in the attached example is converted.

> Imagine you have a XML file encoded in UTF16, and we assume it's UTF-8. The
> resulting DOM tree would be unusable.

True.

>> Any help or feedback is entirely welcome and needed.  This data in currently
>> in at least 1 stream and failing my cloud desktop sync application.
>
> You'll have to write your own XML handling routines which work only with the
> codepage the XML is in. And be prepared that they will fail as soon as the
> encoding of the XML changes.

Right.  But converting the data to, say, UTF-8 should have worked.  I have
explicitly set the encoding to UTF-8 in the header.

>> I would really love an option to disable XML byte for byte checking during
>> parsing.

I think it would be a good solution and even prove faster in controlled
environments.  Plus all data is stored as widestrings in the DOM.

The first question I have is: if there were such an option, would the patch be
accepted?

The next question is: what is the problem with the UTF-8 routine that it left the
offending byte sequence intact without converting the bytes in my sample data?




Re: [fpc-devel] XML Components

2012-11-02 Thread Michael Van Canneyt



On Fri, 2 Nov 2012, Andrew Brunner wrote:

>> As a consequence, the codepage in the XML must be checked and converted if
>> need be.
>
> The input data in the example attached is converted.

There is no attachment to your mail.

>> Imagine you have a XML file encoded in UTF16, and we assume it's UTF-8. The
>> resulting DOM tree would be unusable.
>
> True.
>
>>> Any help or feedback is entirely welcome and needed.  This data in currently in
>>> at least 1 stream and failing my cloud desktop sync application.
>>
>> You'll have to write your own XML handling routines which work only with the
>> codepage the XML is in. And be prepared that they will fail as soon as the
>> encoding of the XML changes.
>
> Right.  But converting the data to say UTF8 should have worked.  I have
> explicitly set the encoding to UTF8 in the header.

Without looking at the data and the errors you get, it's impossible to say
anything useful.

>>> I would really love an option to disable XML byte for byte checking during
>>> parsing.
>
> I think it would be a good solution and even prove faster in controlled
> environments.  Plus all data is stored as widestrings in the DOM.
>
> The first question I have is if there was such an option would the patch be
> accepted.

I don't see how you can fix the problem. If the input is UTF-8, and the result must be converted
to a widestring for the DOM, then a conversion MUST take place; there is no way to avoid it.

And a conversion means scanning the input byte for byte.

In any case, the input must be scanned byte for byte anyway, to detect all the tags.
That's what makes XML slow and unusable for large amounts of data.

> The next question is what is the problem with the uf8 routine that it left the
> offending byte sequence intact without converting the bytes in my sample data?

Without an error message, it is impossible to tell.

Michael.


Re: [fpc-devel] XML Components

2012-11-02 Thread Andrew Brunner

On 11/02/2012 08:08 AM, Michael Van Canneyt wrote:

> There is no attachment to your mail.

The attachment was in my first posting.  But just in case, I've attached
it again.

Please feel free to check it out.

The example is stripped of most of the XML code that was successfully
parsed.  This was from about a 5 MB stream.  I grabbed the XML code
surrounding the error position.

Thanks Michael.

program unknown;

{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}{$IFDEF UseCThreads}
  cthreads,
  {$ENDIF}{$ENDIF}
  Classes,DOM, XMLRead;

  procedure Test();
  var
    FXMLParser   : TDOMParser;
    FXMLDocument : TXMLDocument;
    FXMLSource   : TXMLInputSource;
    sData        : String;
  begin
    sData:=Concat(
      '<?xml version="1.0" encoding="'+
      'UTF-8',  // native platform is LATIN1
      '"?>',
      '<value>',
      '<![CDATA['#27'$B0u testt]]>',  // literal ESC (#27) in front of "$B"
      '</value>'
    );
    sData:=System.AnsiToUtf8(sData);
    FXMLParser:=TDOMParser.Create();
    Try
      FXMLSource:=TXMLInputSource.Create(sData);
      Try
        FXMLParser.Parse(FXMLSource,FXMLDocument);
        Try
        Finally
          FXMLDocument.Free();
        end;
      Finally
        FXMLSource.Free();
      end;
    Finally
      FXMLParser.Free();
    end;
  end;

begin
  Test();
end.



Re: [fpc-devel] XML Components

2012-11-02 Thread Mattias Gaertner

Andrew Brunner atbrun...@aurawin.com wrote on 2 November 2012 at 13:59:

> On Nov 2, 2012, at 7:24 AM, Michael Van Canneyt mich...@freepascal.org
> wrote:
>
>> On Thu, 1 Nov 2012, Andrew Brunner wrote:
>>
>>> I'm having a problem getting the XML parser to read.
>>>
>>> Is there any way I can get the attached program to work by changing a
>>> parsing option to one less strict. My XML documents get over 1-2 GBs since
>>> they represent files. So having to convert /scan each byte is unacceptable.
>>
>> I suggest you revert to something else than XML, if that's an option.
>> XML is notoriously slow to load.
>
> I don't know if at this point I am able to switch. It's not practical. I could
> just grab PFC XML components and derive something outside FPC project scope.

Have you tried the XML units in Lazarus? They are 99% the FPC units, just ported
to use UTF-8 strings instead of widestrings.

Mattias


Re: [fpc-devel] XML Components

2012-11-02 Thread Sergei Gorelkin

02.11.2012 17:08, Michael Van Canneyt wrote:

> On Fri, 2 Nov 2012, Andrew Brunner wrote:
>
>> I think it would be a good solution and even prove faster in controlled
>> environments.  Plus all data is stored as widestrings in the DOM.
>>
>> The first question I have is if there was such an option would the patch be
>> accepted.
>
> I don't see how you can fix the problem. If the input is UTF8, and the result
> must be converted to a widestring for the DOM, then a conversion MUST take
> place, there is no way to avoid it.
> And a conversion means scanning the input byte for byte.
>
> In each case, the input must be scanned byte for byte anyway, to detect all the
> tags. That's what makes XML slow and unusable for large amount of data.
>
>> The next question is what is the problem with the uf8 routine that it left the
>> offending byte sequence intact without converting the bytes in my sample data?
>
> Without error message, it is impossible to tell.

In this case, the issue is not the encoding, but a literal ESC (#27) code used in the data. The XML
specification does not allow code points below 32, except TAB, CR and LF, to appear in data, in
either literal or escaped form.
In other words, XML is the wrong technology for working with binary data, unless the data is encoded
into a textual form (Base64 or the like).
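
A minimal sketch of the Base64 route mentioned above, assuming the `base64` unit that ships with FPC (EncodeStringBase64 and DecodeStringBase64). The payload and the element name are illustrative only:

```pascal
program Base64Payload;

{$mode objfpc}{$H+}

uses
  base64;

var
  Raw, Encoded: string;
begin
  // Illustrative payload containing a literal ESC (#27).
  Raw := #27'$B0u test';
  Encoded := EncodeStringBase64(Raw);
  // Encoded consists only of A-Z, a-z, 0-9, '+', '/' and '=',
  // all of which are legal XML characters.
  WriteLn('<value>', Encoded, '</value>');
  // Decoding gives back the original bytes, including the ESC.
  WriteLn(DecodeStringBase64(Encoded) = Raw);
end.
```

The reader of the document has to know that the element holds Base64 and decode it after parsing; the parser itself only ever sees plain ASCII text.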


Regards,
Sergei


Re: [fpc-devel] XML Components

2012-11-02 Thread Andrew Brunner


On Nov 2, 2012, at 8:32 AM, Sergei Gorelkin sergei_gorel...@mail.ru wrote:

> In this case, the issue is not encoding, but literal ESC (#27) code used in
> data. XML specification does not allow codepoints below 32, except TAB,CR and
> LF, to appear in data, both in literal and escaped forms.
> In other words, XML is wrong technology to work with binary data, unless it
> is encoded into textual form (Base64 or alike).

OK.  The data comes from a summary function that grabs a few pieces of an email
message in this case: the subject and the top 2 lines of the message.  Email is text,
so memory corruption would most likely be the source of any low-order
bytes like 0.

But actual file data, when streamed, is sent via MIME.


Re: [fpc-devel] XML Components

2012-11-02 Thread Mattias Gaertner

Sergei Gorelkin sergei_gorel...@mail.ru wrote on 2 November 2012 at 14:32:

> 02.11.2012 17:08, Michael Van Canneyt wrote:
>
>> On Fri, 2 Nov 2012, Andrew Brunner wrote:
>>
>>> I think it would be a good solution and even prove faster in controlled
>>> environments. Plus all data is stored as widestrings in the DOM.
>>>
>>> The first question I have is if there was such an option would the patch be
>>> accepted.
>>
>> I don't see how you can fix the problem. If the input is UTF8, and the
>> result must be converted to a widestring for the DOM, then a conversion
>> MUST take place, there is no way to avoid it.
>> And a conversion means scanning the input byte for byte.
>>
>> In each case, the input must be scanned byte for byte anyway, to detect all
>> the tags. That's what makes XML slow and unusable for large amount of data.
>>
>>> The next question is what is the problem with the uf8 routine that it left
>>> the offending byte sequence intact without converting the bytes in my
>>> sample data?
>>
>> Without error message, it is impossible to tell.
>
> In this case, the issue is not encoding, but literal ESC (#27) code used in
> data. XML specification does not allow codepoints below 32, except TAB,CR
> and LF, to appear in data, both in literal and escaped forms.

Actually the specification only defines legal characters and that processors
must accept them.
It does not say what to do with the other characters.

> In other words, XML is wrong technology to work with binary data, unless it is
> encoded into textual form (Base64 or alike).

True.

Mattias


Re: [fpc-devel] XML Components

2012-11-02 Thread Sergei Gorelkin

02.11.2012 17:44, Mattias Gaertner wrote:

> Sergei Gorelkin sergei_gorel...@mail.ru wrote on 2 November 2012 at 14:32:
>
>> In this case, the issue is not encoding, but literal ESC (#27) code used in
>> data. XML specification does not allow codepoints below 32, except TAB,CR
>> and LF, to appear in data, both in literal and escaped forms.
>
> Actually the specification only defines legal characters and that processors
> must accept them.
> It does not say what to do with the other characters.

Besides the specification, there is a test suite containing lots of tests with illegal characters,
and it expects them all to fail.


Regards,
Sergei


Re: [fpc-devel] XML Components

2012-11-02 Thread Andrew Brunner

So where in the specs does it say that parsers must reject certain byte
sequences between CDATA tags, excepting XML tags?

If this is supported by the specs, it would help shape a viable solution.

On Nov 2, 2012, at 9:01 AM, Sergei Gorelkin sergei_gorel...@mail.ru wrote:

> 02.11.2012 17:44, Mattias Gaertner wrote:
>
>> Sergei Gorelkin sergei_gorel...@mail.ru wrote on 2 November 2012 at 14:32:
>>
>>> In this case, the issue is not encoding, but literal ESC (#27) code used in
>>> data. XML specification does not allow codepoints below 32, except TAB,CR
>>> and LF, to appear in data, both in literal and escaped forms.
>>
>> Actually the specification only defines legal characters and that processors
>> must accept them.
>> It does not say what to do with the other characters.
>
> Besides specification, there is a test suite containing lots of tests with
> illegal characters and expecting them all to fail.
>
> Regards,
> Sergei


Re: [fpc-devel] XML Components

2012-11-02 Thread Michael Van Canneyt



On Fri, 2 Nov 2012, Andrew Brunner wrote:

> So where in the specs does it say that parsers must reject certain byte
> sequences between cdata tags excepting XML tags.
>
> If this is supported by specs it would help shape a viable solution.

Where did you get that it is supported?

The specs list the allowed characters. Section 2.2:

[Definition: A character is an atomic unit of text as specified by ISO/IEC 10646:2000 [ISO/IEC 10646].
Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.

Therefore, any character not in the list is not a legal character and should be
rejected.
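
As a sketch of what "the list" amounts to in code: the function below checks a code point against the Char production of XML 1.0, section 2.2 (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]). The name IsXml10Char is illustrative, and this is not how fcl-xml implements the check:

```pascal
program Xml10Char;

{$mode objfpc}{$H+}

// True when code point CP matches the Char production of XML 1.0.
function IsXml10Char(CP: Cardinal): Boolean;
begin
  Result := (CP = $9) or (CP = $A) or (CP = $D) or
            ((CP >= $20) and (CP <= $D7FF)) or
            ((CP >= $E000) and (CP <= $FFFD)) or
            ((CP >= $10000) and (CP <= $10FFFF));
end;

begin
  WriteLn(IsXml10Char(9));    // TAB is a legal character
  WriteLn(IsXml10Char(27));   // ESC is not
  WriteLn(IsXml10Char($41));  // 'A' is legal
end.
```

Note that the check works on decoded code points, not on the raw bytes of the input.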

Speaking from painful experience: Relaxing this and silently ignoring these illegal 
characters will at some point lead to problems when you encounter a system that 
enforces the rules more strictly.


You will then wonder how it can be that the XML is considered invalid when your own 
XML code handles it so nicely. Whereas now, you already know.


To show that this is not just idle talk: I ran your XML through the MS-XML parser, 
and it complained just as well about an illegal character in the input.


Forewarned is forearmed.

Michael.


Re: [fpc-devel] XML Components

2012-11-02 Thread Jeppe Græsdal Johansen

On 02-11-2012 14:32, Sergei Gorelkin wrote:

> 02.11.2012 17:08, Michael Van Canneyt wrote:
>
>> On Fri, 2 Nov 2012, Andrew Brunner wrote:
>>
>>> I think it would be a good solution and even prove faster in
>>> controlled environments.  Plus all data is stored as widestrings in the DOM.
>>>
>>> The first question I have is if there was such an option would the
>>> patch be accepted.
>>
>> I don't see how you can fix the problem. If the input is UTF8, and
>> the result must be converted to a widestring for the DOM, then a
>> conversion MUST take place, there is no way to avoid it.
>> And a conversion means scanning the input byte for byte.
>>
>> In each case, the input must be scanned byte for byte anyway, to
>> detect all the tags. That's what makes XML slow and unusable for large
>> amount of data.
>>
>>> The next question is what is the problem with the uf8 routine that
>>> it left the offending byte sequence intact without converting the bytes
>>> in my sample data?
>>
>> Without error message, it is impossible to tell.
>
> In this case, the issue is not encoding, but literal ESC (#27) code
> used in data. XML specification does not allow codepoints below 32,
> except TAB,CR and LF, to appear in data, both in literal and escaped
> forms.
> In other words, XML is wrong technology to work with binary data,
> unless it is encoded into textual form (Base64 or alike).
>
> Regards,
> Sergei

XML 1.1 allows anything down to #1, but the current parser doesn't seem
to allow that. I guess that should solve most of the problems here.

Specifically, TXMLDecodingSource.SkipUntil doesn't allow #1..#31 if
FXML11Rules is true, which I think it should.



Re: [fpc-devel] XML Components

2012-11-02 Thread Michael Van Canneyt



On Fri, 2 Nov 2012, Jeppe Græsdal Johansen wrote:

>> and LF, to appear in data, both in literal and escaped forms.
>> In other words, XML is wrong technology to work with binary data, unless it
>> is encoded into textual form (Base64 or alike).
>>
>> Regards,
>> Sergei
>
> XML 1.1 allows anything down to #1, but the current parser doesn't seem to
> allow that. I guess that should solve most of the problems here.
>
> Specifically, TXMLDecodingSource.SkipUntil doesn't allow #1..#31 if
> FXML11Rules is true, which I think it should.


But the document prolog specified XML version 1.0,
so these characters are not allowed.

Of course, if Andrew creates the XML himself, he can specify
1.1 as the XML version; that may well be a solution.

Note that the specs of 1.1 still say that

The characters defined in the following ranges are also discouraged.
They are either control characters or permanently undefined Unicode characters:

[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].

So it would be wise to replace the characters with encoded data.

Michael.


Re: [fpc-devel] XML Components

2012-11-02 Thread Jeppe Græsdal Johansen

On 02-11-2012 18:04, Michael Van Canneyt wrote:

> On Fri, 2 Nov 2012, Jeppe Græsdal Johansen wrote:
>
>>> and LF, to appear in data, both in literal and escaped forms.
>>> In other words, XML is wrong technology to work with binary data,
>>> unless it is encoded into textual form (Base64 or alike).
>>>
>>> Regards,
>>> Sergei
>>
>> XML 1.1 allows anything down to #1, but the current parser doesn't
>> seem to allow that. I guess that should solve most of the problems here.
>>
>> Specifically, TXMLDecodingSource.SkipUntil doesn't allow #1..#31 if
>> FXML11Rules is true, which I think it should.
>
> But the document prolog specified XML version 1.0,
> so these characters are not allowed.
>
> If of course Andrew creates the XML himself, he can specify 1.1 as the
> XML version, then that may well be a solution.

Yes, but changing the version still generates the same error, even
though it shouldn't.

> Note that the specs of 1.1 still say that
>
> The characters defined in the following ranges are also discouraged.
> They are either control characters or permanently undefined Unicode
> characters:
>
> [#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F],
> [#xFDD0-#xFDDF],
> [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
> [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
> [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
> [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
> [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
> [#x10FFFE-#x10FFFF].
>
> So it would be wise to replace the characters with encoded data.
>
> Michael.

It's true that it's not a good idea, but it should work at least :)


Re: [fpc-devel] XML Components

2012-11-02 Thread Michael Van Canneyt



On Fri, 2 Nov 2012, Jeppe Græsdal Johansen wrote:

> On 02-11-2012 18:04, Michael Van Canneyt wrote:
>
>> On Fri, 2 Nov 2012, Jeppe Græsdal Johansen wrote:
>>
>>>> and LF, to appear in data, both in literal and escaped forms.
>>>> In other words, XML is wrong technology to work with binary data, unless
>>>> it is encoded into textual form (Base64 or alike).
>>>>
>>>> Regards,
>>>> Sergei
>>>
>>> XML 1.1 allows anything down to #1, but the current parser doesn't seem to
>>> allow that. I guess that should solve most of the problems here.
>>>
>>> Specifically, TXMLDecodingSource.SkipUntil doesn't allow #1..#31 if
>>> FXML11Rules is true, which I think it should.
>>
>> But the document prolog specified XML version 1.0,
>> so these characters are not allowed.
>>
>> If of course Andrew creates the XML himself, he can specify 1.1 as the XML
>> version, then that may well be a solution.
>
> Yes, but changing the version still generates the same error, even though it
> shouldn't

That should be fixed of course, no argument there :-)

Michael.


Re: [fpc-devel] XML Components

2012-11-02 Thread Sergei Gorelkin

02.11.2012 21:06, Jeppe Græsdal Johansen wrote:

> On 02-11-2012 18:04, Michael Van Canneyt wrote:
>
>> On Fri, 2 Nov 2012, Jeppe Græsdal Johansen wrote:
>>
>>>> and LF, to appear in data, both in literal and escaped forms.
>>>> In other words, XML is wrong technology to work with binary data, unless
>>>> it is encoded into textual form (Base64 or alike).
>>>>
>>>> Regards,
>>>> Sergei
>>>
>>> XML 1.1 allows anything down to #1, but the current parser doesn't seem to
>>> allow that. I guess that should solve most of the problems here.
>>>
>>> Specifically, TXMLDecodingSource.SkipUntil doesn't allow #1..#31 if
>>> FXML11Rules is true, which I think it should.
>>
>> But the document prolog specified XML version 1.0,
>> so these characters are not allowed.
>>
>> If of course Andrew creates the XML himself, he can specify 1.1 as the XML
>> version, then that may well be a solution.
>
> Yes, but changing the version still generates the same error, even though it
> shouldn't



To my knowledge, XML 1.1 supports codes below 32 only in escaped form. &#27; is valid in XML 1.1
but not in XML 1.0.


Regards,
Sergei





Re: [fpc-devel] XML Components

2012-11-02 Thread Sergei Gorelkin

02.11.2012 19:57, Andrew Brunner wrote:

> So where in the specs does it say that parsers must reject certain byte
> sequences between cdata tags excepting XML tags.
>
> If this is supported by specs it would help shape a viable solution.

This is not supported. Encoding processing, line-feed normalization and invalid-character rejection
happen before any attempt to detect markup. This is necessary to support encodings like
ISO-2022-JP, which uses ESC sequences to switch the meaning of subsequent characters.
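
A minimal sketch of that ordering, assuming single-byte or UTF-8 input held in a plain string. The name FirstIllegalPos is illustrative and this is not the fcl-xml code. The scan rejects forbidden bytes without looking at markup at all, which is why wrapping the data in CDATA makes no difference:

```pascal
program RejectBeforeMarkup;

{$mode objfpc}{$H+}

// Returns the 1-based position of the first byte that XML 1.0 forbids
// (below 32 and not TAB, LF or CR), or 0 if there is none.
// Markup is deliberately not examined: CDATA gets no special treatment.
function FirstIllegalPos(const S: string): Integer;
var
  i: Integer;
begin
  for i := 1 to Length(S) do
    if (Ord(S[i]) < 32) and not (S[i] in [#9, #10, #13]) then
      Exit(i);
  Result := 0;
end;

begin
  // Clean input: nothing to reject.
  WriteLn(FirstIllegalPos('<value><![CDATA[plain text]]></value>'));
  // The ESC inside the CDATA section is reported like any other illegal byte.
  WriteLn(FirstIllegalPos('<value><![CDATA[' + #27 + '$B0u]]></value>'));
end.
```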


CDATA is, in fact, even more restricted than text content. Outside of CDATA, a
character that is unrepresentable in the target encoding can be ampersand-escaped.
Within CDATA this cannot be done.
CDATA is intended only to handle '<' and '&' as plaintext, nothing more.

Regards,
Sergei


Re: [fpc-devel] XML Components

2012-11-02 Thread Jeppe Græsdal Johansen

On 02-11-2012 18:19, Sergei Gorelkin wrote:

> 02.11.2012 21:06, Jeppe Græsdal Johansen wrote:
>
>> On 02-11-2012 18:04, Michael Van Canneyt wrote:
>>
>>> On Fri, 2 Nov 2012, Jeppe Græsdal Johansen wrote:
>>>
>>>>> and LF, to appear in data, both in literal and escaped forms.
>>>>> In other words, XML is wrong technology to work with binary data,
>>>>> unless it is encoded into textual form (Base64 or alike).
>>>>>
>>>>> Regards,
>>>>> Sergei
>>>>
>>>> XML 1.1 allows anything down to #1, but the current parser doesn't
>>>> seem to allow that. I guess that should solve most of the problems here.
>>>>
>>>> Specifically, TXMLDecodingSource.SkipUntil doesn't allow #1..#31 if
>>>> FXML11Rules is true, which I think it should.
>>>
>>> But the document prolog specified XML version 1.0,
>>> so these characters are not allowed.
>>>
>>> If of course Andrew creates the XML himself, he can specify 1.1 as
>>> the XML version, then that may well be a solution.
>>
>> Yes, but changing the version still generates the same error, even
>> though it shouldn't



> To my knowledge, XML 1.1 supports codes below 32 only in escaped
> forms. &#27; is valid in XML 1.1 but not in XML 1.0.
>
> Regards,
> Sergei

According to the spec, CDATA should allow them unescaped.

[2]   Char   ::=   [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

[20]   CData   ::=   (Char* - (Char* ']]>' Char*))


Re: [fpc-devel] XML Components

2012-11-02 Thread Sergei Gorelkin

02.11.2012 21:22, Jeppe Græsdal Johansen wrote:

> On 02-11-2012 18:19, Sergei Gorelkin wrote:
>
>> 02.11.2012 21:06, Jeppe Græsdal Johansen wrote:
>>
>>> On 02-11-2012 18:04, Michael Van Canneyt wrote:
>>>
>>>> On Fri, 2 Nov 2012, Jeppe Græsdal Johansen wrote:
>>>>
>>>>>> and LF, to appear in data, both in literal and escaped forms.
>>>>>> In other words, XML is wrong technology to work with binary data,
>>>>>> unless it is encoded into textual form (Base64 or alike).
>>>>>>
>>>>>> Regards,
>>>>>> Sergei
>>>>>
>>>>> XML 1.1 allows anything down to #1, but the current parser doesn't seem
>>>>> to allow that. I guess that should solve most of the problems here.
>>>>>
>>>>> Specifically, TXMLDecodingSource.SkipUntil doesn't allow #1..#31 if
>>>>> FXML11Rules is true, which I think it should.
>>>>
>>>> But the document prolog specified XML version 1.0,
>>>> so these characters are not allowed.
>>>>
>>>> If of course Andrew creates the XML himself, he can specify 1.1 as the
>>>> XML version, then that may well be a solution.
>>>
>>> Yes, but changing the version still generates the same error, even though
>>> it shouldn't



>> To my knowledge, XML 1.1 supports codes below 32 only in escaped forms. &#27;
>> is valid in XML 1.1 but not in XML 1.0.
>>
>> Regards,
>> Sergei
>
> According to the spec, cdata should allow them unescaped.
>
> [2]   Char   ::=   [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
>
> [20]   CData   ::=   (Char* - (Char* ']]>' Char*))

I see now, they were allowed in the second edition of the specs, while the first edition did not allow
them. The current fcl-xml implementation corresponds to the first edition. So it's indeed time to update.


Regards,
Sergei


Re: [fpc-devel] XML Components

2012-11-02 Thread waldo kitty

On 11/2/2012 09:32, Sergei Gorelkin wrote:

> In other words, XML is wrong technology to work with binary data, unless it is
> encoded into textual form (Base64 or alike).

Encoding into textual form increases the size of the stream by at least
1/3... a 3M file will be a 4M stream when encoded...
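
The arithmetic behind that figure, as a small sketch assuming padded Base64 without line breaks (line breaks, as used in MIME, add a little more). The function name Base64Length is illustrative:

```pascal
program Base64Size;

{$mode objfpc}{$H+}

// Length of the padded Base64 encoding of N input bytes, without line breaks:
// every group of up to 3 input bytes becomes 4 output characters.
function Base64Length(N: Int64): Int64;
begin
  Result := ((N + 2) div 3) * 4;
end;

begin
  // 3 MiB of input (3145728 bytes) becomes 4194304 characters, i.e. 4 MiB.
  WriteLn(Base64Length(3 * 1024 * 1024));
end.
```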




Re: [fpc-devel] XML Components

2012-11-02 Thread Andrew Brunner

On Nov 2, 2012, at 6:39 PM, waldo kitty wkitt...@windstream.net wrote:

> On 11/2/2012 09:32, Sergei Gorelkin wrote:
>> In other words, XML is wrong technology to work with binary data, unless it
>> is encoded into textual form (Base64 or alike).
>
> encoding into textual form one increases the size of the stream by at least
> 1/3rd... a 3M file will be a 4M stream

That's all true, but binary transfer is not doable on some mail servers.

My server uses deflate stream compression.  I have multicore servers.

My larger problem is the datagram values containing strings that fail.

So even with the binary data encoded and inflated, I would still have to parse each byte.
To me that is the problem.


Andrew Brunner
Aurawin LLC
512.574.6298

A safe new way to store and share your files, pictures, videos and more. 

http://aurawin.com



[fpc-devel] XML Components

2012-11-01 Thread Andrew Brunner

I'm having a problem getting the XML parser to read.

Is there any way I can get the attached program to work by changing a 
parsing option to one less strict.  My XML documents get over 1-2 GBs 
since they represent files.  So having to convert /scan each byte is 
unacceptable.


Is there another XML parser component that can establish a DOM?  Or is 
this a bug in the fpc XML component?


Any help or feedback is entirely welcome and needed.  This data is
currently in at least 1 stream and is failing my cloud desktop sync
application.


I would really love an option to disable XML byte for byte checking 
during parsing.



--
Andrew Brunner

Aurawin LLC
512.574.6298
http://aurawin.com/

Aurawin is a great new way to store, share, and enjoy your
photos, videos, music and more.

program invalid;

{$mode objfpc}{$H+}

uses
  Classes, DOM, XMLRead;

var
  FXMLParser   : TDOMParser;
  FXMLDocument : TXMLDocument;
  FXMLSource   : TXMLInputSource;
  sData        : String;
begin
  sData:=Concat(
    '<?xml version="1.0" encoding="'+
    'UTF-8',  // native platform is LATIN1
    '"?>',
    '<value>',
    '<![CDATA['#27'$B0u testt]]>',  // literal ESC (#27) in front of "$B"
    '</value>'
  );
  sData:=System.AnsiToUtf8(sData);
  FXMLParser:=TDOMParser.Create();
  Try
    FXMLSource:=TXMLInputSource.Create(sData);
    Try
      FXMLParser.Parse(FXMLSource,FXMLDocument);
      Try
      Finally
        FXMLDocument.Free();
      end;
    Finally
      FXMLSource.Free();
    end;
  Finally
    FXMLParser.Free();
  end;
end.
