Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-09 Thread Michael Schnell

On 04/08/2013 07:02 PM, Mattias Gaertner wrote:
> I guess, you mean encoded string types.

AFAIK, you can just create string variables of the appropriate encoding
type, and an assignment will do the auto-conversion.


-Michael
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-09 Thread Mattias Gaertner
On Tue, 09 Apr 2013 08:24:11 +0200
Michael Schnell mschn...@lumino.de wrote:

> On 04/08/2013 07:02 PM, Mattias Gaertner wrote:
>> I guess, you mean encoded string types.
>
> AFAIK, you can just create string variables of the appropriate coding
> type and an assignment will do auto-conversion.

Yes.
But how do you examine the characters?
If I understand Michael right, there will be some implicit functions
for that. I wonder how they work.


Mattias


Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-09 Thread Michael Van Canneyt



On Tue, 9 Apr 2013, Mattias Gaertner wrote:

> On Tue, 09 Apr 2013 08:24:11 +0200
> Michael Schnell mschn...@lumino.de wrote:
>
>> On 04/08/2013 07:02 PM, Mattias Gaertner wrote:
>>> I guess, you mean encoded string types.
>>
>> AFAIK, you can just create string variables of the appropriate coding
>> type and an assignment will do auto-conversion.
>
> Yes.
> But how do you examine the characters?
> If I understand Michael right, there will be some implicit functions
> for that. I wonder how they work.


See the character unit:

  // flat functions
  function ConvertFromUtf32(AChar : UCS4Char) : UnicodeString;
  function ConvertToUtf32(const AString : UnicodeString; AIndex : Integer) : UCS4Char; overload;
  function ConvertToUtf32(const AString : UnicodeString; AIndex : Integer; out ACharLength : Integer) : UCS4Char; overload;
  function ConvertToUtf32(const AHighSurrogate, ALowSurrogate : UnicodeChar) : UCS4Char; overload;
  function GetNumericValue(AChar : UnicodeChar) : Double; overload;
  function GetNumericValue(const AString : UnicodeString; AIndex : Integer) : Double; overload;
  function GetUnicodeCategory(AChar : UnicodeChar) : TUnicodeCategory; overload;
  function GetUnicodeCategory(const AString : UnicodeString; AIndex : Integer) : TUnicodeCategory; overload;
  function IsControl(AChar : UnicodeChar) : Boolean; overload;
  function IsControl(const AString : UnicodeString; AIndex : Integer) : Boolean; overload;
  function IsDigit(AChar : UnicodeChar) : Boolean; overload;
  function IsDigit(const AString : UnicodeString; AIndex : Integer) : Boolean; overload;
  function IsSurrogate(AChar : UnicodeChar) : Boolean; overload;
  function IsSurrogate(const AString : UnicodeString; AIndex : Integer) : Boolean; overload;
  function IsHighSurrogate(AChar : UnicodeChar) : Boolean; overload;
  function IsHighSurrogate(const AString : UnicodeString; AIndex : Integer) : Boolean; overload;
  function IsLowSurrogate(AChar : UnicodeChar) : Boolean; overload;
  function IsLowSurrogate(const AString : UnicodeString; AIndex : Integer) : Boolean; overload;
  function IsSurrogatePair(const AHighSurrogate, ALowSurrogate : UnicodeChar) : Boolean; overload;
  function IsSurrogatePair(const AString : UnicodeString; AIndex : Integer) : Boolean; overload;
  function IsLetter(AChar : UnicodeChar) : Boolean; overload;
  function IsLetter(const AString : UnicodeString; AIndex : Integer) : Boolean; overload;
  function IsLetterOrDigit(AChar : UnicodeChar) : Boolean; overload;
  function IsLetterOrDigit(const AString : UnicodeString; AIndex : Integer) : Boolean; overload;
  function IsLower(AChar : UnicodeChar) : Boolean; overload;
  function IsLower(const AString : UnicodeString; AIndex : Integer) : Boolean; overload;
  function IsNumber(AChar : UnicodeChar) : Boolean; overload;
  function IsNumber(const AString : UnicodeString; AIndex : Integer) : Boolean; overload;
  function IsPunctuation(AChar : UnicodeChar) : Boolean; overload;
  function IsPunctuation(const AString : UnicodeString; AIndex : Integer) : Boolean; overload;
  function IsSeparator(AChar : UnicodeChar) : Boolean; overload;
  function IsSeparator(const AString : UnicodeString; AIndex : Integer) : Boolean; overload;
  function IsSymbol(AChar : UnicodeChar) : Boolean; overload;
  function IsSymbol(const AString : UnicodeString; AIndex : Integer) : Boolean; overload;
  function IsUpper(AChar : UnicodeChar) : Boolean; overload;
  function IsUpper(const AString : UnicodeString; AIndex : Integer) : Boolean; overload;
  function IsWhiteSpace(AChar : UnicodeChar) : Boolean; overload;
  function IsWhiteSpace(const AString : UnicodeString; AIndex : Integer) : Boolean; overload;
  function ToLower(AChar : UnicodeChar) : UnicodeChar; overload;
  function ToLower(const AString : UnicodeString) : UnicodeString; overload;
  function ToUpper(AChar : UnicodeChar) : UnicodeChar; overload;
  function ToUpper(const AString : UnicodeString) : UnicodeString; overload;
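For readers wondering what the surrogate-related declarations above have to do, here is a minimal standalone sketch of the usual UTF-16 arithmetic. It does not use the character unit and is not its implementation; the names IsHighSurrogateUnit, IsLowSurrogateUnit, SurrogatesToCodePoint and CodePointAt are hypothetical, chosen only for this illustration.

```pascal
program SurrogateSketch;
{$mode objfpc}{$H+}

{ True for a UTF-16 code unit in the high (leading) surrogate range. }
function IsHighSurrogateUnit(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $D800) and (Ord(C) <= $DBFF);
end;

{ True for a UTF-16 code unit in the low (trailing) surrogate range. }
function IsLowSurrogateUnit(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $DC00) and (Ord(C) <= $DFFF);
end;

{ Combines a high/low surrogate pair into one code point above $FFFF.
  The caller is expected to have checked both ranges first. }
function SurrogatesToCodePoint(AHigh, ALow: WideChar): Cardinal;
begin
  Result := $10000
          + ((Cardinal(Ord(AHigh)) - $D800) shl 10)
          + (Cardinal(Ord(ALow)) - $DC00);
end;

{ Code point that starts at S[Index]. UnitCount receives 2 when a valid
  surrogate pair was consumed and 1 otherwise. }
function CodePointAt(const S: UnicodeString; Index: SizeInt;
  out UnitCount: Integer): Cardinal;
begin
  UnitCount := 1;
  Result := Ord(S[Index]);
  if IsHighSurrogateUnit(S[Index]) and (Index < Length(S))
     and IsLowSurrogateUnit(S[Index + 1]) then
  begin
    Result := SurrogatesToCodePoint(S[Index], S[Index + 1]);
    UnitCount := 2;
  end;
end;

var
  S: UnicodeString;
  CP: Cardinal;
  Units: Integer;
begin
  { 'A' followed by U+1F600 encoded as the pair D83D DE00 }
  SetLength(S, 3);
  S[1] := WideChar($0041);
  S[2] := WideChar($D83D);
  S[3] := WideChar($DE00);
  CP := CodePointAt(S, 2, Units);
  WriteLn(CP, ' ', Units); // 128512 2  (128512 = $1F600)
end.
```

An unpaired surrogate is returned unchanged with a unit count of 1 in this sketch; a real API would have to decide whether to report that as an error.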


Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-09 Thread Mattias Gaertner
On Tue, 9 Apr 2013 08:55:15 +0200 (CEST)
Michael Van Canneyt mich...@freepascal.org wrote:

 
 
> On Tue, 9 Apr 2013, Mattias Gaertner wrote:
>
>> On Tue, 09 Apr 2013 08:24:11 +0200
>> Michael Schnell mschn...@lumino.de wrote:
>>
>>> On 04/08/2013 07:02 PM, Mattias Gaertner wrote:
>>>> I guess, you mean encoded string types.
>>>
>>> AFAIK, you can just create string variables of the appropriate coding
>>> type and an assignment will do auto-conversion.
>>
>> Yes.
>> But how do you examine the characters?
>> If I understand Michael right, there will be some implicit functions
>> for that. I wonder how they work.
>
> See the character unit:

Nice!

Why do you call them implicit calls?

Will there be UTF-8 functions too or do you have to convert
to UnicodeString?

Will there be PUnicodeChar functions too?

>   // flat functions
> function ConvertFromUtf32(AChar : UCS4Char) : UnicodeString;
> function ConvertToUtf32(const AString : UnicodeString; AIndex : Integer) : UCS4Char; overload;
> function ConvertToUtf32(const AString : UnicodeString; AIndex : Integer; out ACharLength : Integer) : UCS4Char; overload;
> function ConvertToUtf32(const AHighSurrogate, ALowSurrogate : UnicodeChar) : UCS4Char; overload;
> [...]

Mattias


Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-09 Thread Paul Ishenin

09.04.2013 15:13, Mattias Gaertner wrote:

> Will there be UTF-8 functions too or do you have to convert
> to UnicodeString?

At the moment TCharacter contains the methods that Delphi's TCharacter has.
If there is demand, we will add UTF-8 overloads.



> Will there be PUnicodeChar functions too?
>
>   // flat functions
> function ConvertFromUtf32(AChar : UCS4Char) : UnicodeString;
> function ConvertToUtf32(const AString : UnicodeString; AIndex : Integer) : UCS4Char; overload;
> function ConvertToUtf32(const AString : UnicodeString; AIndex : Integer; out ACharLength : Integer) : UCS4Char; overload;
> function ConvertToUtf32(const AHighSurrogate, ALowSurrogate : UnicodeChar) : UCS4Char; overload;


Best regards,
Paul Ishenin




Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-09 Thread Michael Schnell

On 04/09/2013 08:49 AM, Mattias Gaertner wrote:
> But how do you examine the characters?

Even defining what a character is becomes extremely problematic with any use
of Unicode. A printable character can be assembled from several Unicode
code points (there are about 1.1 million of them, U+0000 to U+10FFFF), and a
single code point is represented by 1, 2, 3, or 4 bytes in UTF-8, or by 2 or
4 bytes in UTF-16. For UTF-16, the order of those bytes also depends on the
CPU architecture and/or the file the string is imported from and the way it
is imported.
This of course is not a problem introduced by fpc, but the perfectly
normal complexity of Unicode.


> If I understand Michael right, there will be some implicit functions
> for that. I wonder how they work.


This is what Delphi compatibility dictated. (You might read the Delphi 
XE Docs on how to code Unicode enabled Delphi source.)


I do hope fpc avoids some of the quirks Delphi introduces and offers
some useful additional features, e.g. dedicated string types such as
unencoded (raw, never auto-converted) Byte, Word and DWord strings, and
a flexible encoded string type that inherits the encoding scheme from
the source string in an assignment or when used as a function
parameter, doing auto-conversion whenever dynamically necessary.


-Michael


Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-09 Thread Mattias Gaertner

> Paul Ishenin i...@kmiac.ru wrote on 9 April 2013 at 09:20:
>
> 09.04.2013 15:13, Mattias Gaertner wrote:
>
>> Will there be UTF-8 functions too or do you have to convert
>> to UnicodeString?
>
> At the moment TCharacter contains methods which delphi TCharacter has.
> If there is demand we will add UTF8 overloads.

Demand+=1

>> Will there be PUnicodeChar functions too?

Well?

Mattias


Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-09 Thread Sven Barth

On 09.04.2013 10:30, Mattias Gaertner wrote:

>> Paul Ishenin i...@kmiac.ru wrote on 9 April 2013 at 09:20:
>>
>> 09.04.2013 15:13, Mattias Gaertner wrote:
>>
>>> Will there be UTF-8 functions too or do you have to convert
>>> to UnicodeString?
>>
>> At the moment TCharacter contains methods which delphi TCharacter has.
>> If there is demand we will add UTF8 overloads.
>
> Demand+=1

(1,8) Error: Illegal expression
(1,9) Error: Illegal expression
(1,9) Fatal: Syntax error, ; expected but ordinal const found

(Sorry, had to be said :P )

Regards,
Sven


Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-09 Thread Paul Ishenin

09.04.2013 17:10, Sven Barth wrote:

>> Demand+=1
>
> (1,8) Error: Illegal expression
> (1,9) Error: Illegal expression
> (1,9) Fatal: Syntax error, ; expected but ordinal const found
>
> (Sorry, had to be said :P )


Also, the Patches variable seems to be equal to zero. And assigning Demand
without assigning Patches has almost no effect :)


Best regards,
Paul Ishenin



Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-09 Thread Mattias Gaertner

> Paul Ishenin i...@kmiac.ru wrote on 9 April 2013 at 11:23:
>
> 09.04.2013 17:10, Sven Barth wrote:
>
>>> Demand+=1
>>
>> (1,8) Error: Illegal expression
>> (1,9) Error: Illegal expression
>> (1,9) Fatal: Syntax error, ; expected but ordinal const found
>>
>> (Sorry, had to be said :P )
>
> Also, Patches variable seems to be equal to zero. And assigning Demand
> without assigning Patches has almost no effect :)

Creating a patch is not hard. The lazutf8 unit already contains the code. But I
have no idea what the interface should look like. TCharacter is a Delphi class
and Delphi does not have UTF-8 functions. Michael wrote that these functions are
implicit, so maybe these functions need to fit some form?
In other words:
Are there any suggestions or recommendations for how the UTF-8 functions should
look?

Mattias


Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-09 Thread Paul Ishenin

09.04.2013 18:09, Mattias Gaertner wrote:

> Creating a patch is not hard. The lazutf8 unit already contains the code. But I
> have no idea what the interface should look like. TCharacter is a Delphi class
> and Delphi does not have UTF-8 functions. Michael wrote that these functions are
> implicit, so maybe these functions need to fit some form?
> In other words:
> Are there any suggestions or recommendations for how the UTF-8 functions should
> look?


Let's see.

The next function should stay as is. The compiler will add the required implicit
conversion when you assign the result to a UTF8String variable.

function ConvertFromUtf32(AChar : UCS4Char) : UnicodeString;

Here you can add UTF8String overloads if needed:

function ConvertToUtf32(const AString : UnicodeString; AIndex : Integer) : UCS4Char; overload;
function ConvertToUtf32(const AString : UnicodeString; AIndex : Integer; out ACharLength : Integer) : UCS4Char; overload;

At the same time, even without UTF8 overloads the compiler will insert an
implicit conversion from UTF8String to UnicodeString when you pass it to
those functions. So UTF8 overloads can only increase speed by removing one
implicit conversion.
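A minimal sketch of the two implicit conversions described here, assuming FPC 3.0 or later, where UTF8String is a code-page-aware string type. Utf16UnitCount is a hypothetical stand-in for any routine that takes a UnicodeString, and cwstring is pulled in on Unix only so that a widestring manager is available for the conversion.

```pascal
program ImplicitConvSketch;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif} // widestring manager on Unix (assumption: libc/iconv present)
  SysUtils;

{ Hypothetical routine that only accepts UTF-16. When a UTF8String is
  passed, the compiler inserts the UTF-8 -> UTF-16 conversion at the call. }
function Utf16UnitCount(const S: UnicodeString): SizeInt;
begin
  Result := Length(S);
end;

var
  U: UnicodeString;
  S8: UTF8String;
begin
  U := 'caf';
  U := U + WideChar($00E9); // e with acute accent: one UTF-16 code unit
  S8 := U;                  // implicit UTF-16 -> UTF-8 conversion on assignment
  // Length counts code units: UTF-16 units for U, bytes for S8.
  WriteLn(Length(U), ' ', Length(S8), ' ', Utf16UnitCount(S8));
end.
```

Under those assumptions the program should print 4 5 4: the accented letter is one UTF-16 code unit but two UTF-8 bytes, and passing S8 to Utf16UnitCount converts it back. The compiler may warn about the implicit string conversions, which is a useful reminder that each one costs a copy.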


Best regards,
Paul Ishenin



Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-08 Thread Mattias Gaertner
On Sun, 7 Apr 2013 20:18:51 +0200 (CEST)
Michael Van Canneyt mich...@freepascal.org wrote:

> [...]
> FPC is preparing for a complete unicode solution, with proper language
> support.

Great.
I guess, you mean encoded string types.

But even then, FPC should contain UTF-8 and UTF-16 functions.

> At best, these units are a temporary solution.

You might be right about the UTF-8 classes like TFileStreamUTF8 (I
really hope).
But I doubt that about basic UTF-8/16 functions.

Mattias


Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-08 Thread Michael Van Canneyt



On Mon, 8 Apr 2013, Mattias Gaertner wrote:

> On Sun, 7 Apr 2013 20:18:51 +0200 (CEST)
> Michael Van Canneyt mich...@freepascal.org wrote:
>
>> [...]
>> FPC is preparing for a complete unicode solution, with proper language support.
>
> Great.
> I guess, you mean encoded string types.
>
> But even then, FPC should contain UTF-8 and UTF-16 functions.

Why ? The necessary functionality will be implicit in the various calls.

>> At best, these units are a temporary solution.
>
> You might be right about the UTF-8 classes like TFileStreamUTF8 (I
> really hope).
> But I doubt that about basic UTF-8/16 functions.


Such as ?

Michael.


Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-08 Thread Mattias Gaertner
On Mon, 8 Apr 2013 23:13:17 +0200 (CEST)
Michael Van Canneyt mich...@freepascal.org wrote:

 
 
> On Mon, 8 Apr 2013, Mattias Gaertner wrote:
>
>> On Sun, 7 Apr 2013 20:18:51 +0200 (CEST)
>> Michael Van Canneyt mich...@freepascal.org wrote:
>>
>>> [...]
>>> FPC is preparing for a complete unicode solution, with proper language
>>> support.
>>
>> Great.
>> I guess, you mean encoded string types.
>>
>> But even then, FPC should contain UTF-8 and UTF-16 functions.
>
> Why ? The necessary functionality will be implicit in the various calls.

What calls?

>>> At best, these units are a temporary solution.
>>
>> You might be right about the UTF-8 classes like TFileStreamUTF8 (I
>> really hope).
>> But I doubt that about basic UTF-8/16 functions.
>
> Such as ?

Functions like determining the number of bytes of a UTF-8 character
or checking whether it is a valid character.
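To make the request concrete, here is a minimal sketch of two such functions, written standalone. The names Utf8SeqLen and Utf8CodePointCount are hypothetical and are not taken from lazutf8 or from the FPC RTL. The check is deliberately basic: it looks at the lead byte and the continuation bytes, and does not reject overlong forms or encoded surrogates. RawByteString is used so that no code page conversion happens at the call (this assumes FPC 3.0 or later).

```pascal
program Utf8SeqSketch;
{$mode objfpc}{$H+}

{ Byte length (1..4) of the UTF-8 sequence starting at S[Index], or 0 if
  the lead byte is invalid, the sequence is truncated, or a continuation
  byte is missing. }
function Utf8SeqLen(const S: RawByteString; Index: SizeInt): Integer;
var
  Lead: Byte;
  N, I: Integer;
begin
  Result := 0;
  if (Index < 1) or (Index > Length(S)) then
    Exit;
  Lead := Ord(S[Index]);
  if Lead < $80 then
    N := 1
  else if (Lead and $E0) = $C0 then
    N := 2
  else if (Lead and $F0) = $E0 then
    N := 3
  else if (Lead and $F8) = $F0 then
    N := 4
  else
    Exit; // stray continuation byte ($80..$BF) or invalid lead ($F8..$FF)
  if Index + N - 1 > Length(S) then
    Exit; // sequence runs past the end of the string
  for I := 1 to N - 1 do
    if (Ord(S[Index + I]) and $C0) <> $80 then
      Exit; // expected a continuation byte of the form 10xxxxxx
  Result := N;
end;

{ Number of code points in S, or -1 if S fails the check above. }
function Utf8CodePointCount(const S: RawByteString): SizeInt;
var
  I: SizeInt;
  N: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    N := Utf8SeqLen(S, I);
    if N = 0 then
      Exit(-1);
    Inc(I, N);
    Inc(Result);
  end;
end;

var
  S: RawByteString;
begin
  S := 'caf'#$C3#$A9; // "cafe" with an acute accent: last character is 2 bytes
  WriteLn(Length(S), ' bytes, ', Utf8CodePointCount(S), ' code points');
  // prints: 5 bytes, 4 code points
end.
```

Neither function needs a conversion to UnicodeString, which is the point of asking for them: they work on the bytes directly.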

Mattias


Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-07 Thread Jonas Maebe

On 07 Apr 2013, at 13:35, Kostas Michalopoulos wrote:

> But still no UTF8 in FPC, despite all the different implementations
> floating around out there and despite UTF8 being the most important Unicode
> encoding (being used by practically anything that doesn't falsely believe
> that 16bit integers are enough).

Why is it that no debate about unicode can ever be held without adding 
flamebait about either UTF-8 or UTF-16? Please don't react to the above part of 
that message.


Jonas
FPC mailing lists admin


Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-07 Thread Mattias Gaertner
On Sun, 7 Apr 2013 13:35:40 +0200
Kostas Michalopoulos badsectorac...@gmail.com wrote:

> [...] I looked around in FPC 2.6.2's units and found nothing beyond
> utf8encode/decode (which in linux requires a C widestring manager that i'd
> like to avoid... and doesn't really help in all cases since Unicode can
> exceed the 16bit range).

It does not require a widestring manager.

 
> Searching in Google i found a discussion from 2007 which basically
> concluded "yeah, it is a nice feature, has some warts, but people need
> it" but didn't go anywhere
> http://free-pascal-general.1045716.n5.nabble.com/UTF-8-versions-of-Copy-and-Length-td2814536.html
> and
> the apparent lack of a UTF8 unit in FPC six years later (even for basic
> stuff like copy, length, etc) means that that unit never came to exist.
>
> So, what is going on with that? Graeme mentioned that he already had some
> code and knew some other library that provided a more complete solution

See for example the Lazarus lazutf8 unit.


> that could be imported in FPC and even another guy had yet another library.
> But still no UTF8 in FPC, despite all the different implementations
> floating around out there and despite UTF8 being the most important Unicode
> encoding (being used by practically anything that doesn't falsely believe
> that 16bit integers are enough).

AFAIK there are no UTF16 functions either.
Lazarus provides a lazutf16 unit too.

 
> Personally i coded yet another unit, which you can find here:
> http://pastebin.com/cJ2TvRdZ
>
> Of course my code is most likely slow and there might be some bugs there -
> i only did some testing with Greek characters which seem to work fine, but
> nothing like Chinese or the new emoticon stuff which is regularly added in
> Unicode.

I agree, your unit is most likely slow.

 
[...]

Mattias


Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-07 Thread Kostas Michalopoulos
On Sun, Apr 7, 2013 at 5:59 PM, Mattias Gaertner
nc-gaert...@netcologne.de wrote:

> On Sun, 7 Apr 2013 13:35:40 +0200
> Kostas Michalopoulos badsectorac...@gmail.com wrote:
>
>> [...] I looked around in FPC 2.6.2's units and found nothing beyond
>> utf8encode/decode (which in linux requires a C widestring manager that i'd
>> like to avoid... and doesn't really help in all cases since Unicode can
>> exceed the 16bit range).
>
> It does not require a widestring manager.

The documentation says otherwise:
http://www.freepascal.org/docs-html/rtl/system/utf8encode.html
"For this function to work, a widestring manager must be installed."

> See for example the Lazarus lazutf8 unit.
> AFAIK there are no UTF16 functions either.
> Lazarus provides a lazutf16 unit too.


Is there a reason for those to be in Lazarus only? Can they be moved to FPC?


Re: [fpc-devel] Unit for handling UTF-8 strings

2013-04-07 Thread Michael Van Canneyt



On Sun, 7 Apr 2013, Kostas Michalopoulos wrote:

> On Sun, Apr 7, 2013 at 5:59 PM, Mattias Gaertner nc-gaert...@netcologne.de wrote:
>
>> On Sun, 7 Apr 2013 13:35:40 +0200
>> Kostas Michalopoulos badsectorac...@gmail.com wrote:
>>
>>> [...] I looked around in FPC 2.6.2's units and found nothing beyond
>>> utf8encode/decode (which in linux requires a C widestring manager that i'd
>>> like to avoid... and doesn't really help in all cases since Unicode can
>>> exceed the 16bit range).
>>
>> It does not require a widestring manager.
>
> The documentation says otherwise:
> http://www.freepascal.org/docs-html/rtl/system/utf8encode.html
> "For this function to work, a widestring manager must be installed."
>
>> See for example the Lazarus lazutf8 unit.
>> AFAIK there are no UTF16 functions either.
>> Lazarus provides a lazutf16 unit too.
>
> Is there a reason for those to be in Lazarus only? Can they be moved to FPC?


FPC is preparing for a complete unicode solution, with proper language support. 
At best, these units are a temporary solution.


Michael.