Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support

2009-02-03 Thread Graeme Geldenhuys
On Tue, Feb 3, 2009 at 9:02 AM, Vincent Snijders
vsnijd...@vodafonevast.nl wrote:

> I am a Lazarus developer, and I don't think I said it like that.

I wasn't pointing fingers at you, Vincent. :-) I summarized what a few
people have said.

> LoadFromFile in a LCL control, you need to make sure they are valid UTF8
> strings. And honestly, it is only you who make sure that it is, because you
> know the initial encoding.

The problem is as follows. Even though I am a long-time developer,
I often have no clue what encoding a file is in when I look at the
file using Nautilus file manager. I often open a file in my preferred
text editor, look if it displays correctly, then look in the statusbar
area for what encoding the editor detected (at least my editor does
that nicely).

So even though you are using something as simple as the TMemo in LCL,
and LCL always wants UTF-8, how do you know what encoding to convert
from to UTF-8? Suppose I give you various text files, each using one of the
following schemes: UTF-16, UTF-16BE, UTF-16LE, UTF-32 and whatever
else I can find. If you load each file into a TStringList and then do
UTF8Decode on each line, will it display correctly in the TMemo?

Now what if the memo content is changed and then saved?  How does the
TMemo know which encoding to use (I would preferably like the same
encoding as before, not necessarily UTF-8). So if the file was
originally UTF-32, I don't want it to be UTF-8 afterwards.

If the TStringList.LoadFromFile(...) took an encoding parameter, it
could store that encoding option internally, so if you call
.SaveToFile('somefile.txt') later, it could use the same encoding as
used in LoadFromFile(), otherwise default to something like UTF-8 if
no encoding was specified anywhere.

Regards,
  - Graeme -


___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
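
The proposal in the message above (pass an encoding when loading, remember
it, and reuse it when saving) could look roughly like the sketch below. It
is only an illustration under stated assumptions: TEncodedText and
TFileEncoding are hypothetical names, not RTL or LCL API; only UTF-8 and
UTF-16LE are handled; a little-endian host is assumed; and the file name is
taken from the first command-line argument.

```pascal
program RememberEncodingDemo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

type
  // Hypothetical tag; only two schemes, to keep the sketch short.
  TFileEncoding = (feUtf8, feUtf16LE);

  // Hypothetical wrapper: Lines always holds UTF-8 (what the LCL wants),
  // Encoding remembers what the file on disk used.
  TEncodedText = class
  private
    FLines: TStringList;
    FEncoding: TFileEncoding;
  public
    constructor Create;
    destructor Destroy; override;
    procedure LoadFromFile(const FileName: string; AEncoding: TFileEncoding);
    procedure SaveToFile(const FileName: string);
    property Lines: TStringList read FLines;
    property Encoding: TFileEncoding read FEncoding;
  end;

constructor TEncodedText.Create;
begin
  inherited Create;
  FLines := TStringList.Create;
  FEncoding := feUtf8;  // default when no encoding was specified
end;

destructor TEncodedText.Destroy;
begin
  FLines.Free;
  inherited Destroy;
end;

procedure TEncodedText.LoadFromFile(const FileName: string; AEncoding: TFileEncoding);
var
  fs: TFileStream;
  Raw: AnsiString;
  Wide: WideString;
begin
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Raw, fs.Size);
    if Length(Raw) > 0 then
      fs.ReadBuffer(Raw[1], Length(Raw));
  finally
    fs.Free;
  end;
  FEncoding := AEncoding;
  if AEncoding = feUtf16LE then
  begin
    // Reinterpret the bytes as 16-bit code units (little-endian host assumed)
    SetLength(Wide, Length(Raw) div 2);
    if Length(Wide) > 0 then
      Move(Raw[1], Wide[1], Length(Wide) * 2);
    if (Length(Wide) > 0) and (Ord(Wide[1]) = $FEFF) then
      Wide := Copy(Wide, 2, Length(Wide) - 1);  // drop the BOM
    FLines.Text := UTF8Encode(Wide);
  end
  else
    FLines.Text := Raw;  // taken as UTF-8 without any check
end;

procedure TEncodedText.SaveToFile(const FileName: string);
const
  Bom: Word = $FEFF;
var
  fs: TFileStream;
  U8: AnsiString;
  Wide: WideString;
begin
  U8 := FLines.Text;
  fs := TFileStream.Create(FileName, fmCreate);
  try
    if FEncoding = feUtf16LE then
    begin
      // Convert back to the encoding remembered by LoadFromFile
      Wide := UTF8Decode(U8);
      fs.WriteBuffer(Bom, SizeOf(Bom));
      if Length(Wide) > 0 then
        fs.WriteBuffer(Wide[1], Length(Wide) * SizeOf(WideChar));
    end
    else if Length(U8) > 0 then
      fs.WriteBuffer(U8[1], Length(U8));
  finally
    fs.Free;
  end;
end;

var
  Doc: TEncodedText;
begin
  Doc := TEncodedText.Create;
  try
    Doc.LoadFromFile(ParamStr(1), feUtf16LE);  // encoding chosen by the caller
    Doc.Lines.Add('appended line');
    Doc.SaveToFile(ParamStr(1));               // written back as UTF-16LE
  finally
    Doc.Free;
  end;
end.
```

The wrapper keeps the conversion outside TStringList itself, so existing
code that calls the plain LoadFromFile is unaffected. Whether this belongs
in the RTL at all is exactly what the rest of the thread debates.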


Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support

2009-02-03 Thread Vincent Snijders

Graeme Geldenhuys wrote:

> On Tue, Feb 3, 2009 at 9:02 AM, Vincent Snijders
> vsnijd...@vodafonevast.nl wrote:
>
> > I am a Lazarus developer, and I don't think I said it like that.
>
> I wasn't pointing fingers at you, Vincent. :-) I summarized what a few
> people have said.
>
> > LoadFromFile in a LCL control, you need to make sure they are valid UTF8
> > strings. And honestly, it is only you who make sure that it is, because you
> > know the initial encoding.
>
> The problem is as follows. Even though I am a long-time developer,
> I often have no clue what encoding a file is in when I look at the
> file using Nautilus file manager. I often open a file in my preferred
> text editor, look if it displays correctly, then look in the statusbar
> area for what encoding the editor detected (at least my editor does
> that nicely).

The LCL does not have this feature. It can only handle UTF8. Period.

> So even though you are using something as simple as the TMemo in LCL,
> and LCL always wants UTF-8, how do you know what encoding to convert
> from to UTF-8?

If you don't know, you cannot process it. Simple.

> If I give you various text files, each using one of the
> following schemes: UTF-16, UTF-16BE, and UTF-16LE, UTF-32 and whatever
> else I can find. Loading the file into a TStringList and then doing
> UTF8Decode on each line will it display correctly in the TMemo?

For each of these encodings, you would first have to translate it to UTF8 before
you give it to the LCL. Note that it is not wise to load UTF16* and UTF32 encoded
files into a byte-indexed ansistring.

> Now what if the memo content is changed and then saved?  How does the
> TMemo know which encoding to use (I would preferably like the same
> encoding as before, not necessarily UTF-8). So if the file was
> originally UTF-32, I don't want it to be UTF-8 afterwards.

If you want it to be the same, then you have to convert it back. You know what it
was in the first place, because you translated it to UTF8 before giving it to the LCL.

> If the TStringList.LoadFromFile(...) took a encoding parameter, it
> could store that encoding option internally, so if you call
> .SaveToFile(somefile.txt) later, it could use the same encoding as
> used in LoadFromFile(), otherwise default to something like utf-8 if
> no encoding was specified anywhere.

Maybe. I leave that suggestion to the RTL developers. See also Marco's mail.

Vincent
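
The warning above about byte-indexed ansistrings can be made concrete with
a small sketch. It assumes a little-endian host and builds the UTF-16LE
bytes of 'Hi' by hand instead of reading a file; nothing here is LCL code.

```pascal
program ByteIndexedDemo;
{$mode objfpc}{$H+}

var
  Raw: AnsiString;
  Wide: WideString;
begin
  // The two characters 'Hi' as UTF-16LE bytes: 48 00 69 00
  SetLength(Raw, 4);
  Raw[1] := 'H'; Raw[2] := #0;
  Raw[3] := 'i'; Raw[4] := #0;

  // Byte-indexed view: Length counts bytes, and every second "char" is NUL
  WriteLn('AnsiString length: ', Length(Raw));

  // Reinterpret the bytes as 16-bit code units (little-endian host assumed),
  // then translate to UTF-8 before handing the text to the LCL
  SetLength(Wide, Length(Raw) div SizeOf(WideChar));
  Move(Raw[1], Wide[1], Length(Wide) * SizeOf(WideChar));
  WriteLn('WideString length: ', Length(Wide));
  WriteLn('UTF-8 text: ', UTF8Encode(Wide));
end.
```

On a little-endian machine the two lengths should come out as 4 and 2. Any
line-splitting done on the raw AnsiString would operate on bytes, not on
UTF-16 code units, which is why the translation has to happen first.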


Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support

2009-02-03 Thread Michael Van Canneyt


On Tue, 3 Feb 2009, Vincent Snijders wrote:

> Graeme Geldenhuys wrote:
> > On Tue, Feb 3, 2009 at 9:02 AM, Vincent Snijders
> > vsnijd...@vodafonevast.nl wrote:
> > > I am a Lazarus developer, and I don't think I said it like that.
> >
> > I wasn't pointing fingers at you, Vincent. :-) I summarized what a few
> > people have said.
> >
> > > LoadFromFile in a LCL control, you need to make sure they are valid UTF8
> > > strings. And honestly, it is only you who make sure that it is, because
> > > you know the initial encoding.
> >
> > The problem is as follows. Even though I am a long-time developer,
> > I often have no clue what encoding a file is in when I look at the
> > file using Nautilus file manager. I often open a file in my preferred
> > text editor, look if it displays correctly, then look in the statusbar
> > area for what encoding the editor detected (at least my editor does
> > that nicely).
>
> The LCL does not have this feature. It can only handle UTF8. Period.
>
> > So even though you are using something as simple as the TMemo in LCL,
> > and LCL always wants UTF-8, how do you know what encoding to convert
> > from to UTF-8?
>
> If you don't know, you cannot process it. Simple.

This is why many editors and mail programs have a menu option 'Encoding':
because they also don't know, and they cannot know without external means,
what the encoding is.

Michael.


Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support

2009-02-03 Thread Mattias Gaertner
On Tue, 3 Feb 2009 09:39:58 +0100 (CET)
Michael Van Canneyt mich...@freepascal.org wrote:

 
 
> On Tue, 3 Feb 2009, Vincent Snijders wrote:
>
> > Graeme Geldenhuys wrote:
> > > On Tue, Feb 3, 2009 at 9:02 AM, Vincent Snijders
> > > vsnijd...@vodafonevast.nl wrote:
> > > > I am a Lazarus developer, and I don't think I said it like that.
> > >
> > > I wasn't pointing fingers at you, Vincent. :-) I summarized what a
> > > few people have said.
> > >
> > > > LoadFromFile in a LCL control, you need to make sure they are
> > > > valid UTF8 strings. And honestly, it is only you who make sure
> > > > that it is, because you know the initial encoding.
> > >
> > > The problem is as follows. Even though I am a long-time
> > > developer, I often have no clue what encoding a file is in when I
> > > look at the file using Nautilus file manager. I often open a file
> > > in my preferred text editor, look if it displays correctly, then
> > > look in the statusbar area for what encoding the editor detected
> > > (at least my editor does that nicely).
> >
> > The LCL does not have this feature. It can only handle UTF8. Period.
> >
> > > So even though you are using something as simple as the TMemo in
> > > LCL, and LCL always wants UTF-8, how do you know what encoding to
> > > convert from to UTF-8?
> >
> > If you don't know, you cannot process it. Simple.
>
> This is why many editors and mail programs have a menu option
> 'Encoding': because they also don't know, and they cannot know
> without external means, what the encoding is.

Let's add that to TMemo. ;)

Mattias


Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support

2009-02-03 Thread Graeme Geldenhuys
On Tue, Feb 3, 2009 at 10:44 AM, Mattias Gaertner
nc-gaert...@netcologne.de wrote:

> > This is why many editors and mail programs have a menu option
> > 'Encoding': because they also don't know, and they cannot know
> > without external means, what the encoding is.
>
> Let's add that to TMemo. ;)

The point is that you will probably have a File Open dialog that gives
the filename to TMemo.LoadFromFile. Then the file dialog could collect
the filename and optional encoding to pass on to LoadFromFile.

Even if TStringList has the optional encoding parameter, prior source
code should still work as-is without the encoding parameter. No code
would be broken.

I agree with Marco that auto detecting encodings is probably not a
good idea in the RTL, but at least enable the option of an encoding
parameter in TStringList, which could help things along. As in the
case of the bug report.

Regards,
  - Graeme -




Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support

2009-02-03 Thread Graeme Geldenhuys
On Tue, Feb 3, 2009 at 1:33 PM, Sergei Gorelkin sergei_gorel...@mail.ru wrote:

> Strings.LoadFromStream(TDecodingStream.Create(TFileStream.Create('myfile'),
> 'cp866', 'utf-8'));
>
> This approach isn't limited to decoding, you can do decrypting, compressing,
> etc.


That's actually a very clever idea.  :)


Regards,
  - Graeme -




Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support

2009-02-03 Thread Graeme Geldenhuys
On Tue, Feb 3, 2009 at 1:00 PM, Michael Schnell mschn...@lumino.de wrote:
> Would it not be necessary to define both the encoding of the file and the
> encoding you want to have for the strings when accessing them?

I guess that would be taken care of if Free Pascal has a fully working
UnicodeString type. I don't know what the status of that is in the
2.3.x code.


Regards,
  - Graeme -




Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support

2009-02-03 Thread Michael Schnell



> I guess that would be taken care of if Free Pascal has a fully working
> UnicodeString type. I don't know what's the status of that in the
> 2.3.x code.

Of course (as already discussed several times). Even if the state is not
known, do we know the final goal? Will there be _only_one_ UnicodeString type
or will there still be things like ANSIString holding UTF8? Did the
powers agree on a single white paper on this?


-Michael



Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support

2009-02-03 Thread Michael Van Canneyt


On Tue, 3 Feb 2009, Sergei Gorelkin wrote:

> Graeme Geldenhuys wrote:
>
> > The point is that you will probably have a File Open dialog that gives
> > the filename to TMemo.LoadFromFile. The the file dialog could collect
> > the filename and optional encoding to pass on to LoadFromFile.
> >
> > Even if TStringList has the optional encoding parameter, prior source
> > code should still work as-is without the encoding parameter. No code
> > would be broken.
> >
> > I agree with Marco that auto detecting encodings is probably not a
> > good idea in the RTL, but at least enable the option of a encoding
> > parameter in TStringList, which could help things along. As in the
> > case of the bug report.
>
> There is no need for TStrings.LoadFromFile method at all.
> The container is one matter, and its serialization is a separate one. Just
> introduce a decoding stream and you can write e.g:
>
> Strings.LoadFromStream(TDecodingStream.Create(TFileStream.Create('myfile'),
> 'cp866', 'utf-8'));
>
> This approach isn't limited to decoding, you can do decrypting, compressing,
> etc.
> In reality, of course, you have to finalize all the stuff - that means typing
> more than one line. C++ language that automatically finalizes on-stack objects
> is more friendly in this respect.

That's easily done:
- make TDecodingStream a descendant of TOwnerStream (exists) and the
  TFileStream will be freed.
- Add a second parameter to
  LoadFromStream(AStream : TStream; FreeStream : Boolean = False)
  and your call becomes

  Strings.LoadFromStream(TDecodingStream.Create(TFileStream.Create('myfile'),
    'cp866', 'utf-8'), True);

Michael.
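
Until LoadFromStream has such a FreeStream parameter, the same effect can
be had with a small helper. This is a minimal sketch, not RTL code:
LoadFromStreamAndFree is a hypothetical name, and a TStringStream stands
in for the proposed TDecodingStream, which does not exist in the RTL.

```pascal
program FreeStreamDemo;
{$mode objfpc}{$H+}

uses
  Classes;

// Hypothetical helper: loads AStrings from AStream and always frees the
// stream afterwards, even when LoadFromStream raises an exception.
procedure LoadFromStreamAndFree(AStrings: TStrings; AStream: TStream);
begin
  try
    AStrings.LoadFromStream(AStream);
  finally
    AStream.Free;
  end;
end;

var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    // TStringStream stands in for TDecodingStream.Create(TFileStream.Create(...))
    LoadFromStreamAndFree(Lines,
      TStringStream.Create('first' + LineEnding + 'second'));
    WriteLn('lines loaded: ', Lines.Count);
  finally
    Lines.Free;
  end;
end.
```

With a TOwnerStream descendant whose SourceOwner is True, freeing the outer
stream also frees the wrapped TFileStream, so the one-line call leaks
nothing.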


Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support

2009-02-03 Thread Michael Schnell
Would it not be necessary to define both the encoding of the file and
the encoding you want to have for the strings when accessing them?


-Michael


Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support

2009-02-03 Thread Sergei Gorelkin

Graeme Geldenhuys wrote:


> The point is that you will probably have a File Open dialog that gives
> the filename to TMemo.LoadFromFile. The the file dialog could collect
> the filename and optional encoding to pass on to LoadFromFile.
>
> Even if TStringList has the optional encoding parameter, prior source
> code should still work as-is without the encoding parameter. No code
> would be broken.
>
> I agree with Marco that auto detecting encodings is probably not a
> good idea in the RTL, but at least enable the option of a encoding
> parameter in TStringList, which could help things along. As in the
> case of the bug report.

There is no need for a TStrings.LoadFromFile method at all.
The container is one matter, and its serialization is a separate one.
Just introduce a decoding stream and you can write e.g.:

Strings.LoadFromStream(TDecodingStream.Create(TFileStream.Create('myfile'),
  'cp866', 'utf-8'));

This approach isn't limited to decoding; you can do decrypting,
compressing, etc.
In reality, of course, you have to finalize all the stuff - that means
typing more than one line. The C++ language, which automatically finalizes
on-stack objects, is more friendly in this respect.


Regards,
Sergei


Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support

2009-02-02 Thread Vincent Snijders

Graeme Geldenhuys wrote:

> Hi,
>
> I just read all the comments about the following bug report filed
> under the Lazarus project.
>   http://bugs.freepascal.org/view.php?id=12676
>
> The comments posted don't seem sufficient to me.  If a user selects
> a file to be loaded, they have no clue if that file is ANSI, UTF-8,
> UTF-16 etc encoded. The suggestion by the Lazarus developers is to
> ALWAYS assume the file is in UTF-8 (just because LCL uses UTF-8
> internally) and to do a UTF8Encode on each line of the file. So what
> happens if you do a .SavetoFile(...)?  Must you UTF8Decode each line
> again??

I am a Lazarus developer, and I don't think I said it like that.

What I mean is:
If you load a file using LoadFromFile, the lines of the file are loaded
into ansistrings. No conversion is done by the RTL, so the encoding
remains the same as in the file.

Now, the LCL is very picky about its encoding: it always wants UTF8
encoded strings. It is not a chameleon like the RTL, which changes its
encoding according to the system settings. If you want to show strings
loaded by LoadFromFile in a LCL control, you need to make sure they are
valid UTF8 strings. And honestly, it is only you who can make sure that
they are, because you know the initial encoding.


Vincent


Re: [fpc-devel] TStringList.LoadFromFile and SavetoFile - file encoding support

2009-02-02 Thread Marco van de Voort
In our previous episode, Graeme Geldenhuys said:
> This supposed solution fails horribly in practice. What if the file
> was UTF-16 encoded?

Then you can't load it into an ansistring TStringList, since ansistring is
one byte per char by default.

> I believe Delphi 2009 extended the .LoadFromFile(...) and
> .SaveToFile(...) methods with an optional encoding parameter.

Delphi (and 2.4 in the future) are totally separate things, since they
actually have UTF-8 support. Currently 2.2.x only supports UTF-16
(widestring) and ansistring (in the native encoding); the rest is bolted on
at best.

> Could something like this be added to TStringList etc?

In time, when D2009 compatibility is added yes. But not for a crur

> I guess we would also need some auto encoding detection in place.

Never for a basic RTL primitive. That is something for editor programs, not
library routines; in other words, this must be handled at a different level
than the base RTL.

> How do other text editors managed to auto detect the file encodings - to a
> degree of accuracy?

Decoding the file as each of the various types and counting errors would be
my first guess. Maybe heuristics are involved.

> Also, if the .LoadFromFile(...) and .SaveToFile(...) methods were
> extended, then we (GUI toolkit developers) could extend File Open and File
> Save dialogs like Qt has done.
>
> If the auto encoding detection didn't work, the user can use the combobox
> in the file open/save dialog to specify a encoding to use. Web Browsers
> have a similar feature when displaying HTML.

Autodetection is the job of the toolkit developer, not of the base RTL.
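
As a rough illustration of the kind of guessing an editor or toolkit could
do, here is a minimal sketch. It is a simplified variant of the "decode and
count errors" idea above: it looks for a byte order mark first and then
only checks whether the bytes are structurally valid UTF-8. The type and
function names are hypothetical, and overlong forms and surrogates are not
rejected. A legacy 8-bit file ends up as "unknown", which is the point
where a user-facing 'Encoding' menu is still needed.

```pascal
program GuessEncodingDemo;
{$mode objfpc}{$H+}

type
  // Hypothetical result type for this sketch.
  TEncodingGuess = (egUtf8Bom, egUtf16LE, egUtf16BE, egUtf32LE, egUtf32BE,
                    egUtf8NoBom, egUnknown);

// True when Data begins with exactly the given bytes.
function StartsWith(const Data: AnsiString; const Prefix: array of Byte): Boolean;
var
  i: Integer;
begin
  Result := Length(Data) >= Length(Prefix);
  if Result then
    for i := 0 to High(Prefix) do
      if Ord(Data[i + 1]) <> Prefix[i] then
        Exit(False);
end;

// True when every lead byte is followed by the right number of
// continuation bytes (10xxxxxx). Structural check only.
function LooksLikeUtf8(const Data: AnsiString): Boolean;
var
  i, Need: Integer;
  b: Byte;
begin
  Result := False;
  i := 1;
  while i <= Length(Data) do
  begin
    b := Ord(Data[i]);
    if b < $80 then Need := 0
    else if (b and $E0) = $C0 then Need := 1
    else if (b and $F0) = $E0 then Need := 2
    else if (b and $F8) = $F0 then Need := 3
    else Exit;  // stray continuation byte or invalid lead byte
    Inc(i);
    while Need > 0 do
    begin
      if (i > Length(Data)) or ((Ord(Data[i]) and $C0) <> $80) then
        Exit;   // truncated or malformed sequence
      Inc(i);
      Dec(Need);
    end;
  end;
  Result := True;
end;

function GuessEncoding(const Data: AnsiString): TEncodingGuess;
begin
  // UTF-32LE is tested before UTF-16LE: its BOM also starts with FF FE.
  if StartsWith(Data, [$FF, $FE, $00, $00]) then Result := egUtf32LE
  else if StartsWith(Data, [$00, $00, $FE, $FF]) then Result := egUtf32BE
  else if StartsWith(Data, [$EF, $BB, $BF]) then Result := egUtf8Bom
  else if StartsWith(Data, [$FF, $FE]) then Result := egUtf16LE
  else if StartsWith(Data, [$FE, $FF]) then Result := egUtf16BE
  else if LooksLikeUtf8(Data) then Result := egUtf8NoBom
  else Result := egUnknown;
end;

begin
  // Each line is expected to print TRUE.
  WriteLn(GuessEncoding(#$EF#$BB#$BF'abc') = egUtf8Bom);
  WriteLn(GuessEncoding(#$FF#$FE#$3D#$04) = egUtf16LE);
  WriteLn(GuessEncoding('caf'#$C3#$A9) = egUtf8NoBom);
  WriteLn(GuessEncoding('caf'#$E9) = egUnknown);  // Latin-1 bytes, not UTF-8
end.
```

Pure ASCII input is reported as UTF-8 without BOM, since ASCII is a subset
of UTF-8. The guess is only a starting value for an encoding combobox, not
a replacement for it.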