Hi Rudy,

IMHO UTF-8 encoding only makes sense in the context of plain text files 
(character-based files like txt, csv, tsv, xml, json, html, ...). It has no 
meaning for binary files (PDFs, pictures).
xlsx and docx files are essentially zip archives containing a bunch of xml 
files. For xml files UTF-8 is the default encoding, but you (or your customer) 
should not have to worry about those.
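For the curious, that zip-of-XML structure is easy to see. A minimal sketch in Python (the member name word/document.xml and its contents are invented for illustration, not taken from a real docx):

```python
import io
import zipfile

# Build a tiny "docx-like" zip in memory containing one XML member.
# Real .docx/.xlsx files are zip archives of XML files, UTF-8 by default.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml",
               '<?xml version="1.0" encoding="UTF-8"?><doc>héllo</doc>'
               .encode("utf-8"))

# Reopen the archive and read the XML member back as UTF-8.
with zipfile.ZipFile(buf) as z:
    print(z.namelist())        # ['word/document.xml']
    xml = z.read("word/document.xml").decode("utf-8")
    print(xml)
```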

The real problem arises when importing and processing plain text files, 
especially with the high character codes (above 127). MacRoman maps some of 
those bytes to different characters than e.g. Latin-1, and each of those 
encodings covers only a limited character range.
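You can see the ambiguity with a single byte; a quick sketch in Python (using its built-in latin-1 and mac_roman codecs):

```python
# The same high byte decodes to different characters
# depending on which legacy encoding you assume.
raw = bytes([0xE9])
print(raw.decode("latin-1"))    # 'é'
print(raw.decode("mac_roman"))  # 'È'
```

Without out-of-band information there is no way to tell which reading was intended.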

Unfortunately, as Lutz also mentions, it is not always possible to determine 
the character encoding used when you receive a text file. A BOM is an 
indication of a UTF file, but in my experience it is rarely used; a BOM is not 
required.
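When a BOM is present, though, it is easy to recognize. A sketch in Python (sniff_bom is a hypothetical helper, not a 4D or BBEdit API):

```python
import codecs

def sniff_bom(raw: bytes):
    """Return a codec name if the data starts with a known BOM, else None."""
    # Check UTF-32 before UTF-16: the UTF-32-LE BOM begins with
    # the same two bytes as the UTF-16-LE BOM.
    for bom, name in [(codecs.BOM_UTF32_LE, "utf-32-le"),
                      (codecs.BOM_UTF32_BE, "utf-32-be"),
                      (codecs.BOM_UTF8, "utf-8-sig"),
                      (codecs.BOM_UTF16_LE, "utf-16-le"),
                      (codecs.BOM_UTF16_BE, "utf-16-be")]:
        if raw.startswith(bom):
            return name
    return None

print(sniff_bom(codecs.BOM_UTF8 + b"hello"))  # utf-8-sig
print(sniff_bom(b"hello"))                    # None
```

A None result tells you nothing: the file can still be any encoding, which is exactly the problem.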

FWIW, BBEdit also guesses what the text file encoding could be. It does a good 
job, but it can be fooled.
E.g. create a new file in BBEdit and set the encoding to Windows Latin-1. 
Enter the text Ë§ and save the file. Close it and reopen it in BBEdit. It 
will now say UTF-8 and show different content.

HTH
Koen

> On 10 Jan. 2020, at 22:58, Two Way Communications via 4D_Tech 
> <[email protected]> wrote:
> 
> If, e.g., I look at a pdf file in BBEdit, it says ‘Mac Roman’.



--------------------
Compass bvba
Koen Van Hooreweghe
Kloosterstraat 65
9910 Aalter
Belgium
tel +32 495 511.653

**********************************************************************
4D Internet Users Group (4D iNUG)
Archive:  http://lists.4d.com/archives.html
Options: https://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[email protected]
**********************************************************************