On 2018-05-20 15:43:54 +0200, Karsten Hilbert wrote:
> On Sun, May 20, 2018 at 04:59:12AM -0700, bellcanada...@gmail.com wrote:
> 
> > On Saturday, 19 May 2018 19:48:20 UTC-4, Skip Montanaro  wrote:
> > > As Chris indicated, you'll have to figure out the correct encoding. You
> > > might want to check out the chardet module (available on PyPI, I believe)
> > > and see if it can come up with a better guess. I imagine there are other
> > > encoding guessers out there. That's just one I'm familiar with.
> > 
> > thank you for the reply, but how exactly am i supposed to find out what is
> > the correct encoding??
> 
> One CAN NOT.
> 
> The best you can do is to go ask the canonical source of the
> file what encoding the file is _supposed_ to be in.

I disagree on both counts.

1) For any given file it is almost always possible to find the correct
   encoding (or *a* correct encoding, as there may be more than one).

   This may require domain-specific knowledge (e.g. it may be necessary
   to recognize the human language and know at least some distinctive
   words, or to know some special symbols likely to be used in a data
   file), and it almost always takes a bit of detective work and trial
   and error. But I don't think I ever encountered a file where I
   couldn't figure out the encoding.

   (If you have several files in the same encoding, it may not be
   possible to figure out the encoding from a subset of them. For
   example, the files may all be in ISO-8859-2, but the subset you have
   contains only characters <= 0x7F. But if you have several files, they
   may not all be in the same encoding, either.)

2) The canonical source of the file may not know. This is quite frequent
   when the source is some non-technical person. Then you get answers
   like "it's ASCII" (although the file contains umlauts, which aren't
   in ASCII) or "it's ANSI" (which isn't an encoding, although Windows
   pretends it is). Or they may not be aware that the file is converted
   somewhere in the pipeline, so that the file they generated isn't
   actually the file you received. So ask (or check the docs), but
   verify!
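
To make the trial and error a bit more concrete, here is a rough Python
sketch of the kind of check I have in mind. It is only an illustration:
the function names, the candidate list, the marker words and the sample
bytes are all made up for this example, and a real file will usually
need more than a couple of marker words.

```python
# Sketch only: candidates, marker words and sample data are illustrative.

def rank_encodings(data, candidates, expected_words):
    """Decode data with each candidate and rank by expected words found."""
    results = []
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # bytes are invalid in this encoding, so rule it out
        hits = sum(1 for word in expected_words if word in text)
        results.append((hits, name))
    # Stable sort: most hits first, ties stay in candidate order.
    results.sort(key=lambda pair: -pair[0])
    return results


def claim_holds(data, claimed):
    """Return True if data at least decodes cleanly as the claimed encoding."""
    try:
        data.decode(claimed)
    except UnicodeDecodeError:
        return False
    return True


# A German sample, encoded as ISO-8859-1 for the sake of the example.
sample = "Grüße aus Köln".encode("iso-8859-1")

print(rank_encodings(sample, ["utf-8", "iso-8859-1", "cp437"], ["Grüße", "Köln"]))
print(claim_holds(sample, "ascii"))  # the "it's ASCII" answer fails here
```

A clean decode proves very little on its own (cp437 accepts any byte
sequence), which is why the marker words matter. And for data that
contains only bytes <= 0x7F, every ASCII-compatible candidate scores
the same, which is the ambiguity mentioned under 1).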

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | h...@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
