On 8/22/2014 11:50 AM, Oleg Broytman wrote:
On Fri, Aug 22, 2014 at 10:09:21AM -0700, Glenn Linderman 
<v+pyt...@g.nevcal.com> wrote:
On 8/22/2014 9:52 AM, Oleg Broytman wrote:
On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman 
<v+pyt...@g.nevcal.com> wrote:
On 8/22/2014 8:51 AM, Oleg Broytman wrote:
    What encoding does a text file have (an HTML file, to be precise) with
text in utf-8, ads in cp1251 (the ad blocks were included from different
files), and comments in koi8-r?
    Well, I must admit the HTML was rather an exception, but having a
text file with some strange characters (binary strings, or paragraphs
in different encodings) is not that exceptional.
That's not a text file. That's a binary file containing (hopefully
delimited, and documented) sections of encoded text in different
encodings.
    Allow me to disagree. For me, this is a text file which I can (and
do) view with a pager, edit with a text editor, list on a console,
search with grep, and so on. If it is not a text file by strict Python3
standards, then those standards are too strict for me. Either I find a
simple workaround in Python3 to work with such texts or I find a
different tool. I cannot avoid such files because my reality is much
more complex than the strict text/binary dichotomy in Python3.
I was not declaring your file not to be a "text file" by any
definition obtained from the Python3 documentation, just by a common-sense
definition of "text file".
    And in my opinion those files are perfectly good text. The files
consist of lines separated by EOL characters (not necessarily the EOL
characters of my OS, because the file could have been produced on a
different OS), lines consist of words, and words of characters.

Until you know or can deduce the encoding of a file, it is binary. If it has multiple, different, embedded encodings of text, it is still binary, in my opinion. So these are just opinions and naming conventions; if you call it text, you have a different definition of "text file" than I do.


Looking at it from Python3, though, it is clear that when opening a
file in "text" mode, an encoding may be specified or will be
assumed. That is one encoding applying to the whole file, not three
encodings with declarations of when to switch between them. So I
think, in general, Python3 assumes or defines a definition of text
file that matches my "common sense" definition.
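
To make that concrete, a minimal sketch (the file name is hypothetical):

    with open('page.html', encoding='utf-8') as f:
        text = f.read()    # str, decoded as utf-8 throughout; a stray
                           # cp1251 ad block would raise UnicodeDecodeError

If encoding= is omitted, locale.getpreferredencoding(False) is assumed,
but it is still a single encoding for the whole file.
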
    I don't have problems with Python3 text. I have problems with Python3
trying to get rid of byte strings and treating bytes as strictly non-text.

Python3 is not trying to get rid of byte strings. But to some extent, it does want to treat bytes as non-text... bytes can be encoded text, but they are not text until decoded. Some processing can be done on encoded text, but it has to be done differently (in many cases) than processing done on (non-encoded) text.

One difference is that which byte sequence means which character varies from encoding to encoding, so if the processing requires understanding the characters, then the encoding must be known.
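
A one-byte illustration of that point; the same byte is a different
character under each of the encodings mentioned above:

    b = b'\xc1'
    print(b.decode('cp1251'))    # 'Б' (Cyrillic capital BE)
    print(b.decode('koi8-r'))    # 'а' (Cyrillic small A)
    print(b.decode('latin-1'))   # 'Á'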

On the other hand, if it suffices to detect blocks of opaque text delimited by a known set of delimiter codes (EOL: CR, LF, or combinations thereof), then that can be done relatively easily on binary data, as long as the encoding doesn't have data puns, where a multibyte encoded character might contain the delimiter code as one of its bytes.
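
For example, utf-8, cp1251, and koi8-r are all safe to split on the LF
byte, while utf-16 has exactly that kind of data pun (a sketch):

    data = b'one line\r\nanother\nlast'
    print(data.splitlines())          # [b'one line', b'another', b'last']

    # utf-16 can hide the LF byte inside a character:
    print('Њ'.encode('utf-16-le'))    # b'\n\x04'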

In any case, Python3 provides various facilities for working
with such files.

The first I'll mention is the one that follows from my description
of what your file really is: Python3 allows opening files in binary
mode, and then decoding various sections of it using whatever
encoding you like, using the bytes.decode() operation on various
sections of the file. Determination of which sections are in which
encodings is beyond the scope of this description of the technique,
and is application dependent.
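
A sketch of that technique, assuming the application somehow knows the
section boundaries (the file name and offsets here are made up):

    with open('mixed.html', 'rb') as f:
        data = f.read()                          # bytes; nothing decoded yet
    body     = data[:1000].decode('utf-8')       # main text
    ad       = data[1000:1200].decode('cp1251')  # included ad block
    comments = data[1200:].decode('koi8-r')      # comments
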
    This is perhaps the most promising approach. If I can open a text
file in binary mode, iterate it line by line, split every line of
non-ascii bytes with .split(), and process them, that'd satisfy my needs.
    But still there are dragons. If I read a filename from such a file, I
read it as bytes, not str, so I can only use low-level APIs to
manipulate those filenames. Pity.

If the file names are in an unknown encoding, both in the directory and in the encoded text of the file listing, then unless you can deduce the encoding, you are limited to doing manipulations with the file APIs that support bytes (the low-level ones, yes). If you can deduce the encoding, then you are freed from that limitation.
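
For example, the os and os.path functions accept and return bytes
names, and open() takes a bytes name too:

    import os
    for name in os.listdir(b'.'):          # bytes in, bytes out
        if name.endswith(b'.mp3'):
            print(name, os.path.getsize(name))
            with open(name, 'rb') as f:    # open() accepts bytes names
                header = f.read(10)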

    Let's see a perfectly normal situation I am quite often in. A person
sent me a directory full of MP3 files. The transport doesn't matter; it
could be FTP, or rsync, or a zip file sent by email, or bittorrent. What
matters is that filenames and content are in alien encodings. Most often
it's cp1251 (the encoding used in Russian Windows) but can be koi8 or
utf8. There is a playlist among the files -- a text file that lists MP3
files, every file on a single line; usually with full paths
("C:\Audio\some.mp3").
    Now I want to read filenames from the file and process the filenames
(strip paths) and files (verify the existence of the files, or renumber
the files, or extract ID3 tags [Russian ID3 tags, whatever the ID3
standard says, are also in cp1251 of utf-8 encoding]... whatever).

"cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is utf-8, but it is not both. Maybe you meant "or" instead of "of".

  I don't know the encoding
of the playlist, but I know it corresponds to the encoding of the
filenames, so I can expect those files to exist on my filesystem; they
have strange-looking unreadable names, but they exist.
    Just a small example of why I do want to process filenames from a
text file in an alien encoding. Without knowing the encoding in advance.
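
Something like this sketch, say (the playlist name is made up, and the
encoding stays unknown throughout):

    import os
    with open(b'playlist.m3u', 'rb') as f:
        for line in f:
            path = line.strip()                # bytes, alien encoding
            name = path.rsplit(b'\\', 1)[-1]   # strip "C:\Audio\" paths
            print(name, os.path.exists(name))  # does it exist here?

(Splitting on the backslash byte is safe because cp1251, koi8 and utf-8
are all ascii-compatible.)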

An interesting example, for sure. Life will be easier when everyone converts to Unicode and UTF-8.


The second is to specify an error handler that, like you, is
trained to recognize the other encodings and convert them
appropriately. I'm not aware that such an error handler has been or
could be written, myself not having your training.
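
For what it's worth, the plumbing for such a handler does exist via
codecs.register_error(); the hard part is the recovery logic itself,
which is only sketched here (the file name and the cp1251 fallback are
made up):

    import codecs

    def guess_other_encoding(exc):
        # exc.object is the bytes being decoded; exc.start:exc.end is
        # the offending range.  A real handler would need your training
        # to choose between cp1251 and koi8-r; this one just tries cp1251.
        bad = exc.object[exc.start:exc.end]
        return bad.decode('cp1251'), exc.end

    codecs.register_error('guess-other', guess_other_encoding)
    text = open('playlist.m3u', encoding='utf-8',
                errors='guess-other').read()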

The third is to specify UTF-8 with the surrogateescape error
handler. This allows non-UTF-8 bytes to be loaded into memory. You
(or algorithms as smart as you, if such could be developed) could
perhaps detect and manipulate the resulting "lone surrogate" codes in
meaningful ways, or could simply let them ride along without
interpretation, to be emitted as the original bytes into other files.
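
A sketch of that roundtrip, with made-up bytes:

    raw = b'caf\xe9 and \xc2\xe0\xf1\xff'   # bytes in unknown encodings
    text = raw.decode('utf-8', errors='surrogateescape')
    # undecodable bytes become lone surrogates U+DCE9, U+DCC2, ...;
    # they ride along through str processing and reverse losslessly:
    assert text.encode('utf-8', errors='surrogateescape') == raw

The same errors='surrogateescape' argument works with open(), and it is
what Python3 itself uses when decoding filenames from the OS, which is
why os.listdir() can return str names even for undecodable bytes.
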
    Yes, these are different workarounds.

Oleg.
