On Mon, 2 May 2022 at 09:19, Dan Stromberg <drsali...@gmail.com> wrote:
>
> On Sun, May 1, 2022 at 3:19 PM Cameron Simpson <c...@cskk.id.au> wrote:
>
> > On 01May2022 18:55, Marco Sulla <marco.sulla.pyt...@gmail.com> wrote:
> > >Something like this is OK?
> >
>
> Scanning backward for a byte == 10 in ASCII or ISO-8859 seems fine.
>
> But what about Unicode? Are all 10 bytes newlines in Unicode encodings?

Most absolutely not. "Unicode" isn't an encoding, but of the Unicode
Transformation Formats and Universal Character Set encodings, most
don't make that guarantee:

* UTF-8 does, as mentioned. It sacrifices some efficiency and
  consistency for a guarantee that ASCII characters are represented by
  ASCII bytes, and ASCII bytes only ever represent ASCII characters.
* UCS-2 and UTF-16 will both represent BMP characters with two bytes.
  Any character U+xx0A or U+0Axx will include an 0x0A in its
  representation.
* UTF-16 will also encode anything U+000xxx0A (that is, above the BMP
  and ending in 0A) with an 0x0A, in the low surrogate. (And I don't
  think any codepoints have been allocated that would trigger this, but
  UTF-16 can also use 0x0A in the high surrogate.)
* UTF-32 and UCS-4 will use 0x0A for any character U+xx0A, U+0Axx, and
  U+Axxxx (though that plane has no characters on it either)

So, of all the available Unicode standard encodings, only UTF-8 makes
this guarantee.

Of course, if you look at documents available on the internet, UTF-8
is the encoding used by the vast majority of them (especially if you
include seven-bit files, which can equally be considered ASCII,
ISO-8859-x, and UTF-8), so while it might only be one encoding out of
many, it's probably the most important :)

In general, you can *only* make this parsing assumption IF you know
for sure that your file's encoding is UTF-8, ISO-8859-x, some OEM
eight-bit encoding (e.g. Windows-125x), or one of a handful of other
compatible encodings. But it probably will be.

> If not, and you have a huge file to reverse, it might be better to use a
> temporary file.

Yeah, or an in-memory deque if you know how many lines you want.
Either way, you can read the file forwards, guaranteeing correct
decoding even of a shifted character set (where a byte value can
change in meaning based on arbitrarily distant context).

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
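
For anyone who wants to poke at this, here is a minimal sketch of both
points above: which encodings put a stray 0x0A byte into a character
that is not a newline, and the read-forwards deque approach. The names
encodings_with_stray_lf and last_lines are made up for illustration,
and the list of encodings checked is just a sample, not an exhaustive
survey.

```python
from collections import deque

# A sample of encodings to compare (an assumption, not a complete list).
ENCODINGS = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]


def encodings_with_stray_lf(ch):
    """Return the sample encodings in which ch's encoded bytes contain 0x0A."""
    return [enc for enc in ENCODINGS if b"\n" in ch.encode(enc)]


def last_lines(path, n, encoding="utf-8"):
    """Read the file forwards as text, keeping only the final n lines.

    Decoding happens front to back, so this does not rely on 0x0A
    always meaning "newline" at the byte level.
    """
    with open(path, encoding=encoding) as f:
        # A bounded deque discards older lines as newer ones arrive.
        return list(deque(f, maxlen=n))
```

Calling encodings_with_stray_lf("\u010a") (U+010A is an ordinary
letter, not a newline) should list every UTF-16 and UTF-32 variant in
the sample but not UTF-8, which is the whole problem with scanning
backwards for byte 10. The deque version costs a full read of the
file, but only ever holds n lines in memory.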