On Sun, 8 May 2022 at 20:31, Barry Scott <ba...@barrys-emacs.org> wrote:
>
> > On 8 May 2022, at 17:05, Marco Sulla <marco.sulla.pyt...@gmail.com> wrote:
> >
> > def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
> >     n_chunk_size = n * chunk_size
>
> Why use tiny chunks? You can read 4KiB as fast as 100 bytes as its typically
> the smaller size the file system will allocate.
> I tend to read on multiple of MiB as its near instant.

Well, I tested on a little file, a list of my preferred pizzas, so...

> >     pos = os.stat(filepath).st_size
>
> You cannot mix POSIX API with text mode.
> pos is in bytes from the start of the file.
> Textmode will be in code points. bytes != code points.
>
> >     chunk_line_pos = -1
> >     lines_not_found = n
> >
> >     with open(filepath, newline=newline, encoding=encoding) as f:
> >         text = ""
> >
> >         hard_mode = False
> >
> >         if newline == None:
> >             newline = _lf
> >         elif newline == "":
> >             hard_mode = True
> >
> >         if hard_mode:
> >             while pos != 0:
> >                 pos -= n_chunk_size
> >
> >                 if pos < 0:
> >                     pos = 0
> >
> >                 f.seek(pos)
>
> In text mode you can only seek to a value return from f.tell() otherwise the
> behaviour is undefined.

Why? I don't see any recommendation about it in the docs:
https://docs.python.org/3/library/io.html#io.IOBase.seek

> >                 text = f.read()
>
> You have on limit on the amount of data read.

I explained that previously. Anyway, chunk_size is small, so it's not
a great problem.

> >                 lf_after = False
> >
> >                 for i, char in enumerate(reversed(text)):
>
> Simple use text.rindex('\n') or text.rfind('\n') for speed.

I can't use them when I have to find either \n or \r. So I preferred to
simplify the code and use the for loop every time. Keep in mind, anyway,
that this is a prototype for a Python C API implementation (builtin I
hope, or a C extension if not).

> > Shortly, the file is always opened in text mode. File is read at the end in
> > bigger and bigger chunks, until the file is finished or all the lines are
> > found.
>
> It will fail if the contents is not ASCII.

Why?

> > Why? Because in encodings that have more than 1 byte per character, reading
> > a chunk of n bytes, then reading the previous chunk, can eventually split
> > the character between the chunks in two distinct bytes.
>
> No it cannot. text mode only knows how to return code points. Now if you are
> in binary it could be split, but you are not in binary mode so it cannot.

From the docs:

    seek(offset, whence=SEEK_SET)
        Change the stream position to the given byte offset.

> > Do you think there are chances to get this function as a method of the file
> > object in CPython? The method for a file object opened in bytes mode is
> > simpler, since there's no encoding and newline is only \n in that case.
>
> State your requirements. Then see if your implementation meets them.

The method should return the last n lines from a file object.
If the file object is in text mode, the newline parameter must be honored.
If the file object is in binary mode, a newline is always b"\n", to be
consistent with readline.

I suppose the current implementation of tail satisfies the requirements
for text mode. The previous one satisfied binary mode.

Anyway, apart from my implementation, I'm curious whether you think a tail
method is worth adding as a method of the builtin file objects in CPython.
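
To make the binary-mode requirement above concrete, here is a minimal sketch of it. It is not the implementation discussed in this thread: tail_bytes, its parameters and the 4096-byte default are hypothetical names and values chosen only for illustration. Since the file is opened in binary mode, every offset is a plain byte offset and only b"\n" is treated as a line terminator.

```python
import os


def tail_bytes(path, n=10, chunk_size=4096):
    """Return the last n lines of a file as a list of bytes objects.

    Hypothetical helper, for illustration only. A line ends at b"\n";
    a lone b"\r" is left untouched, as with readline() in binary mode.
    """
    if n <= 0:
        return []

    with open(path, "rb") as f:
        # In binary mode seek() returns the new position in bytes,
        # so this is the file size.
        pos = f.seek(0, os.SEEK_END)
        data = b""

        # Read backwards, chunk by chunk, until the buffer holds more
        # than n newlines (so the first of the n lines is complete)
        # or the start of the file is reached.
        while pos > 0 and data.count(b"\n") <= n:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data

    parts = data.split(b"\n")
    # Every part except the last one was followed by a newline.
    lines = [part + b"\n" for part in parts[:-1]]
    if parts[-1]:
        # The file does not end with a newline: keep the last line as is.
        lines.append(parts[-1])

    return lines[-n:]
```

Called as tail_bytes("pizzas.txt", 3), with a hypothetical file name, it would return the last three lines including their terminators. The text-mode case is harder precisely because of the bytes versus code points issue debated above, and this sketch does not attempt it.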