[Python-Dev] Re: os.scandir bug in Windows?

Eryk Sun Tue, 20 Oct 2020 07:07:13 -0700

On 10/19/20, Greg Ewing <greg.ew...@canterbury.ac.nz> wrote:
> On 20/10/20 4:52 am, Gregory P. Smith wrote:
>> Those of us with a traditional posix filesystem background may raise
>> eyeballs at this duplication, seeing a directory as a place that merely
>> maps names to inodes
>
> This is probably a holdover from MS-DOS, where there was no separate
> inode-like structure -- it was all in the directory entry.

DOS implemented a find-first/find-next API (int 21h 4E/4F) that
provided a file's name, attributes, size, and last write time/date. I
think it's clear that the design was influenced by the
readily-available contents of a FAT dirent. The Win32 API extended
this to FindFirstFile/FindNextFile, with added support for the long
filename, create and access times, and, in NT 5+, the reparse tag for
a reparse point.

NTFS had to support this metadata in the directory index, else
FindFirstFile/FindNextFile would be too expensive if the filesystem
had to fetch the metadata from the MFT for every matching file in a
listing. It tries to keep the duplicated metadata in sync -- such as
when a file is opened, closed, or manually extended in size, when the
cache is flushed, or when metadata is explicitly set (e.g.
SetFileInformationByHandle: FileBasicInfo). But for performance it
doesn't update the duplicated data every time a file is read from or
written to. And, in particular, if it's just the access time that
changed, it updates the duplicated access time with a one-hour
granularity. (There's also a registry value, as I mentioned
previously, that disables updating access times completely -- in both
the MFT record and the directory index.)

That said, if a file has multiple hardlinks, the current NTFS
implementation for updating duplicated data is totally unreliable. It
only updates the directory entry of the link that was accessed. All
other links go stale. We don't have any reasonable way to special-case
this situation because the directory entry doesn't include the number
of links a file has. The file has to be opened and queried directly,
but then one might as well do a full stat() for every file.
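
To see how far the duplicated data has drifted in a given directory,
something like the following rough sketch could be used. It compares
what os.scandir() hands back from the directory listing with a fresh
os.stat() of the same path. The helper name stale_entries is made up
for illustration, and comparing only size and last write time is my
assumption about what is worth checking. On POSIX, entry.stat() makes
its own lstat call, so the generator should normally yield nothing
there.

```python
import os


def stale_entries(path):
    """Yield (name, cached, fresh) for entries whose scandir() stat data
    disagrees with a fresh os.stat() of the same path.

    Hypothetical helper, for illustration only.
    """
    with os.scandir(path) as it:
        for entry in it:
            # On Windows this comes from the directory listing, with no
            # extra system call when follow_symlinks is False.
            cached = entry.stat(follow_symlinks=False)
            # This queries the file itself, so it reflects current
            # metadata.
            fresh = os.stat(entry.path, follow_symlinks=False)
            if (cached.st_size, cached.st_mtime_ns) != (
                fresh.st_size,
                fresh.st_mtime_ns,
            ):
                yield entry.name, cached, fresh


if __name__ == "__main__":
    for name, cached, fresh in stale_entries("."):
        # fresh.st_nlink is the real link count; on Windows the
        # DirEntry.stat() result always reports st_nlink as 0.
        print(name, cached.st_size, fresh.st_size, fresh.st_nlink)
```

Note that this pays for a full stat of every entry, which is exactly
the cost that scandir() is meant to avoid, so it's only useful as a
diagnostic.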

To quickly process a directory, I recommend relying only on the
high-level is_dir(), is_file(), and is_symlink() methods of
os.scandir() items. inode() is reliable -- as much as is possible in
Windows -- because the implementation gets the full stat info, but
check to ensure it's not 0. It's based on the file ID, which Windows
filesystems aren't required to support (or reliably support; it's not
stable in FAT). NTFS and ReFS support reliable 64-bit file IDs, and
opening by file ID.
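
As a minimal sketch of that recommendation, the following uses only
the type checks while walking a directory, and treats an inode() of 0
as "no usable file ID". The names classify and file_id are
hypothetical helpers, not anything in the os module, and returning
None for a zero ID is just one way to handle it.

```python
import os


def classify(path):
    """Count entries by type using only the DirEntry type checks."""
    counts = {"dirs": 0, "files": 0, "symlinks": 0, "other": 0}
    with os.scandir(path) as it:
        for entry in it:
            # Check for a symlink first, and don't follow links in the
            # other checks, so each entry lands in exactly one bucket.
            if entry.is_symlink():
                counts["symlinks"] += 1
            elif entry.is_dir(follow_symlinks=False):
                counts["dirs"] += 1
            elif entry.is_file(follow_symlinks=False):
                counts["files"] += 1
            else:
                counts["other"] += 1
    return counts


def file_id(entry):
    """Return entry.inode(), or None if the filesystem reports 0."""
    ino = entry.inode()
    return ino if ino != 0 else None


if __name__ == "__main__":
    print(classify("."))
    with os.scandir(".") as it:
        for entry in it:
            print(entry.name, file_id(entry))
```

Sizes and timestamps are deliberately left out here. If they matter,
call os.stat() on the path rather than using entry.stat().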
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/JKK47AWKUOWPPBEAIRGIFRMW6FCPZILG/
