on 04.05.2006 14:57 Nick Coghlan said the following: > Mike Orr wrote: >> Intriguing idea, Noam, and excellent thinking. I'd say it's worth a >> separate PEP. It's too different to fit into PEP 355, and too big to >> be summarized in the "Open Issues" section. Of course, one PEP will >> be rejected if the other is approved. > > I agree that a competing PEP is probably the best way to track this idea.
<snip> >>> This means that path objects aren't the string representation of a >>> path; they are a ''logical'' representation of a path. Remember why a >>> filesystem path is called a path - because it's a way to get from one >>> place on the filesystem to another. Paths can be relative, which means >>> that they don't define from where to start the walk, and can be not >>> relative, which means that they do. In the tuple representation, >>> relative paths are simply tuples of strings, and not relative paths >>> are tuples of strings with a first "root" element. > > I suggest storing the first element separately from the rest of the path. The > reason for suggesting this is that you use 'os.sep' to separate elements in > the normal path, but *not* to separate the first element from the rest. I want to add that people might want to manipulate paths that are not for the currently running OS. Therefore I think the `sep` should be an attribute of the "root" element. For the same reason I'd like to add two values to the following list: > Possible values for the path's root element would then be: > > None ==> relative path (uses os.sep) + path.UNIXRELATIVE ==> uses '/' + path.WINDOWSRELATIVE ==> uses r'\' unconditionally > path.ROOT ==> Unix absolute path > path.DRIVECWD ==> Windows drive relative path > path.DRIVEROOT ==> Windows drive absolute path > path.UNCSHARE ==> UNC path > path.URL ==> URL path > > The last four would have attributes (the two Windows ones to get at the drive > letter, the UNC one to get at the share name, and the URL one to get at the > text of the URL). > > Similarly, I would separate out the extension to a distinct attribute, as it > too uses a different separator from the normal path elements ('.' 
most > places, > but '/' on RISC OS, for example) > > The string representation would then be: > > def __str__(self): > return (str(self.root) > + os.sep.join(self.path) > + os.extsep + self.ext) > >>> The advantage of using a logical representation is that you can forget >>> about the textual representation, which can be really complex. > > As noted earlier - text is a great format for path related I/O. It's a lousy > format for path manipulation. > >>> {{{ >>> p.normpath() -> Isn't needed - done by the constructor >>> p.basename() -> p[-1] >>> p.splitpath() -> (p[:-1], p[-1]) >>> p.splitunc() -> (p[0], p[1:]) (if isinstance(p[0], path.UNCRoot)) >>> p.splitall() -> Isn't needed >>> p.parent -> p[:-1] >>> p.name -> p[-1] >>> p.drive -> p[0] (if isinstance(p[0], path.Drive)) >>> p.uncshare -> p[0] (if isinstance(p[0], path.UNCRoot)) >>> }}} > > These same operations using separate root and path attributes: > > p.basename() -> p[-1] > p.splitpath() -> (p[:-1], p[-1]) > p.splitunc() -> (p.root, p.path) > p.splitall() -> Isn't needed > p.parent -> p[:-1] > p.name -> p[-1] > p.drive -> p.root.drive (AttributeError if not drive based) > p.uncshare -> p.root.share (AttributeError if not drive based) > >> That's a big drawback. PEP 355 can choose between string and >> non-string, but this way is limited to non-string. That raises the >> minor issue of changing the open() functions etc in the standard >> library, and the major issue of changing them in third-party >> libraries. > > It's not that big a drama, really. All you need to do is call str() on your > path objects when you're done manipulating them. The third party libraries > don't need to know how you created your paths, only what you ended up with. > > Alternatively, if the path elements are stored in separate attributes, > there's > nothing stopping the main object from inheriting from str or unicode the way > the PEP 355 path object does. 
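To make the three-piece idea concrete, here is a minimal sketch of a path object that keeps the root, the elements and the extension apart, with the separator stored on the root element as I suggested above. The class names `Root` and `Path`, the `prefix` attribute and the `extsep` argument are my own assumptions for illustration; neither PEP 355 nor Noam's module defines them.

```python
class Root:
    """Hypothetical root element that carries its own separator."""

    def __init__(self, prefix, sep):
        self.prefix = prefix  # text emitted before the first path element
        self.sep = sep        # separator used between path elements

    def __str__(self):
        return self.prefix


# Illustrative constants; the names follow the list quoted above.
UNIXRELATIVE = Root("", "/")
WINDOWSRELATIVE = Root("", "\\")
ROOT = Root("/", "/")


class Path:
    """Hypothetical logical path: root + tuple of elements + extension."""

    def __init__(self, root, elements, ext=None, extsep="."):
        self.root = root
        self.path = tuple(elements)
        self.ext = ext
        self.extsep = extsep

    def __str__(self):
        # The separator comes from the root, not from the running OS.
        text = str(self.root) + self.root.sep.join(self.path)
        if self.ext is not None:
            text += self.extsep + self.ext
        return text

    @property
    def parent(self):
        # Dropping the last element also drops the extension.
        return type(self)(self.root, self.path[:-1])


p = Path(ROOT, ("usr", "lib", "python"), ext="py")
w = Path(WINDOWSRELATIVE, ("docs", "readme"), ext="txt")
print(p)         # /usr/lib/python.py
print(w)         # docs\readme.txt
print(p.parent)  # /usr/lib
```

Because the separator lives on the root, the Windows-style path renders with backslashes regardless of the platform the code runs on.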
> > Either way, this object would still be far more convenient for manipulating > paths than a string based representation that has to deal with OS-specific > issues on every operation, rather than only during creation and conversion to > a string. The path objects would also serve as an OS-independent > representation of filesystem paths. > > In fact, I'd leave most of the low-level API's working only on strings - the > only one I'd change to accept path objects directly is open() (which would be > fairly easy, as that's a factory function now). > >>> This means that paths starting with a drive letter alone >>> (!UnrootedDrive instance, in my module) and paths starting with a >>> backslash alone (the CURROOT object, in my module) are not relative >>> and not absolute. >> I guess that's plausable. We'll need feedback from Windows users. > > As suggested above, I think the root element should be stored separately from > the rest of the path. Then adding a new kind of root element (such as a URL) > becomes trivial. > >> The question is, does forcing people to use .stat() expose an >> implementation detail that should be hidden, and does it smell of >> Unixism? Most people think a file *is* a regular file or a directory. >> The fact that this is encoded in the file's permission bits -- which >> stat() examines -- is a quirk of Unix. > > I wouldn't expose stat() - as you say, it's a Unixism. Instead, I'd provide a > subclass of Path that used lstat instead of stat for symbolic links. > > So if I want symbolic links followed, I use the normal Path class. This class > just generally treat symbolic links as if they were the file pointed to > (except for the whole not recursing into symlinked subdirectories thing). > > The SymbolicPath subclass would treat normal files as usual, but *wouldn't* > follow symbolic links when stat'ting files (instead, it would stat the > symlink). 
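The Path/SymbolicPath split described above could look roughly like the following. This is only a sketch under my own assumptions: the string-backed constructor and the private `_stat` hook are hypothetical, and only `os.stat`, `os.lstat` and the `stat` module helpers are real library calls.

```python
import os
import stat


class Path:
    """Hypothetical path object that follows symbolic links."""

    def __init__(self, text):
        self._text = text

    def __str__(self):
        return self._text

    def _stat(self):
        # os.stat() follows symlinks and reports on the target.
        return os.stat(self._text)

    def is_dir(self):
        try:
            return stat.S_ISDIR(self._stat().st_mode)
        except OSError:
            return False  # missing or unreadable entries are "not a dir"

    def is_file(self):
        try:
            return stat.S_ISREG(self._stat().st_mode)
        except OSError:
            return False

    def is_link(self):
        try:
            return stat.S_ISLNK(os.lstat(self._text).st_mode)
        except OSError:
            return False


class SymbolicPath(Path):
    """Same interface, but inspects the symlink itself."""

    def _stat(self):
        # os.lstat() does not follow symlinks.
        return os.lstat(self._text)
```

With this split the caller never touches stat() directly; choosing the class chooses the symlink policy.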
> >>> == One Method for Finding Files == >>> >>> (They're actually two, but with exactly the same interface). The >>> original path object has these methods for finding files: >>> >>> {{{ >>> def listdir(self, pattern = None): ... >>> def dirs(self, pattern = None): ... >>> def files(self, pattern = None): ... >>> def walk(self, pattern = None): ... >>> def walkdirs(self, pattern = None): ... >>> def walkfiles(self, pattern = None): ... >>> def glob(self, pattern): >>> }}} >>> >>> I suggest one method that replaces all those: >>> {{{ >>> def glob(self, pattern='*', topdown=True, onlydirs=False, onlyfiles=False): >>> ... >>> }}} > > Swiss army methods are even more evil than wide APIs. And I consider the term > 'glob' itself to be a Unixism - I've found the technique to be far more > commonly known as wildcard matching in the Windows world. > > The path module has those methods for 7 distinct use cases: > - list the contents of this directory > - list the subdirectories of this directory > - list the files in this directory > - walk the directory tree rooted at this point, yielding both files and > dirs > - walk the directory tree rooted at this point, yielding only the dirs > - walk the directory tree rooted at this point, yielding only the files > - walk this pattern > > The first 3 operations are far more common than the last 4, so they need to > stay. 
>
>     # Needs at module level: import fnmatch, glob, os
>
>     def entries(self, pattern=None):
>         """Return list of all entries in directory"""
>         _path = type(self)
>         dirname = str(self)
>         # os.listdir() returns bare names as strings, so join them to the
>         # directory and wrap them before calling any path methods
>         all_entries = [_path(os.path.join(dirname, x))
>                        for x in os.listdir(dirname)]
>         if pattern is not None:
>             return [x for x in all_entries if x.matches(pattern)]
>         return all_entries
>
>     def subdirs(self, pattern=None):
>         """Return list of all subdirectories in directory"""
>         return [x for x in self.entries(pattern) if x.is_dir()]
>
>     def files(self, pattern=None):
>         """Return list of all files in directory"""
>         return [x for x in self.entries(pattern) if x.is_file()]
>
>     # here's sample implementations of the test methods used above
>     def matches(self, pattern):
>         # Match against the final name only, not the whole path
>         return fnmatch.fnmatch(os.path.basename(str(self)), pattern)
>     def is_dir(self):
>         return os.path.isdir(str(self))
>     def is_file(self):
>         return os.path.isfile(str(self))
>
> For the tree traversal operations, there are still multiple use cases:
>
>     def walk(self, topdown=True, onerror=None):
>         """ Walk directories and files just as os.walk does"""
>         # Similar to os.walk, only yielding Path objects instead of strings
>         # For each directory, effectively returns:
>         #    yield dirpath, dirpath.subdirs(), dirpath.files()
>
>
>     def walkdirs(self, pattern=None, onerror=None):
>         """Only walk directories matching pattern"""
>         for dirpath, subdirs, files in self.walk(onerror=onerror):
>             yield dirpath
>             if pattern is not None:
>                 # Modify in-place so that walk() responds to the change
>                 subdirs[:] = [x for x in subdirs if x.matches(pattern)]
>
>     def walkfiles(self, pattern=None, onerror=None):
>         """Only walk file names matching pattern"""
>         for dirpath, subdirs, files in self.walk(onerror=onerror):
>             if pattern is not None:
>                 for f in files:
>                     if f.matches(pattern):
>                         yield f
>             else:
>                 for f in files:
>                     yield f
>
>     def walkpattern(self, pattern):
>         """Only walk paths matching glob pattern"""
>         _factory = type(self)
>         for pathname in glob.glob(pattern):
>             yield _factory(pathname)
>
>
>>> pattern is the good old glob pattern, with one additional extension: >>>
"**" matches any number of subdirectories, including 0. This means >>> that '**' means "all the files in a directory", '**/a' means "all the >>> files in a directory called a", and '**/a*/**/b*' means "all the files >>> in a directory whose name starts with 'b' and the name of one of their >>> parent directories starts with 'a'". >> I like the separate methods, but OK. I hope it doesn't *really* call >> glob if the pattern is the default. > > Keep the separate methods. Trying to squeeze too many disparate use cases > through a single API is a bad idea. Directory listing and tree-traversal are > not the same thing. Path matching and filename matching are not the same > thing > either. > >> Or one could, gasp, pass a constant or the 'find' command's >> abbreviation ("d" directory, "f" file, "s" socket, "b" block >> special...). > > Magic letters in an API are just as bad as magic numbers :) > > More importantly, these things don't port well between systems. > >>> In my proposal: >>> >>> {{{ >>> def copy(self, dst, copystat=False): ... >>> }}} >>> >>> It's just that I think that copyfile, copymode and copystat aren't >>> usually useful, and there's no reason not to unite copy and copy2. >> Sounds good. > > OK, this is one case where a swiss army method may make sense. Specifically, > something like: > > def copy_to(self, dest, copyfile=True, copymode=True, copytime=False) > > Whether or not to copy the file contents, the permission settings and the > last > access and modification time are then all independently selectable. > > The different method name also makes the direction of the copying clear (with > a bare 'copy', it's slightly ambiguous as the 'cp src dest' parallel isn't as > strong as it is with a function). > >> I was wondering what the fallout would be of normalizing "a/../b" and >> "a/./b" and "a//b", but it sounds like you're thinking about it. 
> > The latter two are OK, but normalizing the first one can give you the wrong > answer if 'a' is a symlink (since 'a/../b' is then not necessarily the same > as > 'b'). > > Better to just leave the '..' in and not treat it as something that can be > normalised away. > >>> I removed the methods associated with file extensions. I don't recall >>> using them, and since they're purely textual and not OS-dependent, I >>> think that you can always do p[-1].rsplit('.', 1). > > Most modern OS's use '.' as the extension separator, true, but os.extsep > still > exists for a reason :) > >> .namebase is an obnoxious name though. I wish we could come up with >> something better. > > p.path[-1] :) > > Then p.name can just do (p.path[-1] + os.extsep + p.ext) to rebuild the full > filename including the extension (if p.ext was None, then p.name would be the > same as p.path[-1]) > >>> I removed expand. There's no need to use normpath, so it's equivalent >>> to .expanduser().expandvars(), and I think that the explicit form is >>> better. >> Expand is useful though, so you don't forget one or the other. > > And as you'll usually want to do both, adding about 15 extra characters for > no > good reason seems like a bad idea. . . > >>> copytree - I removed it. In shutil it's documented as being mostly a >>> demonstration, and I'm not sure if it's really useful. >> Er, not sure I've used it, but it seems useful. Why force people to >> reinvent the wheel with their own recursive loops that they may get >> wrong? > > Because the handling of exceptional cases is almost always going to be > application specific. Note that even os.walk provides a callback hook for if > the call to os.listdir() fails when attempting to descend into a directory. > > For copytree, the issues to be considered are significantly worse: > - what to do if listdir fails in the source tree? > - what to do if reading a file fails in the source tree? > - what to do if a directory doesn't exist in the target tree? 
> - what to do if a directory already exists in the target tree?
> - what to do if a file already exists in the target tree?
> - what to do if writing a file fails in the target tree?
> - should the file contents/mode/time be copied to the target tree?
> - what to do with symlinks in the source tree?
>
> Now, what might potentially be genuinely useful is paired walk methods that
> allowed the following:
>
>     # Do path.walk over this directory, and also return the corresponding
>     # information for a destination directory (so the dest dir information
>     # probably *won't* match that file system)
>     for src_info, dest_info in src_path.pairedwalk(dest_path):
>         src_dirpath, src_subdirs, src_files = src_info
>         dest_dirpath, dest_subdirs, dest_files = dest_info
>         # Do something useful
>
>     # Ditto for path.walkdirs
>     for src_dirpath, dest_dirpath in src_path.pairedwalkdirs(dest_path):
>         # Do something useful
>
>     # Ditto for path.walkfiles
>     for src_file, dest_file in src_path.pairedwalkfiles(dest_path):
>         src_file.copy_to(dest_file)
>
>> You've got two issues here. One is to go to a tuple base and replace
>> several properties with slicing. The other is all your other proposed
>> changes. Ideally the PEP would be written in a way that these other
>> changes can be propagated back and forth between the PEPs as consensus
>> builds.
>
> The main thing Jason's path object has going for it is that it brings together
> Python's disparate filesystem manipulation APIs into one place. Using it is a
> definite improvement over using the standard lib directly.
>
> However, by choosing to use a string as the internal storage instead of a more
> appropriate format (such as the three-piece structure I suggest of root, path,
> extension), it doesn't do as much as it could to abstract away the hassles of
> os.sep and os.extsep.
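For what it's worth, the paired walk quoted above is easy to prototype on top of os.walk. The sketch below is only an illustration under my own assumptions: it works on plain strings rather than path objects, and the function names `paired_walk_files` and `copy_tree_files` are hypothetical. Every policy question from the copytree list (existing files, errors, symlinks) is deliberately left to the caller.

```python
import os
import shutil


def paired_walk_files(src_root, dest_root):
    """Yield (src_file, dest_file) pairs for every file below src_root.

    The destination tree is never inspected; dest_file is simply where
    the file would live if the source layout were mirrored.
    """
    for dirpath, subdirs, files in os.walk(src_root):
        # Position of this directory relative to the source root
        rel = os.path.relpath(dirpath, src_root)
        dest_dir = os.path.normpath(os.path.join(dest_root, rel))
        for name in files:
            yield os.path.join(dirpath, name), os.path.join(dest_dir, name)


def copy_tree_files(src_root, dest_root):
    """One possible caller: copy contents only, overwriting existing files."""
    for src_file, dest_file in paired_walk_files(src_root, dest_root):
        os.makedirs(os.path.dirname(dest_file), exist_ok=True)
        shutil.copyfile(src_file, dest_file)
```

An application that wants different behaviour (skip existing files, copy modes, log failures) writes its own loop body instead of `copy_tree_files`, which is the point of exposing the walk rather than a fixed copytree.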
>
> By focusing on the idea that strings are for path input and output operations,
> rather than for path manipulation, it should be possible to build something
> even more usable than path.py
>
> If it was done within the next several months and released on PyPI, it might
> even be a contender for 2.6.
>
> Cheers,
> Nick.