On Sun, Feb 5, 2012 at 18:49, Joey Hess <[email protected]> wrote:
> John Millikin wrote:
>> In GHC 7.2 and later, file path handling in the platform libraries
>> was changed to treat all paths as text (encoded according to locale).
>> This does not work well on POSIX systems, because POSIX paths are byte
>> sequences. There is no guarantee that any particular path will be
>> valid in the user's locale encoding.
>
> I've been dealing with this change too, but my current understanding
> is that GHC's handling of encoding for FilePath is documented to allow
> "arbitrary undecodable bytes to be round-tripped through it".
>
> As long as FilePaths are read using this file system encoding, any
> FilePath should be usable even if it does not match the user's encoding.

That was my understanding also, until QuickCheck found a
counter-example. It turns out that there are cases where a valid path
cannot be round-tripped in the GHC 7.2 encoding.

--------------------------------------------------------------------------
$ ~/ghc-7.0.4/bin/ghci
Prelude> writeFile ".txt" "test"
Prelude> readFile ".txt"
"test"
Prelude>

$ ~/ghc-7.2.1/bin/ghci
Prelude> import System.Directory
Prelude System.Directory> getDirectoryContents "."
["\61347.txt","\61347.txt","..","."]
Prelude System.Directory> readFile "\61347.txt"
*** Exception: .txt: openFile: does not exist (No such file or directory)
Prelude System.Directory>
--------------------------------------------------------------------------

The issue is that [238,189,178] decodes to 0xEF72, which is within the
0xEF00-0xEFFF range that GHC uses to represent un-decodable bytes.

> For FFI, anything that deals with a FilePath should use this
> withFilePath, which GHC contains but doesn't export(?), rather than the
> old withCString or withCAString:
>
> import GHC.IO.Encoding (getFileSystemEncoding)
> import GHC.Foreign as GHC
>
> withFilePath :: FilePath -> (CString -> IO a) -> IO a
> withFilePath fp f = getFileSystemEncoding >>= \enc -> GHC.withCString enc fp f

If code uses either withFilePath or withCString, then the filenames
written will depend on the user's locale. This is wrong. Filenames are
either non-encoded text strings (Windows), UTF8 (OSX), or arbitrary
bytes (non-OSX POSIX). They must not change depending on the locale.

> Code that reads or writes a FilePath to a Handle (including even to
> stdout!) must take care to set the right encoding too:
>
> fileEncoding :: Handle -> IO ()
> fileEncoding h = hSetEncoding h =<< getFileSystemEncoding

This is also wrong. A "file path" cannot be written to a handle with
any hope of correct behavior. If it's to be displayed to the user, a
path should be converted to text first, then displayed.

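To make the collision behind the counter-example above concrete, here is a
rough sketch of the arithmetic. It is not GHC's decoder: decode3 and
inEscapeRange are names made up for illustration, decode3 only handles a
single three-byte UTF-8 sequence and does not reject overlong or otherwise
invalid forms, and the escape range is simply the 0xEF00-0xEFFF range
quoted in this message.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical helper: decode one three-byte UTF-8 sequence of the form
-- 1110xxxx 10xxxxxx 10xxxxxx. Anything else yields Nothing. This is an
-- illustration only, not a general or validating UTF-8 decoder.
decode3 :: [Word8] -> Maybe Char
decode3 [a, b, c]
  | a .&. 0xF0 == 0xE0, b .&. 0xC0 == 0x80, c .&. 0xC0 == 0x80 =
      Just (chr (((fromIntegral a .&. 0x0F) `shiftL` 12)
             .|. ((fromIntegral b .&. 0x3F) `shiftL` 6)
             .|. (fromIntegral c .&. 0x3F)))
decode3 _ = Nothing

-- Hypothetical helper: is this character inside the range that, per the
-- message above, GHC 7.2 reserves for escaped un-decodable bytes?
inEscapeRange :: Char -> Bool
inEscapeRange ch = ord ch >= 0xEF00 && ord ch <= 0xEFFF
```

Loaded into ghci, fmap ord (decode3 [238,189,178]) gives Just 61298, which
is 0xEF72, and inEscapeRange is True for that character. So a perfectly
valid UTF-8 filename decodes to a character that the escaping scheme
treats as a stand-in for a raw byte.
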
>> * system-filepath has been converted from GHC's escaping rules to its
>> own, more compatible rules. This lets it support file paths that
>> cannot be represented in GHC 7.2's escape format.
>
> I'm doubtful about adding yet another encoding to the mix. Things are
> complicated enough already! And in my tests, GHC 7.4's FilePath encoding
> does allow arbitrary bytes in FilePaths.

Unlike the GHC encoding, this encoding is entirely internal, and
should not change the API's behavior.

> BTW, GHC now also has RawFilePath. Parts of System.Directory could be
> usefully written to support that data type too. For example, the parent
> directory can be determined. Other things are more difficult to do with
> RawFilePath.

This is new in 7.4, and won't be backported, right? I tried compiling
the new "unix" package in 7.2 to get proper file path support, but it
failed with an error about some new language extension.

_______________________________________________
Haskell-Cafe mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/haskell-cafe
