Re: behaviour change in getDirectoryContents in GHC 7.2?
On Thu, Nov 10, 2011 at 03:28, Simon Marlow marlo...@gmail.com wrote:
> I've done a search/replace and called it RawFilePath. Ok?

Fantastic, thank you very much.

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Re: behaviour change in getDirectoryContents in GHC 7.2?
On Wed, Nov 9, 2011 at 08:04, Simon Marlow marlo...@gmail.com wrote:
> Ok, I spent most of today adding ByteString alternatives for all of the functions in System.Posix that use FilePath or environment strings. The Haddocks for my augmented unix package are here:
>
> http://community.haskell.org/~simonmar/unix-with-bytestring-extras/index.html
>
> In particular, the module System.Posix.ByteString is the whole System.Posix API but with ByteString FilePaths and environment strings:
>
> http://community.haskell.org/~simonmar/unix-with-bytestring-extras/System-Posix-ByteString.html

This looks lovely -- thank you. Once it's released, I'll port all my libraries over to using it.

> It has one addition relative to System.Posix:
>
>   getArgs :: IO [ByteString]

Thank you very much! Several tools I use daily accept binary data as command-line options, and this will make it much easier to port them to Haskell in the future.

> Let me know what you think. I suspect the main controversial aspect is that I included
>
>   type FilePath = ByteString
>
> which is a bit cute but might be confusing.

Indeed, I was very confused when I saw that in the docs. If it's not too much trouble, could those functions accept/return ByteString directly?
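As a rough illustration of what a ByteString-based System.Posix interface allows, the sketch below lists a directory without decoding any of the names. It assumes the module System.Posix.Directory.ByteString as shipped in later releases of the unix package, which may not match the preview Haddocks linked above in every detail; listRaw is a hypothetical helper name, not part of any package.

```haskell
import Control.Exception (bracket)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import System.Posix.Directory.ByteString
  (closeDirStream, openDirStream, readDirStream)

-- Hypothetical helper: return the raw entry names of a directory.
-- No locale decoding is involved; each name is the byte sequence
-- reported by readdir.
listRaw :: ByteString -> IO [ByteString]
listRaw dir = bracket (openDirStream dir) closeDirStream collect
  where
    -- readDirStream returns an empty name at the end of the stream.
    collect stream = do
      name <- readDirStream stream
      if B.null name
        then return []
        else fmap (name :) (collect stream)
```

In GHCi, `listRaw (B.pack [46])` lists the current directory (byte 46 is "."), and names that are not valid text in the current locale come back unchanged.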
Re: behaviour change in getDirectoryContents in GHC 7.2?
On Tue, Nov 8, 2011 at 03:04, Simon Marlow marlo...@gmail.com wrote:
>> As mentioned earlier in the thread, this behavior is breaking things. Due to an implementation error, programs compiled with GHC 7.2 on POSIX systems cannot open files unless their paths also happen to be valid text according to their locale. It is very difficult to work around this error, because the paths-are-text logic was placed at a very low level in the library stack.
>
> So your objection is that there is a bug? What if we fixed the bug?

My objection is that the current implementation provides no way to work around potential bugs.

GHC is software. Like all software, it contains errors, and new features are likely to contain more errors. When adding behavior like automatic path encoding, there should always be a way to avoid or work around it, in case a severe bug is discovered.

>>> It would probably be better to have an abstract FilePath type and to keep the original bytes, decoding on demand. But that is a big change to the API and would break much more code. One day we'll do this properly; for now we have this, which I think is a pretty reasonable compromise.
>>
>> Please understand, I am not arguing against the existence of this encoding layer in general. It's a fine idea for a simplistic high-level filesystem interaction library. But it should be *optional*, not part of the compiler or base.
>
> Ok, so I was about to reply and say that the low-level API is available via the unix and Win32 packages, and then I thought I should check first, and I discovered that even using System.Posix you get the magic encoding behaviour.
>
> I really think we should provide the native APIs. The problem is that the System.Posix.Directory API is all in terms of FilePath (=String), and if we gave that a different meaning from the System.Directory FilePaths then confusion would ensue. So perhaps we need to add another API to System.Posix with filesystem operations in terms of ByteString, and similarly for Win32.

+1

I think most users would be OK with having System.Posix treat FilePath differently, as long as this is clearly documented, but if you feel a separate API is better then I have no objection. As long as there's some way to say "I know what I'm doing, here's the bytes" to the library.

The Win32 package uses wide-character functions, so I'm not sure whether bytes would be appropriate there. My instinct says to stick with chars, via withCWString or equivalent. The package maintainer will have a better idea of what fits with the OS's idioms.
Re: behaviour change in getDirectoryContents in GHC 7.2?
On Mon, Nov 7, 2011 at 09:02, Simon Marlow marlo...@gmail.com wrote:
> I think you might be misunderstanding how the new API works. Basically, imagine a reversible transformation:
>
>   encode :: String -> [Word8]
>   decode :: [Word8] -> String
>
> this transformation is applied in the appropriate direction by the IO library to translate filesystem paths into FilePath and vice versa. No information is lost; furthermore you can apply the transformation yourself in order to recover the original [Word8] from a String, or to inject your own [Word8] file path.
>
> Ok?

I understand how the API is intended / designed to work; however, the implementation does not actually do this. My argument is that this transformation should be in a high-level library like directory, and the low-level libraries like base or unix ought to provide functions which do not transform their inputs. That way, when an error is found in the encoding logic, it can be fixed by just pushing a new version of the affected library to Hackage, instead of requiring a new version of the compiler.

I am also not convinced that it is possible to correctly implement either of these functions if their behavior is dependent on the user's locale.

> All this does is mean that the common case where you want to interpret file system paths as text works with no fuss, without breaking anything in the case when the file system paths are not actually text.

As mentioned earlier in the thread, this behavior is breaking things. Due to an implementation error, programs compiled with GHC 7.2 on POSIX systems cannot open files unless their paths also happen to be valid text according to their locale. It is very difficult to work around this error, because the paths-are-text logic was placed at a very low level in the library stack.

> It would probably be better to have an abstract FilePath type and to keep the original bytes, decoding on demand. But that is a big change to the API and would break much more code. One day we'll do this properly; for now we have this, which I think is a pretty reasonable compromise.

Please understand, I am not arguing against the existence of this encoding layer in general. It's a fine idea for a simplistic high-level filesystem interaction library. But it should be *optional*, not part of the compiler or base.

As implemented in GHC 7.2, this encoding is a complex and untested behavior with no escape hatch.
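For readers following the argument, the reversible transformation described above can be sketched roughly as follows. This is a deliberate simplification and not GHC's implementation: it treats only ASCII bytes as text and escapes every other byte into a private-use code point at 0xEF00 plus the byte value, the offset mentioned elsewhere in this thread, whereas the real encoder first tries to decode multi-byte sequences according to the locale. The names decodePath and encodePath are hypothetical.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Bytes to String: ASCII bytes map to themselves, all other bytes
-- are escaped as the code point 0xEF00 + byte (simplifying assumption).
decodePath :: [Word8] -> String
decodePath = map dec
  where
    dec b
      | b < 128   = chr (fromIntegral b)
      | otherwise = chr (0xEF00 + fromIntegral b)

-- String to bytes: the inverse of decodePath. Characters outside
-- ASCII and outside the escape range have no representation in this
-- simplified scheme, so the result is Nothing for them.
encodePath :: String -> Maybe [Word8]
encodePath = mapM enc
  where
    enc c
      | n < 128                    = Just (fromIntegral n)
      | n >= 0xEF80 && n <= 0xEFFF = Just (fromIntegral (n - 0xEF00))
      | otherwise                  = Nothing
      where
        n = ord c
```

Under these assumptions `encodePath (decodePath bs) == Just bs` holds for every byte list, which is the "no information is lost" property. The disagreement in this thread is about whether the shipped implementation, which also depends on the locale and on iconv, actually preserves that property.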
Re: behaviour change in getDirectoryContents in GHC 7.2?
On Mon, Nov 7, 2011 at 15:39, Yitzchak Gale g...@sefer.org wrote:
> The problem is that Haskell 98 specifies type FilePath = String. In retrospect, we now know that this is too simplistic. But that's what we have right now.

This is *a* problem, but not a particularly major one; the definition of paths in GHC 7.0 (text on some systems, bytes on others) is inelegant but workable.

The main problem, IMO, is that the semantics of openFile et al changed in a way that is impossible to check for statically, and there was no mention of this in the documentation.

It's one thing to make a change which will cause new compilation failures. It's quite another to introduce an undocumented change in important semantics.

>> As implemented in GHC 7.2, this encoding is a complex and untested behavior with no escape hatch.
>
> Isn't System.Posix.IO the escape hatch? Even though FilePath is still used there instead of ByteString as it should be, this is the low-level POSIX-specific library. So the old hack of interpreting the lowest 8 bits as bytes makes a lot more sense there.

System.Posix.IO, and the unix package in general, also perform the new path encoding/decoding.
Re: behaviour change in getDirectoryContents in GHC 7.2?
2011/11/6 Max Bolingbroke batterseapo...@hotmail.com:
> On 6 November 2011 04:14, John Millikin jmilli...@gmail.com wrote:
>> For what it's worth, on my Ubuntu system, Nautilus ignores the locale and just treats all paths as either UTF8 or invalid. To me, this seems like the most reasonable option; the concept of locale encoding is entirely vestigial, and should only be used in certain specialized cases.
>
> Unfortunately non-UTF8 locale encodings are seen in practice quite often. I'm not sure about Linux, but certainly lots of Windows systems are configured with a locale encoding like GBK or Big5.

This doesn't really matter for file paths, though. The Win32 file API uses wide-character functions, which ought to work with Unicode text regardless of what the user set their locale to.

>> Paths as text is what *Windows* programmers expect. Paths as bytes is what's expected by programmers on non-Windows OSes, including Linux and OS X.
>
> IIRC paths on OS X are guaranteed to be valid UTF-8. The only platform that uses bytes for paths (that we care about) is Linux.

UTF-8 is bytes. It can be treated as text in some cases, but it's better to think about it as bytes.

>> I'm not saying one is inherently better than the other, but considering that various UNIX and UNIX-like operating systems have been using byte-based paths for near on forty years now, trying to abolish them by redefining the type is not a useful action.
>
> We have to:
> 1. Provide an API that makes sense on all our supported OSes
> 2. Have getArgs :: IO [String]
> 3. Have it such that if you go to your console and write (./MyHaskellProgram 你好) then getArgs tells you ["你好"]
>
> Given these constraints I don't see any alternative to PEP-383 behaviour.

Requirement #1 directly contradicts #2 and #3.

>> If you're going to make all the System.IO stuff use text, at least give us an escape hatch. The unix package is ideally suited, as it's already inherently OS-specific. Something like this would be perfect:
>
> You can already do this with the implemented design. We have:
>
>   openFile :: FilePath -> IO Handle
>
> The FilePath will be encoded in the fileSystemEncoding. On Unix this will have PEP383 roundtripping behaviour. So if you want openFile' :: [Byte] -> IO Handle you can write something like this:
>
>   escape = map (\b -> if b < 128 then chr (fromIntegral b) else chr (0xEF00 + fromIntegral b))
>   openFile' = openFile . escape
>
> The bytes that reach the API call will be exactly the ones you supply. (You can also implement escape by just decoding the [Byte] with the fileSystemEncoding).
>
> Likewise, if you have a String and want to get the [Byte] we decoded it from, you just need to encode the String again with the fileSystemEncoding.
>
> If this is not enough for you please let me know, but it seems to me that it covers all your use cases, without any need to reimplement the FFI bindings.

This is not enough, since these strings are still being passed through the potentially (and in 7.2.1, actually) broken path encoder.

If the unix package had defined functions which operate on the correct type (CString / ByteString), then it would not be necessary to patch base. I could just call the POSIX functions from system-fileio and be done with it.

And this solution still assumes that there is such a thing as a filesystem encoding in POSIX. There isn't. A file path is an arbitrary sequence of bytes, with no significance except what the application user interface decides.

It seems to me that there are two ways to provide bindings to operating system functionality.

One is to give low-level access, using abstractions as close to the real API as possible. In this model, unix would provide functions like [[ rename :: ByteString -> ByteString -> IO () ]], and I would know that it's not going to do anything weird to the parameters.

Another is to pretend that operating systems are all the same, and can have their APIs abstracted away to some hypothetical virtual system. This model just makes it more difficult for programmers to access the OS, as they have to learn both the standard API, *and* whatever weird thing has been layered on top of it.
Re: behaviour change in getDirectoryContents in GHC 7.2?
FYI: I just released new versions of system-filepath and system-fileio, which attempt to work around the changes in GHC 7.2.

On Wed, Nov 2, 2011 at 11:55, Max Bolingbroke batterseapo...@hotmail.com wrote:
>> Maybe I'm misunderstanding, but it sounds like you're still trying to treat posix file paths as text. There should not be any iconv or locales or anything involved in looking up a posix file path.
>
> The thing is that on every non-Unix OS paths *can* be interpreted as text, and people expect them to be. In fact, even on Unix most programs/frameworks interpret them as text - e.g. IIRC Qt's QString class is used for filenames in that framework, and if you view filenames in an end-user app like Nautilus it obviously decodes them in the current locale for presentation.

There is a difference between how paths are rendered to users, and how they are handled by applications.

Applications *must* use whatever the operating system says a path is. If a path is bytes, they must use bytes. If a path is text, they must use text. How they present paths to the user is a matter of user interface design.

For what it's worth, on my Ubuntu system, Nautilus ignores the locale and just treats all paths as either UTF8 or invalid. To me, this seems like the most reasonable option; the concept of locale encoding is entirely vestigial, and should only be used in certain specialized cases.

> Paths as text is just what people expect, and is grandfathered into the Haskell libraries themselves as type FilePath = String. PEP-383 behaviour is (I think) a good way to satisfy this expectation while still not sacrificing the ability to deal with files that have names encoded in some way other than the locale encoding.

Paths as text is what *Windows* programmers expect. Paths as bytes is what's expected by programmers on non-Windows OSes, including Linux and OS X.

I'm not saying one is inherently better than the other, but considering that various UNIX and UNIX-like operating systems have been using byte-based paths for near on forty years now, trying to abolish them by redefining the type is not a useful action.

> (Perhaps if Haskell had an abstract FilePath data type rather than FilePath = String we could do something different.

This is the general purpose of my system-filepath package, which provides a set of generic modifications, applicable to paths from various OS families.

> But it's not clear if we could, without also having ugliness like getArgs :: IO [Byte])

We *ought* to have getArgs :: IO [ByteString], at least on POSIX systems.

It's totally OK if high-level packages like directory want to hide details behind some nice abstractions. But the low-level libraries, like base and unix and Win32, must must must provide direct low-level access to the operating system's APIs.

The only other option is to re-implement half of the standard library using FFI bindings, which is ugly (for file/directory manipulation) or nearly impossible (for opening handles).

If you're going to make all the System.IO stuff use text, at least give us an escape hatch. The unix package is ideally suited, as it's already inherently OS-specific. Something like this would be perfect:

--
System.Posix.File.openHandle :: CString -> IOMode -> IO Handle
System.Posix.File.rename :: CString -> CString -> IO ()
--
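To make the CString-based proposal concrete, here is a minimal sketch of a byte-level rename written directly against the C library through the FFI. It is only an illustration of the idea, not the API of any released package: renameRaw and c_rename are hypothetical names, and the code assumes a POSIX system plus the bytestring package.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import qualified Data.ByteString as B
import Foreign.C.Error (throwErrnoIfMinus1_)
import Foreign.C.String (CString)
import Foreign.C.Types (CInt)

-- Direct binding to POSIX rename(2); no path encoding layer is involved.
foreign import ccall "rename"
  c_rename :: CString -> CString -> IO CInt

-- Hypothetical wrapper: the bytes given here are the bytes the kernel
-- sees. useAsCString copies each path and appends the NUL terminator.
renameRaw :: B.ByteString -> B.ByteString -> IO ()
renameRaw old new =
  B.useAsCString old $ \oldPtr ->
    B.useAsCString new $ \newPtr ->
      throwErrnoIfMinus1_ "rename" (c_rename oldPtr newPtr)
```

The cost of this approach is the one described above: every operation needs its own binding, and operations that return a Handle are much harder to reproduce this way.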
Re: behaviour change in getDirectoryContents in GHC 7.2?
On Wed, Nov 2, 2011 at 06:53, Max Bolingbroke batterseapo...@hotmail.com wrote:
> I've got a patch that will work around the issue in most situations by avoiding the iconv code path. With the patch everything will work OK as long as the system locale is one that we have a native-Haskell decoder for (i.e. basically UTF-8). So you will still be able to get the broken behaviour if the above 3 conditions are met AND your system locale is not UTF-8.

What package does this patch -- unix, directory, something else?

> I think the only way to fix this last case in general is to fix iconv itself, so I'm going to see if I can get a patch upstream. Fixing it for people with UTF-8 locales should be enough for 99% of users, though.

Maybe I'm misunderstanding, but it sounds like you're still trying to treat posix file paths as text. There should not be any iconv or locales or anything involved in looking up a posix file path.
Re: behaviour change in getDirectoryContents in GHC 7.2?
You're right -- many parts of system-fileio (the parts based on directory) are broken due to this. I'll need to update it to call the posix/win32 functions directly.

IMO, the GHC behavior in <=7.0 is ugly, but the behavior in 7.2 is fundamentally wrong.

Different OSes have different definitions of a file path. A Windows path is a sequence of Unicode characters. A Linux/BSD path is a sequence of bytes. I'm not certain what OSX does, but I believe it uses bytes also.

In GHC <= 7.0, the String type was used for both sorts of paths, with interpretation of the contents being OS-dependent. This sort of works, because it's possible to represent both byte- and text-based paths in String.

GHC 7.2 assumes Linux/BSD paths are text, which 1) silently breaks all existing code and 2) makes it impossible to fix within the given API.

On Tue, Nov 1, 2011 at 08:48, Felipe Almeida Lessa felipe.le...@gmail.com wrote:
> On Tue, Nov 1, 2011 at 5:16 AM, Ganesh Sittampalam gan...@earth.li wrote:
>> I'm just investigating what we can do about a problem with darcs' handling of non-ASCII filenames on GHC 7.2.
>>
>> The issue is apparently that as of GHC 7.2, getDirectoryContents now tries to decode filenames in the current locale, rather than converting a stream of bytes into characters: http://bugs.darcs.net/issue2095
>>
>> I found an old thread on the subject: http://www.haskell.org/pipermail/haskell-cafe/2009-June/062795.html and some GHC tickets (e.g. http://hackage.haskell.org/trac/ghc/ticket/3300)
>>
>> Can anyone point me at the rationale and details of the change and/or suggest workarounds?
>
> You could try using system-fileio [1], but by reading its source code I guess that it may have the same bug (since it tries to decode what the directory package gives). I'm CCing John Millikin, its maintainer.
>
> Cheers,
>
> [1] http://hackage.haskell.org/packages/archive/system-fileio/0.3.2.1/doc/html/Filesystem.html#v:listDirectory
>
> --
> Felipe.
Re: behaviour change in getDirectoryContents in GHC 7.2?
On Tue, Nov 1, 2011 at 11:43, Max Bolingbroke batterseapo...@hotmail.com wrote:
> Hi John,
>
> On 1 November 2011 17:14, John Millikin jmilli...@gmail.com wrote:
>> GHC 7.2 assumes Linux/BSD paths are text, which 1) silently breaks all existing code and 2) makes it impossible to fix within the given API.
>
> Please can you give an example of code that is broken with the new behaviour? The PEP 383 mechanism will unavoidably break *some* code but I don't expect there to be much of it. One thing that most likely *will* be broken is code that attempts to reinterpret a String as a byte string - i.e. assuming that it was decoded using latin1, but I expect that such code can just be deleted when you upgrade to 7.2.

Examples of broken code are Darcs, my system-fileio, and likely anything else which needs to open Unicode-named files in both 7.0 and 7.2.

As a quick example, consider the case of files with encodings different from the user's locale. This is *very* common, especially when interoperating with foreign Windows systems.

  $ ghci-7.0.4
  GHC> import System.Directory
  GHC> createDirectory "path-test"
  GHC> writeFile "path-test/\xA1\xA5" "hello\n"
  GHC> writeFile "path-test/\xC2\xA1\xC2\xA5" "world\n"
  GHC> ^D

  $ ghci-7.2.1
  GHC> import System.Directory
  GHC> getDirectoryContents "path-test"
  ["\161\165","\61345\61349","..","."]
  GHC> readFile "path-test/\161\165"
  "world\n"
  GHC> readFile "path-test/\61345\61349"
  *** Exception: path-test/: openFile: does not exist (No such file or directory)

> As I pointed out earlier in the thread you can recover the old behaviour if you really want it by manually reencoding the strings, so I would dispute the claim that it is impossible to fix within the given API.

Please describe how I can, in GHC 7.2, read the contents of the file path-test/\xA1\xA5 without changing my locale.

As far as I can tell, there is no way to do this using the standard libraries. I would have to fall back to the unix package, or even FFI imports, to open that file.