Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-10 Thread John Millikin
On Thu, Nov 10, 2011 at 03:28, Simon Marlow marlo...@gmail.com wrote:
 I've done a search/replace and called it RawFilePath.  Ok?

Fantastic, thank you very much.

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-09 Thread John Millikin
On Wed, Nov 9, 2011 at 08:04, Simon Marlow marlo...@gmail.com wrote:
 Ok, I spent most of today adding ByteString alternatives for all of the
 functions in System.Posix that use FilePath or environment strings.  The
 Haddocks for my augmented unix package are here:

 http://community.haskell.org/~simonmar/unix-with-bytestring-extras/index.html

 In particular, the module System.Posix.ByteString is the whole System.Posix
 API but with ByteString FilePaths and environment strings:

 http://community.haskell.org/~simonmar/unix-with-bytestring-extras/System-Posix-ByteString.html

This looks lovely -- thank you.

Once it's released, I'll port all my libraries over to using it.
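
For readers following along, here is a minimal sketch of what the
ByteString API allows: listing a directory without decoding any names.
It assumes the module layout of the released unix package, where
System.Posix.ByteString re-exports the directory-stream functions and
the RawFilePath type. The helper name listDirRaw is hypothetical and
not part of the package.

```haskell
import Control.Exception (bracket)
import qualified Data.ByteString as B
import System.Posix.ByteString
  (RawFilePath, closeDirStream, openDirStream, readDirStream)

-- | List the raw entry names of a directory. No locale decoding takes
-- place: each name is the byte sequence returned by the OS.
-- (listDirRaw is a hypothetical helper, not part of the unix package.)
listDirRaw :: RawFilePath -> IO [RawFilePath]
listDirRaw dir = bracket (openDirStream dir) closeDirStream go
  where
    go stream = do
      name <- readDirStream stream
      if B.null name              -- an empty name marks the end of the stream
        then return []
        else fmap (name :) (go stream)
```

The result includes the "." and ".." entries when the filesystem
reports them, just as getDirectoryContents does.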

 It has one addition relative to System.Posix:

  getArgs :: IO [ByteString]

Thank you very much! Several tools I use daily accept binary data as
command-line options, and this will make it much easier to port them
to Haskell in the future.
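
A small sketch of how byte-valued arguments could be inspected with
this getArgs, assuming it is exported from System.Posix.ByteString as
described above. The names hexArg and dumpArgs are hypothetical helpers
chosen for illustration.

```haskell
import qualified Data.ByteString as B
import Numeric (showHex)
import System.Posix.ByteString (getArgs)

-- | Render an argument as lowercase hex, two digits per byte, so that
-- bytes which are not valid text in any encoding remain visible.
hexArg :: B.ByteString -> String
hexArg = concatMap hexByte . B.unpack
  where
    hexByte w =
      let s = showHex w ""
      in if length s < 2 then '0' : s else s

-- | Print every command-line argument as hex, one per line.
-- Call this from main; the arguments are never decoded as text.
dumpArgs :: IO ()
dumpArgs = getArgs >>= mapM_ (putStrLn . hexArg)
```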

 Let me know what you think.  I suspect the main controversial aspect is that
 I included

  type FilePath = ByteString

 which is a bit cute but might be confusing.

Indeed, I was very confused when I saw that in the docs. If it's not
too much trouble, could those functions accept/return ByteString
directly?



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-08 Thread John Millikin
On Tue, Nov 8, 2011 at 03:04, Simon Marlow marlo...@gmail.com wrote:
 As mentioned earlier in the thread, this behavior is breaking things.
 Due to an implementation error, programs compiled with GHC 7.2 on
 POSIX systems cannot open files unless their paths also happen to be
 valid text according to their locale. It is very difficult to work
 around this error, because the paths-are-text logic was placed at a
 very low level in the library stack.

 So your objection is that there is a bug?  What if we fixed the bug?

My objection is that the current implementation provides no way to
work around potential bugs.

GHC is software. Like all software, it contains errors, and new
features are likely to contain more errors. When adding behavior like
automatic path encoding, there should always be a way to avoid or work
around it, in case a severe bug is discovered.

 It would probably be better to have an abstract FilePath type and to keep
 the original bytes, decoding on demand.  But that is a big change to the API
 and would break much more code.  One day we'll do this properly; for now we
 have this, which I think is a pretty reasonable compromise.

 Please understand, I am not arguing against the existence of this
 encoding layer in general. It's a fine idea for a simplistic
 high-level filesystem interaction library. But it should be
 *optional*, not part of the compiler or base.

 Ok, so I was about to reply and say that the low-level API is available via
 the unix and Win32 packages, and then I thought I should check first, and I
 discovered that even using System.Posix you get the magic encoding
 behaviour.

 I really think we should provide the native APIs.  The problem is that the
 System.Posix.Directory API is all in terms of FilePath (=String), and if we
 gave that a different meaning from the System.Directory FilePaths then
 confusion would ensue.  So perhaps we need to add another API to
 System.Posix with filesystem operations in terms of ByteString, and
 similarly for Win32.

+1

I think most users would be OK with having System.Posix treat FilePath
differently, as long as this is clearly documented, but if you feel a
separate API is better then I have no objection. As long as there's
some way to say "I know what I'm doing, here's the bytes" to the
library.

The Win32 package uses wide-character functions, so I'm not sure
whether bytes would be appropriate there. My instinct says to stick
with chars, via withCWString or equivalent. The package maintainer
will have a better idea of what fits with the OS's idioms.



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-07 Thread John Millikin
On Mon, Nov 7, 2011 at 09:02, Simon Marlow marlo...@gmail.com wrote:
 I think you might be misunderstanding how the new API works.  Basically,
 imagine a reversible transformation:

  encode :: String -> [Word8]
  decode :: [Word8] -> String

 this transformation is applied in the appropriate direction by the IO
 library to translate filesystem paths into FilePath and vice versa.  No
 information is lost; furthermore you can apply the transformation yourself
 in order to recover the original [Word8] from a String, or to inject your
 own [Word8] file path.

 Ok?

I understand how the API is intended / designed to work; however, the
implementation does not actually do this. My argument is that this
transformation should be in a high-level library like "directory", and
the low-level libraries like "base" or "unix" ought to provide
functions which do not transform their inputs. That way, when an error
is found in the encoding logic, it can be fixed by just pushing a new
version of the affected library to Hackage, instead of requiring a new
version of the compiler.

I am also not convinced that it is possible to correctly implement
either of these functions if their behavior is dependent on the user's
locale.

 All this does is mean that the common case where you want to interpret file
 system paths as text works with no fuss, without breaking anything in the
 case when the file system paths are not actually text.

As mentioned earlier in the thread, this behavior is breaking things.
Due to an implementation error, programs compiled with GHC 7.2 on
POSIX systems cannot open files unless their paths also happen to be
valid text according to their locale. It is very difficult to work
around this error, because the paths-are-text logic was placed at a
very low level in the library stack.

 It would probably be better to have an abstract FilePath type and to keep
 the original bytes, decoding on demand.  But that is a big change to the API
 and would break much more code.  One day we'll do this properly; for now we
 have this, which I think is a pretty reasonable compromise.

Please understand, I am not arguing against the existence of this
encoding layer in general. It's a fine idea for a simplistic
high-level filesystem interaction library. But it should be
*optional*, not part of the compiler or base.

As implemented in GHC 7.2, this encoding is a complex and untested
behavior with no escape hatch.



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-07 Thread John Millikin
On Mon, Nov 7, 2011 at 15:39, Yitzchak Gale g...@sefer.org wrote:
 The problem is that Haskell 98 specifies type FilePath = String.
 In retrospect, we now know that this is too simplistic.
 But that's what we have right now.

This is *a* problem, but not a particularly major one; the definition
of paths in GHC 7.0 (text on some systems, bytes on others) is
inelegant but workable.

The main problem, IMO, is that the semantics of openFile et al changed
in a way that is impossible to check for statically, and there was no
mention of this in the documentation. It's one thing to make a change
which will cause new compilation failures. It's quite another to
introduce an undocumented change in important semantics.

 As implemented in GHC 7.2, this encoding is a complex and untested
 behavior with no escape hatch.

 Isn't System.Posix.IO the escape hatch?

 Even though FilePath is still used there instead of
 ByteString as it should be, this is the
 low-level POSIX-specific library. So the old hack of
 interpreting the lowest 8 bits as bytes makes
 a lot more sense there.

System.Posix.IO, and the unix package in general, also perform the
new path encoding/decoding.



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-06 Thread John Millikin
2011/11/6 Max Bolingbroke batterseapo...@hotmail.com:
 On 6 November 2011 04:14, John Millikin jmilli...@gmail.com wrote:
 For what it's worth, on my Ubuntu system, Nautilus ignores the locale
 and just treats all paths as either UTF8 or invalid.
 To me, this seems like the most reasonable option; the concept of
 locale encoding is entirely vestigial, and should only be used in
 certain specialized cases.

 Unfortunately non-UTF8 locale encodings are seen in practice quite
 often. I'm not sure about Linux, but certainly lots of Windows systems
 are configured with a locale encoding like GBK or Big5.

This doesn't really matter for file paths, though. The Win32 file API
uses wide-character functions, which ought to work with Unicode text
regardless of what the user set their locale to.

 Paths as text is what *Windows* programmers expect. Paths as bytes is
 what's expected by programmers on non-Windows OSes, including Linux
 and OS X.

 IIRC paths on OS X are guaranteed to be valid UTF-8. The only platform
 that uses bytes for paths (that we care about) is Linux.

UTF-8 is bytes. It can be treated as text in some cases, but it's
better to think about it as bytes.

 I'm not saying one is inherently better than the other, but
 considering that various UNIX and UNIX-like operating systems have
 been using byte-based paths for near on forty years now, trying to
 abolish them by redefining the type is not a useful action.

 We have to:
  1. Provide an API that makes sense on all our supported OSes
  2. Have getArgs :: IO [String]
  3. Have it such that if you go to your console and write
 (./MyHaskellProgram 你好) then getArgs tells you ["你好"]

 Given these constraints I don't see any alternative to PEP-383 behaviour.

Requirement #1 directly contradicts #2 and #3.

 If you're going to make all the System.IO stuff use text, at least
 give us an escape hatch. The unix package is ideally suited, as it's
 already inherently OS-specific. Something like this would be perfect:

 You can already do this with the implemented design. We have:

 openFile :: FilePath -> IO Handle

 The FilePath will be encoded in the fileSystemEncoding. On Unix this
 will have PEP383 roundtripping behaviour. So if you want openFile' ::
 [Byte] -> IO Handle you can write something like this:

 escape = map (\b -> if b < 128 then chr b else chr (0xEF00 + b))
 openFile' = openFile . escape

 The bytes that reach the API call will be exactly the ones you supply.
 (You can also implement escape by just encoding the [Byte] with the
 fileSystemEncoding).

 Likewise, if you have a String and want to get the [Byte] we decoded
 it from, you just need to encode the String again with the
 fileSystemEncoding.

 If this is not enough for you please let me know, but it seems to me
 that it covers all your use cases, without any need to reimplement the
 FFI bindings.
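
To make the proposal concrete, here is a minimal, self-contained sketch
of the byte-escaping idea and its inverse. It models only the escaping
of raw bytes with the 0xEF00 offset quoted above; it does not model the
locale decoding that the real fileSystemEncoding applies to valid
sequences. The name unescape and the Maybe result are assumptions made
for illustration.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Map each byte to a Char: ASCII bytes map to themselves, and bytes
-- of 128 or more land in the private-use range starting at 0xEF00.
escape :: [Word8] -> String
escape = map toChar
  where
    toChar b
      | b < 128   = chr (fromIntegral b)
      | otherwise = chr (0xEF00 + fromIntegral b)

-- | Inverse of 'escape' on its image. Any Char that 'escape' could not
-- have produced yields Nothing. (Hypothetical helper for illustration.)
unescape :: String -> Maybe [Word8]
unescape = mapM fromChar
  where
    fromChar c
      | n < 128                    = Just (fromIntegral n)
      | n >= 0xEF80 && n <= 0xEFFF = Just (fromIntegral (n - 0xEF00))
      | otherwise                  = Nothing
      where
        n = ord c
```

With these definitions, escape [0xA1, 0xA5] gives "\61345\61349", and
unescape (escape bs) returns Just bs for every byte list bs.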

This is not enough, since these strings are still being passed through
the potentially (and in 7.2.1, actually) broken path encoder.

If the unix package had defined functions which operate on the
correct type (CString / ByteString), then it would not be necessary to
patch base. I could just call the POSIX functions from system-fileio
and be done with it.

And this solution still assumes that there is such a thing as a
filesystem encoding in POSIX. There isn't. A file path is an arbitrary
sequence of bytes, with no significance except what the application
user interface decides.

It seems to me that there are two ways to provide bindings to operating
system functionality.

One is to give low-level access, using abstractions as close to the
real API as possible. In this model, unix would provide functions
like [[ rename :: ByteString -> ByteString -> IO () ]], and I would
know that it's not going to do anything weird to the parameters.

Another is to pretend that operating systems are all the same, and can
have their APIs abstracted away to some hypothetical virtual system.
This model just makes it more difficult for programmers to access the
OS, as they have to learn both the standard API, *and* whatever weird
thing has been layered on top of it.



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-05 Thread John Millikin
FYI: I just released new versions of system-filepath and
system-fileio, which attempt to work around the changes in GHC 7.2.

On Wed, Nov 2, 2011 at 11:55, Max Bolingbroke
batterseapo...@hotmail.com wrote:
 Maybe I'm misunderstanding, but it sounds like you're still trying to
 treat posix file paths as text. There should not be any iconv or
 locales or anything involved in looking up a posix file path.

 The thing is that on every non-Unix OS paths *can* be interpreted as
 text, and people expect them to be. In fact, even on Unix most
 programs/frameworks interpret them as text - e.g. IIRC Qt's QString
 class is used for filenames in that framework, and if you view
 filenames in an end-user app like Nautilus it obviously decodes them
 in the current locale for presentation.

There is a difference between how paths are rendered to users, and how
they are handled by applications.

Applications *must* use whatever the operating system says a path is.
If a path is bytes, they must use bytes. If a path is text, they must
use text.

How they present paths to the user is a matter of user interface design.

For what it's worth, on my Ubuntu system, Nautilus ignores the locale
and just treats all paths as either UTF8 or invalid.
To me, this seems like the most reasonable option; the concept of
locale encoding is entirely vestigial, and should only be used in
certain specialized cases.

 Paths as text is just what people expect, and is grandfathered into
 the Haskell libraries itself as type FilePath = String. PEP-383
 behaviour is (I think) a good way to satisfy this expectation while
 still not sacrificing the ability to deal with files that have names
 encoded in some way other than the locale encoding.

Paths as text is what *Windows* programmers expect. Paths as bytes is
what's expected by programmers on non-Windows OSes, including Linux
and OS X.

I'm not saying one is inherently better than the other, but
considering that various UNIX and UNIX-like operating systems have
been using byte-based paths for near on forty years now, trying to
abolish them by redefining the type is not a useful action.

 (Perhaps if Haskell had an abstract FilePath data type rather than
 FilePath = String we could do something different.

This is the general purpose of my system-filepath package, which
provides a set of generic modifications, applicable to paths from
various OS families.

 But it's not clear
 if we could, without also having ugliness like getArgs :: IO [Byte])

We *ought* to have getArgs :: IO [ByteString], at least on POSIX systems.

It's totally OK if high-level packages like "directory" want to hide
details behind some nice abstractions. But the low-level libraries,
like "base" and "unix" and "Win32", must must must provide direct
low-level access to the operating system's APIs.

The only other option is to re-implement half of the standard library
using FFI bindings, which is ugly (for file/directory manipulation) or
nearly impossible (for opening handles).

If you're going to make all the System.IO stuff use text, at least
give us an escape hatch. The unix package is ideally suited, as it's
already inherently OS-specific. Something like this would be perfect:

--
System.Posix.File.openHandle :: CString -> IOMode -> IO Handle

System.Posix.File.rename :: CString -> CString -> IO ()
--
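
For comparison, the FFI route mentioned above could look roughly like
the following sketch, which binds the C rename function directly and
passes the path bytes through untouched. The name renameRaw is
hypothetical, and using ByteString as the carrier (marshalled with
useAsCString) is an assumption made for this example rather than part
of the proposal.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import qualified Data.ByteString as B
import Foreign.C.Error (throwErrnoIfMinus1_)
import Foreign.C.String (CString)
import Foreign.C.Types (CInt(..))

-- Direct binding to the C library's rename; no path encoding is involved.
foreign import ccall "rename"
  c_rename :: CString -> CString -> IO CInt

-- | Rename a file given both paths as raw bytes. useAsCString copies
-- each path into a NUL-terminated buffer, so a path with an embedded
-- NUL byte is cut short at that byte. Throws an IOError built from
-- errno on failure. (renameRaw is a hypothetical helper.)
renameRaw :: B.ByteString -> B.ByteString -> IO ()
renameRaw old new =
  B.useAsCString old $ \cOld ->
    B.useAsCString new $ \cNew ->
      throwErrnoIfMinus1_ "renameRaw" (c_rename cOld cNew)
```

This works for file manipulation, but it does not help with opening a
Handle, which is the harder case described above.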



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-02 Thread John Millikin
On Wed, Nov 2, 2011 at 06:53, Max Bolingbroke
batterseapo...@hotmail.com wrote:
 I've got a patch that will work around the issue in most situations by
 avoiding the iconv code path. With the patch everything will work OK
 as long as the system locale is one that we have a native-Haskell
 decoder for (i.e. basically UTF-8). So you will still be able to get
 the broken behaviour if the above 3 conditions are met AND your system
 locale is not UTF-8.

What package does this patch -- unix, directory, something else?

 I think the only way to fix this last case in general is to fix iconv
 itself, so I'm going to see if I can get a patch upstream. Fixing it
 for people with UTF-8 locales should be enough for 99% of users,
 though.

Maybe I'm misunderstanding, but it sounds like you're still trying to
treat posix file paths as text. There should not be any iconv or
locales or anything involved in looking up a posix file path.



Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-01 Thread John Millikin
You're right -- many parts of system-fileio (the parts based on
"directory") are broken due to this. I'll need to update it to call
the posix/win32 functions directly.

IMO, the GHC behavior in <=7.0 is ugly, but the behavior in 7.2 is
fundamentally wrong.

Different OSes have different definitions of a file path. A Windows
path is a sequence of Unicode characters. A Linux/BSD path is a
sequence of bytes. I'm not certain what OSX does, but I believe it
uses bytes also.

In GHC <= 7.0, the String type was used for both sorts of paths, with
interpretation of the contents being OS-dependent. This sort of works,
because it's possible to represent both byte- and text-based paths in
String.

GHC 7.2 assumes Linux/BSD paths are text, which 1) silently breaks all
existing code and 2) makes it impossible to fix within the given API.

On Tue, Nov 1, 2011 at 08:48, Felipe Almeida Lessa
felipe.le...@gmail.com wrote:
 On Tue, Nov 1, 2011 at 5:16 AM, Ganesh Sittampalam gan...@earth.li wrote:
 I'm just investigating what we can do about a problem with darcs'
 handling of non-ASCII filenames on GHC 7.2.

 The issue is apparently that as of GHC 7.2, getDirectoryContents now
 tries to decode filenames in the current locale, rather than converting
 a stream of bytes into characters: http://bugs.darcs.net/issue2095

 I found an old thread on the subject:
 http://www.haskell.org/pipermail/haskell-cafe/2009-June/062795.html and
 some GHC tickets (e.g. http://hackage.haskell.org/trac/ghc/ticket/3300)

 Can anyone point me at the rationale and details of the change and/or
 suggest workarounds?

 You could try using system-fileio [1], but by reading its source code
 I guess that it may have the same bug (since it tries to decode what
 the directory package gives).  I'm CCing John Millikin, its
 maintainer.

 Cheers,

 [1] http://hackage.haskell.org/packages/archive/system-fileio/0.3.2.1/doc/html/Filesystem.html#v:listDirectory

 --
 Felipe.




Re: behaviour change in getDirectoryContents in GHC 7.2?

2011-11-01 Thread John Millikin
On Tue, Nov 1, 2011 at 11:43, Max Bolingbroke
batterseapo...@hotmail.com wrote:
 Hi John,

 On 1 November 2011 17:14, John Millikin jmilli...@gmail.com wrote:
 GHC 7.2 assumes Linux/BSD paths are text, which 1) silently breaks all
 existing code and 2) makes it impossible to fix within the given API.

 Please can you give an example of code that is broken with the new
 behaviour? The PEP 383 mechanism will unavoidably break *some* code
 but I don't expect there to be much of it. One thing that most likely
 *will* be broken is code that attempts to reinterpret a String as a
 byte string - i.e. assuming that it was decoded using latin1, but I
 expect that such code can just be deleted when you upgrade to 7.2.

Examples of broken code are Darcs, my system-fileio, and likely
anything else which needs to open Unicode-named files in both 7.0 and
7.2.

As a quick example, consider the case of files with encodings
different from the user's locale. This is *very* common, especially
when interoperating with foreign Windows systems.

$ ghci-7.0.4
GHC> import System.Directory
GHC> createDirectory "path-test"
GHC> writeFile "path-test/\xA1\xA5" "hello\n"
GHC> writeFile "path-test/\xC2\xA1\xC2\xA5" "world\n"
GHC> ^D

$ ghci-7.2.1
GHC> import System.Directory
GHC> getDirectoryContents "path-test"
["\161\165","\61345\61349","..","."]
GHC> readFile "path-test/\161\165"
"world\n"
GHC> readFile "path-test/\61345\61349"
*** Exception: path-test/: openFile: does not exist (No such file or
directory)

 As I pointed out earlier in the thread you can recover the old
 behaviour if you really want it by manually reencoding the strings, so
 I would dispute the claim that it is impossible to fix within the
 given API.

Please describe how I can, in GHC 7.2, read the contents of the file
"path-test/\xA1\xA5" without changing my locale.

As far as I can tell, there is no way to do this using the standard
libraries. I would have to fall back to the unix package, or even
FFI imports, to open that file.
