Re: What is a punctuation character?

Gabriel Dos Reis Mon, 19 Mar 2012 02:57:40 -0700

On Mon, Mar 19, 2012 at 4:34 AM, Simon Marlow <[email protected]> wrote:
>> On Fri, Mar 16, 2012 at 6:49 PM, Ian Lynagh <[email protected]> wrote:
>> > Hi Gaby,
>> >
>> > On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
>> >>
>> >> OK, thanks!  I guess a take away from this discussion is that what is
>> >> a punctuation is far less well defined than it appears...
>> >
>> > I'm not really sure what you're asking. Haskell's uniSymbol includes
>> > all Unicode characters (should that be codepoints? I'm not a Unicode
>> > expert) in the punctuation category; I'm not sure what the best
>> > reference is, but e.g. table 12 in
>> >    http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
>> > lists a number of Px categories, and a meta-category P "Punctuation".
>> >
>> >
>> > Thanks
>> > Ian
>> >
>>
>> Hi Ian,
>>
>> I guess what I am asking was partly summarized in Iavor's message.
>>
>> For me, the issue started with bullet number 4 in section 1.1
>>
>>      http://www.haskell.org/onlinereport/intro.html#sect1.1
>>
>> which states that:
>>
>>        The lexical structure captures the concrete representation
>>        of Haskell programs in text files.
>>
>> That, combined with the opening of section 2.1 (e.g. the example of terminal
>> syntax) and the fact that the grammar routinely describes two nonterminals,
>> ascXXX (for ASCII characters) and uniXXX (for Unicode characters),
>> suggested that the concrete syntax of Haskell programs in text files is in
>> the ASCII charset.  Note this does not conflict with the general statement
>> that Haskell programs use the Unicode character set, because the uniXXX could
>> use the ASCII charset to introduce Unicode characters -- this is not an
>> uncommon practice for programming languages using Unicode characters; see
>> the link I gave earlier.
>>
>> However, if I understand Malcolm's message correctly, this is not the case.
>> Contrary to what I quoted above, Chapter 2 does NOT specify the concrete
>> representation of Haskell programs in text files.  What it does is to
>> capture the structure of what is obtained from interpreting, *in some
>> unspecified encoding or unspecified alphabet*,  the concrete
>> representation of Haskell programs in text files.  This conclusion is
>> unfortunate, but I believe it is correct.
>> Since the encoding or the alphabet is unspecified, it is no longer
>> necessarily the case that two Haskell implementations would agree on the
>> same lexical interpretation when presented with the same exact text file
>> containing  a Haskell program.
>>
>> In its current form, you are correct that the Report should say "codepoint"
>> instead of characters.
>>
>> I join Iavor's request in clarifying the alphabet used in the grammar.
>
> The report gives meaning to a sequence of codepoints only, it says nothing 
> about how that sequence of codepoints is represented as a string of bytes in 
> a file, nor does it say anything about what those files are called, or even 
> whether there are files at all.


Thanks, Simon.

The fact that the Report is silent about the encoding used to
represent concrete Haskell programs in text files adds
a certain level of non-portability (and confusion).  I found
last night that a proposal has been made to add some
support for specifying the encoding:

    http://hackage.haskell.org/trac/haskell-prime/wiki/UnicodeInHaskellSource

I believe that is a good start.  What are the odds of it being considered
for Haskell 2012?  I suspect the pragma proposal works only if something
is said about the position of that pragma in the source file (e.g. it
must be on the first line, or within the first N bytes of the source file);
otherwise we have an infinite descent: one must already know the encoding
in order to read the pragma that declares it.
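
To make the positional constraint concrete, here is a minimal sketch of how
a reader could look for an encoding declaration before it knows the encoding.
Everything specific here is an assumption for illustration: the pragma
spelling `{-# ENCODING name #-}`, the 256-byte limit, and the names
`pragmaOpen`, `scanLimit`, `declaredEncoding` and `toBytes` are hypothetical
and are not taken from the proposal.  It also only works when the real
encoding is ASCII-compatible in its first line.

```haskell
import Data.Char (chr, isSpace, ord)
import Data.List (stripPrefix)
import Data.Word (Word8)

-- Hypothetical pragma spelling, assumed for illustration only.
pragmaOpen :: String
pragmaOpen = "{-# ENCODING "

-- Hypothetical "first N bytes" limit.
scanLimit :: Int
scanLimit = 256

-- Inspect only the leading 7-bit ASCII bytes of the first line, so that
-- finding the pragma never depends on the encoding it declares.
declaredEncoding :: [Word8] -> Maybe String
declaredEncoding bytes = do
  let asciiPrefix = takeWhile (< 128) (take scanLimit bytes)
      firstLine   = takeWhile (/= '\n') (map (chr . fromIntegral) asciiPrefix)
  rest <- stripPrefix pragmaOpen firstLine
  let (name, close) = break isSpace rest
  if not (null name) && close == " #-}" then Just name else Nothing

-- Helper for the demo below; only meaningful for ASCII input.
toBytes :: String -> [Word8]
toBytes = map (fromIntegral . ord)

main :: IO ()
main = do
  print (declaredEncoding (toBytes "{-# ENCODING UTF-8 #-}\nmodule Main where\n"))
  print (declaredEncoding (toBytes "module Main where\n"))
```

Without a rule of this kind, the reader has no fixed region it can decode
without already knowing the answer.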


>
> Perhaps some clarification is in order in a future revision, and we should 
> use the correct terminology where appropriate.  We should also clarify that 
> "punctuation" means exactly the Punctuation class.

That would be great.  Do you have any comment about the
UnicodeInHaskellSource proposal?
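
For what it is worth, "exactly the Punctuation class" can be sketched with
`generalCategory` from `Data.Char` in base.  The function name
`inPunctuationClass` is mine, and which Unicode version the categories come
from depends on the implementation's `Data.Char`, so treat this as an
illustration rather than a statement of what the Report requires.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- True exactly for the seven Px general categories that make up the
-- meta-category P ("Punctuation").
inPunctuationClass :: Char -> Bool
inPunctuationClass c = case generalCategory c of
  ConnectorPunctuation -> True   -- Pc
  DashPunctuation      -> True   -- Pd
  OpenPunctuation      -> True   -- Ps
  ClosePunctuation     -> True   -- Pe
  InitialQuote         -> True   -- Pi
  FinalQuote           -> True   -- Pf
  OtherPunctuation     -> True   -- Po
  _                    -> False

main :: IO ()
main = mapM_ report "!_-(+$a"
  where
    report c =
      putStrLn (show c ++ "  " ++ show (generalCategory c)
                       ++ "  " ++ show (inPunctuationClass c))
```

Note that '+' and '$' fall in the Symbol categories (Sm and Sc), not in P,
which is part of why the everyday notion of "punctuation" is misleading here.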

> With regards to normalisation and equivalence, my understanding is that 
> Haskell does not support either: two identifiers are equal if and only if 
> they are represented by the same sequence of codepoints.  Again, we could add 
> a clarifying sentence to the report.
>

Ugh.
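
To spell out what that means in practice, here is a small sketch comparing
two spellings that render alike but differ as codepoint sequences.  The names
`precomposed`, `decomposed` and `sameName` are mine, and plain `String`
equality stands in for the identifier comparison Simon describes; whether a
given implementation accepts the decomposed form as an identifier at all is a
separate question.

```haskell
import Data.Char (ord)

-- "café" with the precomposed U+00E9.
precomposed :: String
precomposed = "caf\233"

-- "café" as 'e' followed by the combining acute accent U+0301.
decomposed :: String
decomposed = "cafe\769"

-- Codepoint-by-codepoint comparison, with no normalisation applied.
sameName :: String -> String -> Bool
sameName = (==)

main :: IO ()
main = do
  print (map ord precomposed)             -- [99,97,102,233]
  print (map ord decomposed)              -- [99,97,102,101,769]
  print (sameName precomposed decomposed) -- False
```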

Writing a parser for Haskell was an interesting exercise :-)

-- Gaby

