Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-30 Thread David Cantrell
On Fri, Aug 23, 2013 at 05:37:32PM +0100, Nic Gibson wrote: Just because it shows up in code doesn't make it ASCII. There is no pound sterling character in the ASCII character set. If you are seeing this in your then your code is either a) not encoded as ASCII (probably Latin-1 or UTF-8)

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-24 Thread Dave Cross
On 08/23/2013 05:32 PM, gvim wrote: On 23/08/2013 16:40, Dave Cross wrote: In your original email, you said: The problematic mail file doesn't display any non-ASCII characters when opened in Vim A pound sign is a non-ASCII character. By pound sign do you mean £ or #? I don't quite

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-23 Thread gvim
On 22/08/2013 17:26, Dave Cross wrote: There's a pound sign at the end of that line. A3. That's your problem. Dave... Thanks. Appreciated. gvim

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-23 Thread Dave Cross
Quoting gvim gvi...@gmail.com: On 22/08/2013 17:26, Dave Cross wrote: There's a pound sign at the end of that line. A3. That's your problem. Thanks. Appreciated. In your original email, you said: The problematic mail file doesn't display any non-ASCII characters when opened in Vim A

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-23 Thread gvim
On 23/08/2013 16:40, Dave Cross wrote: In your original email, you said: The problematic mail file doesn't display any non-ASCII characters when opened in Vim A pound sign is a non-ASCII character. Dave... By pound sign do you mean £ or #? I don't quite understand as both characters show

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-23 Thread Roger Bell_West
On Fri, Aug 23, 2013 at 05:32:17PM +0100, gvim wrote: By pound sign do you mean £ or #? I don't quite understand as both characters show up normally in code, ie. comments and currency. You didn't specify that you were speaking American. In the rest of the world,

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-23 Thread Nic Gibson
On 23 Aug 2013, at 17:32, gvim gvi...@gmail.com wrote: On 23/08/2013 16:40, Dave Cross wrote: In your original email, you said: The problematic mail file doesn't display any non-ASCII characters when opened in Vim A pound sign is a non-ASCII character. Dave... By pound sign do

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-23 Thread James Laver
On Fri, Aug 23, 2013 at 5:37 PM, Nic Gibson n...@corbas.co.uk wrote: If you are seeing this in your then your code is either a) not encoded as ASCII (probably Latin-1 or UTF-8) or b) broken If you're seeing this in client-provided CSV, I recommend running a mile. Notes from the recent

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-23 Thread Matt Lawrence
On 23/08/2013 19:45, James Laver wrote: - UTF-8 is a great interchange format. But it's quite annoying perl doesn't have a flag to automatically en/decode to/from UTF-8 as regards STDIN and STDOUT (and in the case of STDIN, probably anything that uses) Doesn't the -C switch count? Or indeed

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-23 Thread James Laver
On Fri, Aug 23, 2013 at 8:01 PM, Matt Lawrence matt.lawre...@virgin.net wrote: Doesn't the -C switch count? Or indeed the PERL_UNICODE environment variable. Matt I take it back. Wish I'd known about this 2 days ago, of course.

Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread gvim
Can anyone who also uses Ruby enlighten me? For benchmarking purposes this Perl 5.16 script works fine parsing a large Maildir folder: use 5.016; use autodie; my $dir = 'my/mail/path'; chdir $dir; opendir my $dh, $dir; while

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread Joel Bernstein
You can use the ruby String#encode method to force UTF-8 encoding on the string and have invalid byte sequences replaced. At a guess your perl code is happy with the invalid sequence because it's not treating the string as unicode at all. I'd expect it to fail in the same way if you force the
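
A minimal sketch of the String#encode suggestion above, in Ruby. The helper name replace_invalid_utf8 and the sample string are hypothetical, not taken from the thread. It assumes the input is meant to be UTF-8 and goes via UTF-16LE so that the conversion is a real transcode on older Rubies as well; invalid bytes come out as U+FFFD.

```ruby
# Hypothetical helper: return a valid UTF-8 copy of +str+, with every
# invalid byte sequence replaced by U+FFFD (the default replacement
# character when the destination is a Unicode encoding).
def replace_invalid_utf8(str)
  # The first encode treats the bytes as UTF-8 and replaces anything invalid;
  # the second converts the now-clean UTF-16LE text back to UTF-8.
  str.encode("UTF-16LE", "UTF-8", invalid: :replace).encode("UTF-8")
end

# Sample input (an assumption for illustration): a lone 0xA3 byte,
# which is not a valid UTF-8 sequence on its own.
dirty = "under \xA3"
clean = replace_invalid_utf8(dirty)

puts dirty.valid_encoding?   # invalid as labelled
puts clean.valid_encoding?   # valid after replacement
```

This only makes the string safe to process; the original character is lost. If the data is really Latin-1, transcoding from that encoding keeps it.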

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread Dave Cross
Quoting gvim gvi...@gmail.com: Can anyone who also uses Ruby enlighten me? For benchmarking purposes this Perl 5.16 script works fine parsing a large Maildir folder: use 5.016; use autodie; my $dir = 'my/mail/path'; chdir $dir;

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread Paul Makepeace
On Thu, Aug 22, 2013 at 8:39 AM, gvim gvi...@gmail.com wrote: The problematic mail file doesn't display any non-ASCII characters when opened in Vim. Here's the Ruby 2.0 error message: How about when you hexdump it?

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread gvim
On 22/08/2013 16:59, Dave Cross wrote: Without seeing your data (or knowing anything much about Ruby's string-handling) I'd guess that your file is in one of the extended ASCII character sets (probably ISO-8859-1 or cp1252). You haven't told Perl to decode the data in any way, so it's just

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread Joel Bernstein
What problematic char? Why not just tell Ruby your strings are Latin-1? BTW Latin-1 is not ASCII. If your data really *was* ASCII (a 7-bit charset), as you had claimed, it would also be perfectly valid UTF-8. To be clear, Ruby is correct, but if you tell it your data isn't in the encoding it
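
A short Ruby sketch of the two points above: 7-bit ASCII data is already valid UTF-8, and Latin-1 data only needs to be labelled as such before conversion. The sample strings are made up for illustration, and ISO-8859-1 is assumed as the source encoding.

```ruby
# Pure 7-bit ASCII bytes are valid whichever of the two labels is applied.
ascii_bytes   = "Subject: hello".b                       # .b gives a binary copy
ascii_as_utf8 = ascii_bytes.dup.force_encoding("UTF-8")
ascii_ok      = ascii_bytes.ascii_only? && ascii_as_utf8.valid_encoding?

# A Latin-1 pound sign is the single byte 0xA3, which is not valid UTF-8.
latin1_bytes = "\xA3100".b
mislabelled  = latin1_bytes.dup.force_encoding("UTF-8")

# Label the bytes with their real encoding, then transcode to UTF-8.
labelled = latin1_bytes.dup.force_encoding("ISO-8859-1")
utf8     = labelled.encode("UTF-8")

puts ascii_ok                     # ASCII passes as UTF-8
puts mislabelled.valid_encoding?  # Latin-1 bytes labelled UTF-8 do not
puts utf8                         # pound sign followed by 100
```

force_encoding only changes the label on the bytes; encode is the step that rewrites them.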

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread gvim
On 22/08/2013 17:05, Paul Makepeace wrote: How about when you hexdump it? I wouldn't know but here's the result of hexdump -C (literal text removed from line end): 58 2d 4d 6f 7a 69 6c 6c 61 2d 4b 65 79 73 3a 20 0010 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 *

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread Dave Cross
Quoting gvim gvi...@gmail.com: On 22/08/2013 17:05, Paul Makepeace wrote: How about when you hexdump it? I wouldn't know but here's the result of hexdump -C (literal text removed from line end): 0560 75 67 68 74 20 66 6f 72 20 75 6e 64 65 72 20 a3 There's a pound sign at the
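
The same check can be sketched in Ruby using the bytes from the hexdump line quoted above. Variable names are illustrative, and the base offset 0x0560 is read from the start of that hexdump line.

```ruby
# Bytes copied from the quoted hexdump line at offset 0560.
hex  = "75 67 68 74 20 66 6f 72 20 75 6e 64 65 72 20 a3"
base = 0x0560

# Turn the hex pairs into a binary string.
raw = hex.split.map { |h| h.to_i(16) }.pack("C*")

# Collect [file offset, byte] for every byte with the high bit set.
high = raw.bytes.each_with_index
          .select { |byte, _i| byte > 0x7F }
          .map    { |byte, i| [base + i, byte] }

# The same bytes labelled as UTF-8 fail validation because of that byte.
utf8_ok = raw.dup.force_encoding("UTF-8").valid_encoding?

high.each { |offset, byte| printf("offset 0x%04x: byte 0x%02x\n", offset, byte) }
puts utf8_ok
```

Everything before the final byte is plain ASCII ("ught for under "); only 0xA3 has the high bit set.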

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread Paul Makepeace
On Thu, Aug 22, 2013 at 9:15 AM, gvim gvi...@gmail.com wrote: On 22/08/2013 17:05, Paul Makepeace wrote: How about when you hexdump it? I wouldn't know but here's the result of hexdump -C (literal text removed from line end): You're looking for high bits in the characters, as a first