Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread gvim
Can anyone who also uses Ruby enlighten me? For benchmarking purposes this Perl 5.16 script works fine parsing a large Maildir folder: use 5.016; use autodie; my $dir = 'my/mail/path'; chdir $dir; opendir my $dh, $dir; while

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread Joel Bernstein
You can use the Ruby String#encode method to force UTF-8 encoding on the string and have invalid byte sequences replaced. At a guess, your Perl code is happy with the invalid sequence because it's not treating the string as Unicode at all. I'd expect it to fail in the same way if you force the
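
A minimal sketch of the String#encode approach suggested above. The sample string is hypothetical (the original mail file is not shown in the thread), and treating the input as raw bytes first is an assumption made here so that the conversion is not skipped when source and target encodings are both UTF-8, as could happen on older Ruby versions.

```ruby
# Hypothetical sample: ASCII text followed by a lone 0xA3 byte.
raw = "under \xA3"

# Read as UTF-8, the lone 0xA3 byte is not a valid sequence.
puts raw.valid_encoding?

# Tag the bytes as binary, then transcode to UTF-8, replacing every
# byte that has no UTF-8 equivalent with '?'.
binary  = raw.dup.force_encoding(Encoding::BINARY)
cleaned = binary.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')

puts cleaned.valid_encoding?
puts cleaned
```

This throws the offending byte away rather than interpreting it, so it suits benchmarking or quick parsing more than faithful display of the message.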

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread Dave Cross
Quoting gvim gvi...@gmail.com: Can anyone who also uses Ruby enlighten me? For benchmarking purposes this Perl 5.16 script works fine parsing a large Maildir folder: use 5.016; use autodie; my $dir = 'my/mail/path'; chdir $dir;

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread Paul Makepeace
On Thu, Aug 22, 2013 at 8:39 AM, gvim gvi...@gmail.com wrote: The problematic mail file doesn't display any non-ASCII characters when opened in Vim. Here's the Ruby 2.0 error message: How about when you hexdump it?

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread gvim
On 22/08/2013 16:59, Dave Cross wrote: Without seeing your data (or knowing anything much about Ruby's string-handling) I'd guess that your file is in one of the extended ASCII character sets (probably ISO-8859-1 or cp1252). You haven't told Perl to decode the data in any way, so it's just

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread Joel Bernstein
What problematic char? Why not just tell Ruby your strings are Latin-1? BTW Latin-1 is not ASCII. If your data really *was* ASCII (a 7-bit charset), as you had claimed, it would also be perfectly valid UTF-8. To be clear, Ruby is correct, but if you tell it your data isn't in the encoding it
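
As a rough illustration of both points above: pure ASCII is always valid UTF-8, and a file can be read by telling Ruby its real encoding up front. The temporary file and its contents are invented for this sketch, and ISO-8859-1 is only the guess made in the thread, not a confirmed fact about the mail file.

```ruby
require 'tempfile'

# 7-bit ASCII is a subset of UTF-8, so it always validates.
ascii = "plain ASCII text".dup.force_encoding('UTF-8')
puts ascii.valid_encoding?

# Hypothetical stand-in for the mail file: one Latin-1 pound sign (0xA3).
tmp = Tempfile.new('mail')
tmp.binmode
tmp.write("Price: \xA3")
tmp.close

# 'r:ISO-8859-1:UTF-8' means: the file is Latin-1 on disk,
# hand the program UTF-8 strings.
text = File.open(tmp.path, 'r:ISO-8859-1:UTF-8') { |f| f.read }
tmp.unlink

puts text.encoding
puts text.valid_encoding?
```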

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread gvim
On 22/08/2013 17:05, Paul Makepeace wrote: How about when you hexdump it? I wouldn't know but here's the result of hexdump -C (literal text removed from line end): 58 2d 4d 6f 7a 69 6c 6c 61 2d 4b 65 79 73 3a 20 0010 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 *

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread Dave Cross
Quoting gvim gvi...@gmail.com: On 22/08/2013 17:05, Paul Makepeace wrote: How about when you hexdump it? I wouldn't know but here's the result of hexdump -C (literal text removed from line end): 0560 75 67 68 74 20 66 6f 72 20 75 6e 64 65 72 20 a3 There's a pound sign at the
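
To make the hexdump line concrete, here is a small sketch that rebuilds the sixteen bytes quoted above and decodes them two ways. The Latin-1 interpretation is the assumption from the thread; the variable names are illustrative only.

```ruby
# The sixteen bytes from the hexdump line at offset 0560.
hex   = "75 67 68 74 20 66 6f 72 20 75 6e 64 65 72 20 a3"
bytes = [hex.delete(' ')].pack('H*')   # binary string of raw bytes

# Interpreted as UTF-8, the trailing 0xA3 is a stray continuation byte.
as_utf8 = bytes.dup.force_encoding('UTF-8')
puts as_utf8.valid_encoding?

# Interpreted as ISO-8859-1, 0xA3 is the pound sign (U+00A3).
as_latin1 = bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts as_latin1.valid_encoding?
puts as_latin1[-1].ord.to_s(16)
```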

Re: Perl 5.16 vs Ruby 2.0 UTF-8 support

2013-08-22 Thread Paul Makepeace
On Thu, Aug 22, 2013 at 9:15 AM, gvim gvi...@gmail.com wrote: On 22/08/2013 17:05, Paul Makepeace wrote: How about when you hexdump it? I wouldn't know but here's the result of hexdump -C (literal text removed from line end): You're looking for high bits in the characters, as a first
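
Scanning for high bits by eye in a long hexdump is tedious, so the same check can be sketched in a few lines. The helper name `high_bit_offsets` and the sample data are hypothetical; the idea is simply to report the offset of every byte with the top bit set (0x80 or above), since those are the bytes that cannot be plain ASCII.

```ruby
# Return the offsets of all bytes whose high bit is set (>= 0x80).
def high_bit_offsets(data)
  data.each_byte.with_index
      .select { |byte, _offset| byte >= 0x80 }
      .map    { |_byte, offset| offset }
end

# Hypothetical sample: six ASCII bytes followed by 0xA3.
sample = "under \xA3".dup.force_encoding(Encoding::BINARY)

high_bit_offsets(sample).each do |offset|
  printf("offset 0x%04x: byte 0x%02x\n", offset, sample.getbyte(offset))
end
```

An empty result would mean the data is 7-bit ASCII throughout, in which case it would also be valid UTF-8.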