This Week on perl5-porters (8-14 March 2004)
This week was the can-of-Unicode-worms-festival week for the Perl 5
porters. Regular expressions were another recurrent topic. Read on for
details.
Unicode and UTF-8 coding
The Big Topic of this week was UTF-8, Unicode, and how Perl deals with
it.
This all started with a report about seemingly innocuous UTF-8 failures.
Digging into this deeper, Chip Salzenberg pointed out a flaw in Perl's
handling of Unicode strings: conversions from byte strings (with
"regular" eight-bit chars) to UTF-8 currently map high bit characters to
Unicode without translation (or, depending on how you look at it, by
implicitly assuming the byte strings are in Latin-1). This is
potentially wrong, because Perl assumes the C locale by default. Thus
upgrading a string to UTF-8 may change the meaning of its contents
regarding character classes, case mapping, etc. But this behaviour was
chosen in perl 5.8.x for backwards compatibility.
http://groups.google.com/groups?selm=20040310170059.GE2455%40perlsupport.com
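To make the problem concrete, here is a minimal sketch of how upgrading can change the meaning of a string. It assumes a perl where neither "use feature 'unicode_strings'" nor the /u regexp flag is in effect; the variable names are illustrative only.

```perl
use strict;
use warnings;

# A one-character byte string holding 0xE9 (e-acute if read as Latin-1).
my $bytes = "\xE9";

# A copy upgraded to perl's internal UTF-8 representation.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# The two strings still compare equal, character for character ...
my $same = ($bytes eq $upgraded) ? 1 : 0;

# ... but \w is tested with byte semantics on the first string and
# with Unicode semantics on the second.
my $word_before = ($bytes    =~ /\w/) ? 1 : 0;
my $word_after  = ($upgraded =~ /\w/) ? 1 : 0;

printf "equal=%d  \\w before=%d  \\w after=%d\n",
    $same, $word_before, $word_after;
```

Under those assumptions the two strings are equal, yet only the upgraded copy is treated as a word character.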
Jarkko Hietaniemi, former 5.8 pumpking and Unicode guru, stepped into
the discussion and provided insight. Various solutions were proposed and
discussed.
http://groups.google.com/groups?selm=40524044.1090704%40iki.fi
Should upgrading byte strings that contain characters in the range
0x80-0xFF be forbidden, or should it emit a warning? Autrijus Tang,
deciding to speak in code, released a module on CPAN that implements the
latter solution, and would like it to be integrated into the core at some
point in the 5.9 development track. This would also require turning the
"encoding" pragma into a lexically scoped one (as "locale" currently
is).
http://groups.google.com/groups?selm=1079275492.4005.8.camel%40localhost
http://search.cpan.org/~autrijus/encoding-warnings-0.03/lib/encoding/warnings.pm
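The situation such a warning targets can be sketched in plain Perl. This is not how encoding::warnings is implemented; has_ambiguous_high_bytes() is a hypothetical helper written for this illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $str is a byte string containing at
# least one character in the range 0x80-0xFF, i.e. a string whose
# meaning depends on the Latin-1 assumption if it is ever upgraded.
sub has_ambiguous_high_bytes {
    my ($str) = @_;
    return 0 if utf8::is_utf8($str);
    return $str =~ /[\x80-\xFF]/ ? 1 : 0;
}

my $bytes = "caf\xE9";      # byte string with one high-bit character
my $chars = "\x{263A}";     # character string (code point above 0xFF)

my $ambiguous = has_ambiguous_high_bytes($bytes);

# Concatenation silently upgrades the contents of $bytes.
my $joined         = $bytes . $chars;
my $joined_is_utf8 = utf8::is_utf8($joined) ? 1 : 0;

warn "high-bit bytes were upgraded as if they were Latin-1\n"
    if $ambiguous && $joined_is_utf8;
```

A check of this kind has to be made by hand before each concatenation, which is why doing it inside perl itself, under a pragma, is attractive.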
While we're at it, Nick Ing-Simmons wonders what the proper method is for
XS coders to get UTF-8 data (without converting an SV to UTF-8 in place,
which is considered a Bad Thing). Sadahiro Tomoyuki provides some
answers.
http://groups.google.com/groups?selm=20040307181606.2729.7%40llama.ing-simmons.net
substr() lvalues
Ton Hospel reported (some time ago) bug #24346, concerning the behaviour
of the return value of substr() when it is used as an lvalue. He points
out, with examples, that the current situation is not satisfactory,
because the lvalue acts as a fixed-length window. In some cases this
causes surprising action at a distance, making a variable (coming from
the result of a substr()) hold a value different from the one that was
assigned to it.
Graham Barr fixed this problem. Nicholas is apparently still undecided
about whether this should go in perl 5.8 or not, in the absence of any
good argument for or against.
http://groups.google.com/groups?selm=rt-24346-66654.1.8290615224722%40rt.perl.org
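The following sketch shows the kind of code affected. It aliases $_ to the lvalue returned by substr() and assigns through it twice; the comments describe what a perl with the fix is expected to do, and what a fixed-length window would do instead.

```perl
use strict;
use warnings;

my $x = '1234';
my @seen;

# foreach aliases $_ to the lvalue returned by substr().
for (substr($x, 1, 2)) {
    $_ = 'a';       # replaces "23"; $x becomes "1a4"
    push @seen, $x;

    # Second assignment through the same lvalue. With the fix, the
    # window now covers just "a", giving "1xyz4". A fixed two-character
    # window would cover "a4" and give "1xyz" instead.
    $_ = 'xyz';
    push @seen, $x;
}

print "$_\n" for @seen;
```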
Regular expression bugs
Hugo reports that Damian reported that "use re 'eval'" is not seen in
patterns interpolated at run-time via /(??{...})/. Yitzchak
Scott-Thoennes explains that this comes from the fact that this
compile-time pragma setting is no longer seen at run-time (one more
reason to rewrite the support for pragmas in the core).
http://groups.google.com/groups?selm=200403100343.i2A3hWP03026%40zen.crypt.org
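For readers unfamiliar with the pragma, here is a minimal sketch of what "use re 'eval'" normally permits. It shows only the basic behaviour with a pattern interpolated from a string, not the reported failure inside /(??{...})/; the variable names are illustrative.

```perl
use strict;
use warnings;

our $hits = 0;
my $code = '(?{ $main::hits++ })';   # a code block held in a plain string

# Without the pragma, perl refuses to compile a pattern that acquires
# a code block through run-time interpolation, and dies.
my $refused = eval { my $m = ("a" =~ /a$code/); 1 } ? 0 : 1;

# With the pragma in lexical scope, the same pattern is accepted and
# the code block runs once during the match.
my $matched;
{
    use re 'eval';
    $matched = ("a" =~ /a$code/) ? 1 : 0;
}

print "refused=$refused matched=$matched hits=$hits\n";
```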
Hugo also reports a case of an incorrect regexp compilation warning (bug
#27603) with /(??{...})/ blocks:
http://groups.google.com/groups?selm=rt-3.0.8-27603-81805.2.6610882472044%40perl.org
Jamie Lokier found a bug in the regular expression engine, more
precisely in the optimisation pass (bug #27515), leading to wrong
interpretation of the regular expression /^(.*)(?=x)x/. Hugo confirmed
that this was a known bug, possibly difficult to fix.
http://groups.google.com/groups?selm=rt-3.0.8-27515-81033.1.09945237479955%40perl.org
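As a reference point, this sketch records what the pattern should do according to ordinary regexp semantics: the greedy (.*) backs off until the lookahead and the literal x can both match. The sample inputs are made up for illustration and are not taken from the bug report.

```perl
use strict;
use warnings;

my %capture;
for my $input ('abx', 'axbx', 'abc') {
    # $1 should be everything before the last "x"; no "x", no match.
    $capture{$input} = ($input =~ /^(.*)(?=x)x/) ? $1 : undef;
}

for my $input (sort keys %capture) {
    my $got = defined $capture{$input} ? "'$capture{$input}'" : 'no match';
    print "$input: $got\n";
}
```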
Jamie also found that using return() from a /(?{...})/ block may lead
to a segmentation fault (bug #27595). Such blocks are considered
completely broken by the higher authorities (Dave Mitchell) and will
hopefully be reimplemented.
Other bugs (and fixes)
Rafael reports that the source-filter-based Switch module is confused
by occurrences of the ($) function prototype in the filtered source (bug
#27472).
Chip Salzenberg fixed the line-buffering problem noticed by Stas Bekman
last week.
Paul Kramer remarks that one can't change the ownership of a symlink
with perl's chown() built-in. Rafael suggests adding lchown() to the
POSIX module, which already contains chown() (bug #27547).
Nicholas Clark proposed a load of patches for Storable: fixes for
storing restricted hashes, references to "undef", plus a space
optimization. (Bug #27616.)
Releases
Arthur Bergman released the second development release of Ponie, which
seems to be impressive so far.
http://groups.google.com/groups?selm=EBE5CF35-7445-11D8-8D57-000A95A2734C%40nanisky.com
Tels released new versions of his math packages, Math::BigInt v1.70,
bignum 0.15, and Math::BigRat 0.12.
About this summary
This summary was written by Rafael Garcia-Suarez. Weekly summaries are
published on http://use.perl.org/ and posted on a mailing list, whose
subscription address is [EMAIL PROTECTED]. Comments and
corrections are welcome.