https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7232
Bug ID: 7232
Summary: Getting rid of 'use bytes' crutches throughout
Product: Spamassassin
Version: SVN Trunk (Latest Devel Version)
Hardware: All
OS: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Libraries
Assignee: [email protected]
Reporter: [email protected]
I'd like to comment-out (or delete) the 'use bytes' in all modules,
in preparation for a more sensible Unicode use internally.
So far the historical use of 'use bytes' has already bitten us
at least twice (Bug 7215, and in bayes tokenization a few months ago).
It is sprinkled all over the place, even though it may have been
needed in only a couple of places.
The 'bytes' pragma man page says:
  NAME
    bytes - Perl pragma to force byte semantics rather than character
    semantics
  NOTICE
    This pragma reflects early attempts to incorporate Unicode into perl
    and has since been superseded. It breaks encapsulation (i.e. it exposes
    the innards of how the perl executable currently happens to store a
    string), and use of this module for anything other than debugging
    purposes is strongly discouraged. If you feel that the functions here
    within might be useful for your application, this possibly indicates a
    mismatch between your mental model of Perl Unicode and the current
    reality. In that case, you may wish to read some of the perl Unicode
    documentation: perluniintro, perlunitut, perlunifaq and perlunicode.
Its use affects functions ord, chr, length, substr, index, rindex.
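To make the effect concrete, here is a minimal sketch (not taken from the
SpamAssassin code; the variable names are purely illustrative) showing how
length() changes its answer for the same string inside a 'use bytes' scope:

```perl
use strict;
use warnings;

# One character, U+263A, which perl stores internally as three UTF-8 octets.
my $s = "\x{263A}";

# Character semantics (the default): counts characters.
my $chars = length($s);

my $octets;
{
    use bytes;              # lexically scoped to this block
    # Byte semantics: counts the octets of perl's internal representation.
    $octets = length($s);
}

print "chars=$chars octets=$octets\n";   # prints "chars=1 octets=3"
```

The second number depends on how perl happens to store the string, which
is exactly the encapsulation leak the man page warns about.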
If there is ever a need to convert Unicode into UTF-8 octets,
it should be done explicitly, e.g. through utf8::encode($s),
possibly conditionalized by: if utf8::is_utf8($s)
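A minimal sketch of that explicit conversion follows. The helper name
to_octets is hypothetical (it is not an existing SpamAssassin function);
only utf8::encode and utf8::is_utf8 are real core functions:

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the argument as UTF-8 octets.
# Strings without the internal UTF-8 flag are assumed to be octets
# already and are returned unchanged.
sub to_octets {
    my ($s) = @_;                           # work on a copy
    utf8::encode($s) if utf8::is_utf8($s);  # encodes the copy in place
    return $s;
}

my $text   = "caf\x{E9} \x{263A}";          # 6 characters
my $octets = to_octets($text);              # 9 octets once encoded

printf "%d characters, %d octets\n", length($text), length($octets);
```

Note that no 'use bytes' is needed here: after encoding, length() counts
octets simply because each element of the string is now one octet.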
I believe this explicit encoding has already been done in most
cases where it was necessary. Nevertheless, we should keep an eye open
for corner cases that may pop up.
The patch is purely mechanical:
$ perl -i -pe 's/^(\s*)use\s+bytes\s*;/$1# use bytes;/'
and can be easily reverted if necessary.
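As an illustration of what the substitution does and does not touch, the
same regex can be applied to a few sample lines (the sample lines are made
up for this sketch, not taken from the source tree):

```perl
use strict;
use warnings;

my @lines = (
    "use bytes;\n",        # plain statement: gets commented out
    "  use bytes ;\n",     # indented, extra space: indentation is kept
    "use strict;\n",       # other pragmas: untouched
    "# use bytes;\n",      # already commented: untouched
);

# Same substitution as in the one-liner, applied to each sample line.
s/^(\s*)use\s+bytes\s*;/$1# use bytes;/ for @lines;

print @lines;
```

Only the first two lines change, to "# use bytes;" and "  # use bytes;"
respectively.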
All tests pass (Perl 5.22.0 and 5.8.9). In the couple of hours that
I have been running this code (with charset normalization enabled),
I haven't noticed anything unusual (such as warnings or changes in
bayes tokenization). There is also no change or slowdown in timing,
but that's expected, as rules are still (mostly?) not yet exposed
to Unicode.