[Bug 7232] New: Getting rid of 'use bytes' crutches throughout

bugzilla-daemon Thu, 06 Aug 2015 10:53:28 -0700

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7232


            Bug ID: 7232
           Summary: Getting rid of 'use bytes' crutches throughout
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Hardware: All
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Libraries
          Assignee: [email protected]
          Reporter: [email protected]

I'd like to comment out (or delete) the 'use bytes' in all modules,
in preparation for more sensible internal use of Unicode.

So far the historical use of 'use bytes' has already bitten us
at least twice (Bug 7215, and in bayes tokenization a few months ago).
The pragma is sprinkled all over the code, even though it may have
been needed in only a couple of places.


The 'bytes' pragma man page says:


NAME
  bytes - Perl pragma to force byte semantics rather than character
  semantics

NOTICE
  This pragma reflects early attempts to incorporate Unicode into perl
  and has since been superseded. It breaks encapsulation (i.e. it exposes
  the innards of how the perl executable currently happens to store a
  string), and use of this module for anything other than debugging
  purposes is strongly discouraged. If you feel that the functions here
  within might be useful for your application, this possibly indicates a
  mismatch between your mental model of Perl Unicode and the current
  reality. In that case, you may wish to read some of the perl Unicode
  documentation: perluniintro, perlunitut, perlunifaq and perlunicode.


Its use affects the functions ord, chr, length, substr, index and rindex.
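
A minimal sketch of the difference, assuming a string that contains
one character above U+00FF (so perl stores it internally as UTF-8).
The variable names are illustrative and are not taken from the
SpamAssassin code:

```perl
use strict;
use warnings;

# Character string: U+010D followed by two ASCII letters.
# U+010D occupies two octets (0xC4 0x8D) in perl's internal storage.
my $s = "\x{10d}aj";

# Character semantics (the default).
my $chars = length($s);

my ($octets, $first_octet);
{
    use bytes;                  # lexically scoped to this block only
    $octets      = length($s);  # counts the internal octets instead
    $first_octet = ord($s);     # first internal octet, not the code point
}

printf "characters=%d octets=%d first_octet=0x%02X\n",
    $chars, $octets, $first_octet;
# prints: characters=3 octets=4 first_octet=0xC4
```

The same string yields different answers depending only on whether the
pragma is in scope, which is the encapsulation leak the man page warns
about.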

If there is ever a need to convert Unicode characters into UTF-8
octets, it should be done explicitly, e.g. through utf8::encode($s),
possibly guarded by a condition:  if utf8::is_utf8($s)
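
As a sketch of that explicit conversion: to_octets below is a
hypothetical helper written for illustration, not an existing
SpamAssassin function. It relies only on utf8::is_utf8 and
utf8::encode, which are built in and need no 'use utf8':

```perl
use strict;
use warnings;

# Hypothetical helper: return UTF-8 octets for a character string,
# and leave a string that is not flagged as characters untouched.
sub to_octets {
    my ($s) = @_;    # work on a copy; utf8::encode modifies in place
    utf8::encode($s) if utf8::is_utf8($s);
    return $s;
}

my $text   = "na\x{ef}ve \x{263a}";   # 7 characters
my $octets = to_octets($text);        # U+00EF -> 2 octets, U+263A -> 3

printf "%d characters, %d octets\n", length($text), length($octets);
# prints: 7 characters, 10 octets
```

Note that with the is_utf8 guard, a string that perl does not hold as
UTF-8 internally is passed through unchanged, so the guard is only
appropriate where such strings are already known to be octets.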

I believe this explicit encoding has already been done in most
cases where it was necessary. Nevertheless we should keep an eye out
for corner cases that may pop up.


The patch is purely mechanical:
  $ perl -i -pe 's/^(\s*)use\s+bytes\s*;/$1# use bytes;/'
and can be easily reverted if necessary.
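
To illustrate what that substitution does and does not touch, here it
is applied to a few made-up sample lines (the sample lines are
assumptions, not excerpts from the actual modules):

```perl
use strict;
use warnings;

# Made-up input lines covering the interesting cases.
my @lines = (
    "use bytes;\n",        # plain statement
    "  use bytes;\n",      # indented statement
    "use strict;\n",       # unrelated pragma
    "# use bytes;\n",      # already commented out
);

my @patched;
for my $line (@lines) {
    # Same substitution as in the one-liner, applied to a copy.
    (my $copy = $line) =~ s/^(\s*)use\s+bytes\s*;/$1# use bytes;/;
    push @patched, $copy;
}

print @patched;
# prints:
# # use bytes;
#   # use bytes;
# use strict;
# # use bytes;
```

Indentation is preserved, other pragmas are untouched, and a line that
is already commented out is not commented a second time.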

All tests pass (5.22.0 and 5.8.9). In the couple of hours that I have
been running this code (with charset normalization enabled) I haven't
noticed anything unusual (like warnings or changes in bayes
tokenization). There is also no change or slowdown in timing, but
that's expected, as rules are still (mostly?) not exposed to Unicode.

