What are
U+00AB left-pointing double angle quotation mark «
U+00BB right-pointing double angle quotation mark »
U+2018 left single quotation mark ‘
U+2019 right single quotation mark ’
U+201C left double quotation mark “
U+201D right double quotation mark ”
chopped liver?
--
Shmuel (Seymour J.) Metz
http://mason.gmu.edu/~smetz3
________________________________________
From: IBM Mainframe Assembler List <[email protected]> on behalf
of Dan Greiner <[email protected]>
Sent: Saturday, June 16, 2018 12:23 AM
To: [email protected]
Subject: Re: Count Words?
As with any multiple-byte operand with no alignment requirements, the second
operand of TRT (containing the function bytes) can span a cache line or page
boundary. So, unless the programmer is exceedingly confident of the content of
the first operand of TRT (i.e., the stuff that's being parsed), she or he MUST
assume that any byte of the second operand may be accessed. This is not a fault
of the instruction ... it's just how it is!
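To make the point above concrete, here is a minimal Python sketch of the scan TRT performs. It is a simplified model, not the instruction itself: the name `trt` and the return convention are assumptions made for illustration, and the real instruction reports its result through registers and the condition code. What it shows is that each first-operand byte indexes directly into the table, so any of the 256 entries may be touched unless the input is known in advance.

```python
def trt(first_operand: bytes, table: bytes):
    """Simplified model of a TRT scan (hypothetical helper, not the real ISA).

    Returns (index, function_byte) for the first argument byte whose
    function byte is nonzero, or None if every function byte is zero.
    """
    if len(table) != 256:
        # Any byte value 0..255 can appear in the input, so the table
        # must cover all of them.
        raise ValueError("function table must cover all 256 byte values")
    for i, arg in enumerate(first_operand):
        function_byte = table[arg]  # argument byte is the table index
        if function_byte != 0:
            return i, function_byte
    return None


# Hypothetical table: flag ASCII space and comma as delimiters.
table = bytearray(256)
table[ord(" ")] = 1
table[ord(",")] = 2

hit = trt(b"count,words", bytes(table))
print(hit)
```

A byte such as X'FF' in the input would index the last entry of the table, which is why a shortened table is only safe when the content of the first operand is guaranteed.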
If you're writing typical application code, that's all you need worry about.
Sure, the second operand could cross a cache line, requiring a delay to fetch
the data from a higher-level cache or from main memory. But, assuming this is
relatively frequently executed code, once the data are fetched, that's over ...
and the cache line will stay hot if it continues to be frequently referenced.
Similarly, for a page-translation exception, if we assume that the program is
bug-free, then the result will simply be a page fault, the OS will roll in the
page frame, and (again, assuming frequent execution) the page will stay
resident. If the code is not frequently executed, any angst over performance
is somewhat moot.
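If you want to check whether a particular table placement spans a boundary, the arithmetic is simple. The sketch below is only an illustration: `crosses_boundary` is a made-up helper, the addresses are invented, and the 256-byte cache-line size is an assumption that depends on the machine model (the 4K page size is the usual one).

```python
def crosses_boundary(start: int, length: int, boundary: int) -> bool:
    """True if bytes [start, start + length) lie in more than one
    boundary-sized block (hypothetical helper for illustration)."""
    last = start + length - 1
    return start // boundary != last // boundary


PAGE = 4096  # 4K page
LINE = 256   # assumed cache-line size; model dependent

# A 256-byte function table at some invented addresses.
print(crosses_boundary(0x1000, 256, LINE))  # aligned on a 256-byte boundary
print(crosses_boundary(0x1010, 256, LINE))  # not aligned
print(crosses_boundary(0x1F80, 256, PAGE))  # straddles the end of a page
```

Under these assumptions, a 256-byte table stays within one cache line only when it is aligned on a 256-byte boundary, and that alignment also keeps it within one page.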
As to parsing most languages, delimiting characters usually occur within the
first 128 bytes for both ASCII and EBCDIC (although EBCDIC alphabetic and
numeric codes are in the second 128 bytes). This is also true for UTF-16
characters; that is, the delimiting characters like common punctuation and
white space are within the first 128 bytes of the function-code table. TRTE
and TRTRE contain an interesting feature that allows you to parse a double-byte
first operand (e.g., UTF-16), but only require a 256-byte function-code table;
any first-operand character > 255 is assumed to have a function code of
zero. This feature is specifically designed for sifting out common language
delimiters in 2-byte character sets.
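The limit behavior described above can be sketched in Python as follows. This is a simplified model under stated assumptions: two-byte big-endian argument characters, one-byte function codes, and the limit in effect. The name `trte_limited` and the return convention are hypothetical; the real instruction works through registers and the condition code and is not modeled here.

```python
def trte_limited(first_operand: bytes, table: bytes):
    """Simplified model of a TRTE-style scan over two-byte characters
    with a 256-entry table (hypothetical helper, not the real ISA).

    Returns (byte_offset, function_code) for the first character with
    a nonzero function code, or None if there is none.
    """
    if len(table) != 256:
        raise ValueError("this sketch assumes a 256-entry table")
    if len(first_operand) % 2:
        raise ValueError("first operand must hold whole two-byte characters")
    for off in range(0, len(first_operand), 2):
        ch = int.from_bytes(first_operand[off:off + 2], "big")
        # Characters above 255 never index the table; they are treated
        # as having a function code of zero.
        code = table[ch] if ch <= 255 else 0
        if code != 0:
            return off, code
    return None


# Hypothetical table: flag the space character only.
table = bytearray(256)
table[0x20] = 1

text = "\u65e5\u672c\u8a9e text".encode("utf-16-be")
print(trte_limited(text, bytes(table)))
```

Note that a character such as U+0120 has X'20' as its low-order byte but is still skipped, because the whole two-byte value is compared against the limit before the table is consulted.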
SHARE session 1245 (from SHARE 113 in Denver, 2009) illustrates (among other
things) a finite-state parser, comparing TRT versus TRTE. This session also
contains suggestions on timing various application code fragments, so you can
figure out for yourself how fast or slow a code sequence really is. If you
can't find it on the SHARE web site, send me a back-channel note and I'll
forward a copy.