Re: Count Words?

Seymour J Metz Sun, 17 Jun 2018 13:39:02 -0700

What are

    U+00AB      left-pointing double angle quotation mark       &laquo;
    U+00BB      right-pointing double angle quotation mark      &raquo; 
    U+2018      left single quotation mark                      &lsquo;
    U+2019      right single quotation mark                     &rsquo;
    U+201C      left double quotation mark                      &ldquo;
    U+201D      right double quotation mark                     &rdquo;


chopped liver?


--
Shmuel (Seymour J.) Metz
http://mason.gmu.edu/~smetz3

________________________________________
From: IBM Mainframe Assembler List <[email protected]> on behalf 
of Dan Greiner <[email protected]>
Sent: Saturday, June 16, 2018 12:23 AM
To: [email protected]
Subject: Re: Count Words?

As with any multiple-byte operand with no alignment requirements, the second 
operand of TRT (containing the function bytes) can span a cache line or page 
boundary. So, unless the programmer is exceedingly confident of the content of 
the first operand of TRT (i.e., the stuff that's being parsed), she or he MUST 
assume that any byte of the second operand may be accessed. This is not a fault 
of the instruction ... it's just how it is!
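
To make the point concrete, here is a rough Python model of a TRT-style scan. 
It is a simulation under stated assumptions, not assembler and not the 
architected definition: the names build_table and trt are invented for this 
illustration, and the sample data is ASCII rather than EBCDIC. Each 
first-operand byte is used directly as an index into the function-code table, 
so which table bytes get touched depends entirely on the data being parsed.

```python
def build_table(delimiters: bytes, code: int = 1) -> bytes:
    """Return a 256-byte function-code table (hypothetical helper).

    Entries for the given delimiter bytes are nonzero; all others are zero.
    """
    table = bytearray(256)
    for b in delimiters:
        table[b] = code
    return bytes(table)


def trt(first_operand: bytes, table: bytes):
    """Model of a TRT-style scan (illustrative only).

    Returns (index, function_code) for the first byte whose table entry is
    nonzero, or None if every byte maps to a zero function code.
    """
    if len(table) != 256:
        raise ValueError("the table must cover all 256 possible byte values")
    for i, b in enumerate(first_operand):
        fc = table[b]  # the data byte selects the table entry
        if fc != 0:
            return i, fc
    return None


table = build_table(b" ,.")
print(trt(b"hello, world", table))  # (5, 1)
print(trt(b"\xff\xfe ", table))     # (2, 1); entries 0xFF and 0xFE were read
```

The second call shows why a short table is unsafe unless the input is known: 
unexpected high byte values index the far end of the table.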

If you're writing typical application code, that's all you need worry about.  
Sure, the second operand could cross a cache line, requiring a delay to fetch 
the data from a higher-level cache or from main memory.  But, assuming this is 
relatively frequently executed code, once the data are fetched, that's over ... 
and the cache line will stay hot if it continues to be frequently referenced.  
Similarly, for a page-translation exception, if we assume that the program is 
bug-free, then the result will simply be a page fault, the OS will roll in the 
page frame, and (again, assuming frequent execution), the page will stay 
resident.  If the code is not frequently executed, any angst over performance 
is somewhat moot.

As to parsing most languages, delimiting characters usually occur within the 
first 128 bytes for both ASCII and EBCDIC (although EBCDIC alphabetic and 
numeric codes are in the second 128 bytes).  This is also true for UTF-16 
characters; that is, the delimiting characters like common punctuation and 
white space are within the first 128 bytes of the function-code table.  TRTE 
and TRTRE contain an interesting feature that allows you to parse a double-byte 
first operand (e.g., UTF-16), but only require a 256-byte function-code table; 
any first-operand character > 255 is assumed to have a function code of 
zero.  This feature is specifically designed for sifting out common language 
delimiters in 2-byte character sets.
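
The two-byte case can be sketched in Python as well. This is a minimal model, 
assuming big-endian two-byte characters (UTF-16 code units) and a one-byte 
function code per table entry; trte_2byte and count_words are hypothetical 
names, and the sketch ignores the instruction's other controls. Any character 
whose value does not fit in one byte never indexes the table and is treated as 
having function code zero.

```python
import struct


def build_table(delimiters: str, code: int = 1) -> bytes:
    """Return a 256-byte function-code table (hypothetical helper)."""
    table = bytearray(256)
    for ch in delimiters:
        table[ord(ch)] = code  # delimiters must have code points below 256
    return bytes(table)


def trte_2byte(first_operand: bytes, table: bytes):
    """Model of a TRTE-style scan over 2-byte characters (illustrative only).

    Returns (character_index, function_code) for the first character with a
    nonzero function code, or None if there is none.
    """
    if len(first_operand) % 2:
        raise ValueError("operand length must be a multiple of 2")
    for i in range(0, len(first_operand), 2):
        (ch,) = struct.unpack_from(">H", first_operand, i)
        fc = table[ch] if ch <= 255 else 0  # large values: assumed zero
        if fc != 0:
            return i // 2, fc
    return None


def count_words(text: str, table: bytes) -> int:
    """Count runs of non-delimiter characters using repeated scans."""
    data = text.encode("utf-16-be")
    words, pos = 0, 0
    while pos < len(data):
        hit = trte_2byte(data[pos:], table)
        run = (len(data) - pos) // 2 if hit is None else hit[0]
        if run > 0:
            words += 1
        if hit is None:
            break
        pos += 2 * (run + 1)  # skip the word and the delimiter that ended it
    return words


table = build_table(" ,.")
print(count_words("na\u00efve caf\u00e9, \u65e5\u672c\u8a9e text", table))  # 4
```

Under this model a delimiter above U+00FF (for example U+2019) cannot be given 
a nonzero function code, since it never reaches the 256-entry table.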

SHARE session 1245 (from SHARE 113 in Denver, 2009) illustrates (among other 
things) a finite-state parser, comparing TRT versus TRTE. This session also 
contains suggestions on timing various application code fragments, so you can 
figure out for yourself how fast or slow a code sequence really is. If you 
can't find it on the SHARE web site, send me a back-channel note and I'll 
forward a copy.
