Re: COBOL was: COBOL Parser
In 4983992839630034.wa.paulgboulderaim@listserv.ua.edu, on 07/19/2013 at 09:15 AM, Paul Gilmartin paulgboul...@aim.com said: Mathematical/scientific notation tends to use single-character identifiers, to be case-sensitive, and to admit characters from Greek, Hebrew, ... Also. face and font are typically significant. -- Shmuel (Seymour J.) Metz, SysProg and JOAT Atid/2http://patriot.net/~shmuel We don't care. We don't have to care, we're Congress. (S877: The Shut up and Eat Your spam act of 2003) -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
Some all but random responses. First, begin extract That's all well and good for two character values but obviously doesn't scale beyond a half-word. /end extract The problem I was addressing was a two-character/halfword one, and I raised it originally to illustrate one egregious way in which the IBM Enterprise COBOL compilers were deficient in their handling of language constructs, specifically EVALUATE for one- and two-character arguments. I chose this example because there are canonical, well understood, and not at all controversial branch-table based schemes available for implementing C switch statements and PL/I select(single character) groups. I was not proposing a panacea, about any and all of which I am generically suspicious. Next, begin extract I'm glad I said median, rather than mode. For the latter frequency statistics would be required. /end extract Frequency statistics are required for the former too. To say otherwise is to confound frequency class-interval definitions with their associated counts. The 'exponential' vs 'logarithmic' issue is more complicated, but here I do concede that these two algorithms are better contrasted using 'polynomial time' and 'logarithmic time'. I come now to begin extract If all the keys are already known then a perfect hash optimized switch statement is one of the most efficient methods I've seen. Especially popular with compiler writers for the symbol table. /end extract It seems to me to mix two topics. The reserved or key words of a statement-level language are already known. The identifiers used in a source program written in such a language are not. (One could of course conjecture that 'I' or 'i' will be used as the identifier of an indexing subscript in many, even most FORTRAN programs; but I am not sure that this sort of thing would be very helpful.) The C language switch statement is defined very specifically and carefully as an SBCS single-character facility. Not at all incidentally, this definition ensures that branch tables can always be used to implement a C switch statement. I do agree, if that be the right word, that hashing schemes have historically been much used for symbol-table management by the writers of compilers and assemblers. (If challenged before the fact, I should also be quite willing to 'agree' that the earth is an oblate spheroid.) John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
On Fri, 19 Jul 2013 08:19:19 -0400, John Gilmore wrote: The C language switch statement is defined very specifically and carefully as an SBCS single-character facility. Not at all incidentally, this definition ensures that branch tables can always be used to implement a C switch statement. FSVO always. From: C A Reference Manual THIRD EDITION Samuel P. Harbison Guy L. Steele Jr. PRENTICE HALL, Englewood Cliffs, NJ 07632 1991 Although ANSI C allows the control expression to be of any integer type, some older compilers do not permit it to be of type *long* or *unsigned long*. ANSI C also permits an implementation to limit the number of separate *case* labels in a *switch* statement to 257 (that is, more than enough to handle all values of type *char*). (Just distinguishing permits from ensures.) -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
Let me make a distinction between values known a fortiori, like those of the keywords or reserved words of a statement-level language and classes of values defined by initial letter, length in characters, or the like. It is correct that IBM REXX treats single-letter identifiers differently from longer ones. MC has said so in two other fora. Organizing a symbol table/dictionary by identifier length is an old idea. For example, the Lowry-Medlock FORTRAN H compiler, to which I have referred here before, kept one-, two-, three-, four-, five- and six-character identifiers is separate subdictionaries. (The feasibility of this scheme rerflects a FORTRAN constraint thasty in turn reflects the architecture of the IBM 704 and its sequelæ. An identifier of at most six six-bit characters could always be kept in a 36-bit register.) There is strong evidence, derived from studies of large samples of source programs, that single-character identifiers are both fewer and referenced much more frequently than longer identifiers, at least in languages in the FORTRAN-ALGOL tradition. This probably reflects the contaminating influence of mathematical notation. (I have more than once been dismayed to see that COBOL programmers, who are largely free of this malign influence and who do not use subscripts/indices a lot, often give them long identifiers like 'in-batch-subgroup-index-position' when they do use them.) John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
On Fri, 19 Jul 2013 09:34:41 -0400, John Gilmore wrote: Organizing a symbol table/dictionary by identifier length is an old idea. For example, the Lowry-Medlock FORTRAN H compiler, to which I have referred here before, kept one-, two-, three-, four-, five- and six-character identifiers is separate subdictionaries. ... This strikes me as naive technology. Beyond regarding the length as a first-level hash index, I see little value in distinguishing among 3-, 4-, 5-, and 6-character identifiers. Hashing and binary search trees probably give better results. There is strong evidence, derived from studies of large samples of source programs, that single-character identifiers are both fewer and referenced much more frequently than longer identifiers, at least in languages in the FORTRAN-ALGOL tradition. There's a strict upper bound on the former; fairly small unless one admits Unicode. ... This probably reflects the contaminating influence of mathematical notation. (I have more than once been dismayed to see that COBOL programmers, who are largely free of this malign influence and who do not use subscripts/indices a lot, often give them long identifiers like 'in-batch-subgroup-index-position' when they do use them.) It's unclear which tradition you favor. Mathematical/scientific notation tends to use single-character identifiers, to be case-sensitive, and to admit characters from Greek, Hebrew, ... alphabets (ψ*, ℵo) (Unicode, again). (We needn't discuss APL.) I have on occasion refactored code to use longer identifiers, both for internal documentation and to facilitate string searches for them. Even in Rexx I place little value on the performance benefits of single character identifiers. (I know; modern development environments obviate the need for searching for identifiers as strings, but such tools are not uniformly available.) -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
I 'favor' the mathematical tradition, but it is still infeasible, in full, for programming. It is likely that in that long run in which I shall be dead it will be feasible. For certain kinds of programming a close, very close, approximation to mathematical notation is already feasible; and it is notably perspicuous, at least for those who have been appropriately socialized. One of the principal reasons for urging, even insisting that programmers learn a number of different statement-level languages and one or more assembly languages is that mastering their different epistemes disentangles convention and necessity. John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
In CAE1XxDHQXLXdSU=uf=ica2nfkudeytf+rj4shxp3uy_m2rp...@mail.gmail.com, on 07/18/2013 at 04:13 PM, John Gilmore jwgli...@gmail.com said: Of course alphabetic is very inefficient. In the Unioted States more thsan half of the business in many insurance lines is in CA', 'NY', and 'TX'. Programmers keep such sets of if-then-elses in lexicographic sequence to simplify their maintenance. Some do. Others devise tools so that the maintenance is still simple but the code is more efficient. -- Shmuel (Seymour J.) Metz, SysProg and JOAT Atid/2http://patriot.net/~shmuel We don't care. We don't have to care, we're Congress. (S877: The Shut up and Eat Your spam act of 2003) -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
On 19 July 2013 08:19, John Gilmore jwgli...@gmail.com wrote: Some all but random responses. [...] The C language switch statement is defined very specifically and carefully as an SBCS single-character facility. Not at all incidentally, this definition ensures that branch tables can always be used to implement a C switch statement. One can argue that in the highly constrained hardware environments where C grew up, a 256-byte branch table was far more of a barely thinkable extravagance than is a 64k one today. I would not be at all surprised to learn of some PDP-11 architectural construct that makes restricting switch statements to single byte values desirable for implementation purposes independent of the ability to use branch tables. Tony H. -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
COBOL optimization has always been problematic for two reasons. The less important of them is that, until now at least, some syntactically interesting features of the language have been implemented very poorly. Some COBOL EVALUATEs are, for example, functionally very like a C switch or a PL/I select(single character). They have not, however, been inplemented in COBOL using a branch table, as such constructs are in C and PL/I, but as if nested IFs had been written. More important, COBOL arrays were for long too small to meet the requirements of many applications. For this reason, the practice of extending these arrays informally by (mis)appropriating the storage immediately following too-small ones for phantom elements addressed using formally illicit subscripts has been common. This is only one of a number of ways in which COBOL has been abused, written around by people who copied coding idioms without understanding them very well. Optimizers, understandably, did not recognize these dubious constructs; and the consequences were unfortunate. For these reasons many mainframe COBOL shops have interdicted the highest levels of optimization that IBM has made available. I think Timothy Sipples is right about the phrase well structured. Well structured COBOL programs do exist, but they are rare and usually short-lived: maintenance tends to convert them into rats' nests---the COBOL idiom is 'spaghetti code'---over time. I suspect that the intent of this IBM language is to suggest that even COBOL programs judged to be well written and efficient will benefit from being optimized. I also suspect that attempts to optimize old code that misuses COBOL language facilities systematically will continue to be problematic. This IBM initiative is nevertheless a very welcome one. It may prompt some mainframe COBOL shops to address economically important performance issues that they have been shirking. John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
On Thu, 18 Jul 2013 08:13:02 -0400, John Gilmore wrote: COBOL optimization has always been problematic for two reasons. The less important of them is that, until now at least, some syntactically interesting features of the language have been implemented very poorly. Some COBOL EVALUATEs are, for example, functionally very like a C switch or a PL/I select(single character). They have not, however, been inplemented in COBOL using a branch table, as such constructs are in C and PL/I, but as if nested IFs had been written. Which might be the proper implementation if the cases are sparse. Best to start with the median case value and bisect recursively. Storage is cheap nowadays, particularly if it's not accessed. But a large and sparse branch table is an invitation to page faults. More important, COBOL arrays were for long too small to meet the requirements of many applications. For this reason, the practice of extending these arrays informally by (mis)appropriating the storage immediately following too-small ones for phantom elements addressed using formally illicit subscripts has been common. In the past, you've deplored as nanny languages (IIRC) those which generate code to prevent such abuses. -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
On 7/18/2013 6:46 AM, Paul Gilmartin wrote: On Thu, 18 Jul 2013 08:13:02 -0400, John Gilmore wrote: COBOL optimization has always been problematic for two reasons. The less important of them is that, until now at least, some syntactically interesting features of the language have been implemented very poorly. Some COBOL EVALUATEs are, for example, functionally very like a C switch or a PL/I select(single character). They have not, however, been inplemented in COBOL using a branch table, as such constructs are in C and PL/I, but as if nested IFs had been written. Which might be the proper implementation if the cases are sparse. Best to start with the median case value and bisect recursively. Storage is cheap nowadays, particularly if it's not accessed. But a large and sparse branch table is an invitation to page faults. I thought in today's systems paging was pretty rare. More important, COBOL arrays were for long too small to meet the requirements of many applications. For this reason, the practice of extending these arrays informally by (mis)appropriating the storage immediately following too-small ones for phantom elements addressed using formally illicit subscripts has been common. In the past, you've deplored as nanny languages (IIRC) those which generate code to prevent such abuses. -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
Paul Gilmartin wrote: begin extract Which might be the proper implementation if the cases are sparse. Best to start with the median case value and bisect recursively. Storage is cheap nowadays, particularly if it's not accessed. But a large and sparse branch table is an invitation to page faults. /end extract I must admit that I am not quite sure what this means. The obvious interpretation is precluded. Compilers--unlike, say, RDBMs--are not ordinarily provided with frequency statistics for case values. If instead it means that a BST can be used to replace a polynomial-time linear-search scheme with an exponential-time binary-search one, I agree. Where it can be used, a branch table is nevertheless a very much better alternative. An example will help here. The American insurance industry is licensed and regulated chiefly, by the individual states, with New York State occupying the position of primus inter pares for historical reasons. Underwriting rules thus differ from state, and not just premium rates but even such minutiæ as penny-breakage rules are different from one state to another. In the upshot the industry has historically made much use of nested-if of the generic form if state-code is 'AL' then . . . else if state-code is 'AK' then . . . else if state-code is 'AR' then . . . else if state-code is 'AZ' then . . . . . . else if state-code is 'DC' then . . . . . . else if state-code is 'PR' then . . . . . else if state-code is 'WV' then . . . else if state-code us 'WY' then . . . else error sink which often occur multiply in the same routine. (I have in my black museum a a routine that contains 127 instances of this scheme.) A single branch-table replacement for these schemes requires a very sparse page-aligned table of 2^16 = 65536 bytes and 1992 bytes of very simple rentrant LE-compatible HLASM code. The table thus occupies 16 4k-byte pages, and careful instrumentation---I was anxious to establish performance benefits---showed that the branch-table scheme wax more than 8000 (sic) times faster than the nested-ifs even after the nested-if seq. Non-transient paging was looked for but not found. There is a tendency to react to TRTO-based schemes---OMG, a 64k table!---with the gut rather than the brain John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
Alphabetic would almost certainly be very inefficient. A minimal optimization would be a binary search of a table or states, or order the states in most to least order for the (1. number of transactions, 2. number of policies, 3. number of agents, 4, population).Setting a subscript once and grabbing the values of the table would be great. And the more values you have the better the binary search becomes. On Thu, Jul 18, 2013 at 2:09 PM, John Gilmore jwgli...@gmail.com wrote: Paul Gilmartin wrote: begin extract Which might be the proper implementation if the cases are sparse. Best to start with the median case value and bisect recursively. Storage is cheap nowadays, particularly if it's not accessed. But a large and sparse branch table is an invitation to page faults. /end extract I must admit that I am not quite sure what this means. The obvious interpretation is precluded. Compilers--unlike, say, RDBMs--are not ordinarily provided with frequency statistics for case values. If instead it means that a BST can be used to replace a polynomial-time linear-search scheme with an exponential-time binary-search one, I agree. Where it can be used, a branch table is nevertheless a very much better alternative. An example will help here. The American insurance industry is licensed and regulated chiefly, by the individual states, with New York State occupying the position of primus inter pares for historical reasons. Underwriting rules thus differ from state, and not just premium rates but even such minutiæ as penny-breakage rules are different from one state to another. In the upshot the industry has historically made much use of nested-if of the generic form if state-code is 'AL' then . . . else if state-code is 'AK' then . . . else if state-code is 'AR' then . . . else if state-code is 'AZ' then . . . . . . else if state-code is 'DC' then . . . . . . else if state-code is 'PR' then . . . . . else if state-code is 'WV' then . . . else if state-code us 'WY' then . . . else error sink which often occur multiply in the same routine. (I have in my black museum a a routine that contains 127 instances of this scheme.) A single branch-table replacement for these schemes requires a very sparse page-aligned table of 2^16 = 65536 bytes and 1992 bytes of very simple rentrant LE-compatible HLASM code. The table thus occupies 16 4k-byte pages, and careful instrumentation---I was anxious to establish performance benefits---showed that the branch-table scheme wax more than 8000 (sic) times faster than the nested-ifs even after the nested-if seq. Non-transient paging was looked for but not found. There is a tendency to react to TRTO-based schemes---OMG, a 64k table!---with the gut rather than the brain John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN -- Mike A Schwab, Springfield IL USA Where do Forest Rangers go to get away from it all? -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
Why not a different algorithm, based on the number of possible alternatives? For instance, if only 2 or 3, then a cascading IF/THEN/ELSE. If over a thousand, maybe a hash table. Just a thought, I'm sure there are many possibilities. On Thu, Jul 18, 2013 at 3:02 PM, Mike Schwab mike.a.sch...@gmail.comwrote: Alphabetic would almost certainly be very inefficient. A minimal optimization would be a binary search of a table or states, or order the states in most to least order for the (1. number of transactions, 2. number of policies, 3. number of agents, 4, population).Setting a subscript once and grabbing the values of the table would be great. And the more values you have the better the binary search becomes. -- This is a test of the Emergency Broadcast System. If this had been an actual emergency, do you really think we'd stick around to tell you? Maranatha! John McKown -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
Of course alphabetic is very inefficient. In the Unioted States more thsan half of the business in many insurance lines is in CA', 'NY', and 'TX'. Programmers keep such sets of if-then-elses in lexicographic sequence to simplify their maintenance. Moreover, they often continue to do so even after they have been shown with hard numbers thay they should not. One of the good things about the branch table is that, for now at least, putting it and its maintenance into assembly language converts it into a black box. John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
On Thu, 18 Jul 2013 15:09:22 -0400, John Gilmore wrote: Paul Gilmartin wrote: begin extract Which might be the proper implementation if the cases are sparse. Best to start with the median case value and bisect recursively. Storage is cheap nowadays, particularly if it's not accessed. But a large and sparse branch table is an invitation to page faults. /end extract I must admit that I am not quite sure what this means. The obvious interpretation is precluded. Compilers--unlike, say, RDBMs--are not ordinarily provided with frequency statistics for case values. If instead it means that a BST can be used to replace a polynomial-time linear-search scheme with an exponential-time binary-search one, I agree. Where it can be used, a branch table is nevertheless a very much better alternative. By sparse I was suggesting a hundred distinct entries among a million otherwise values. At some low density, the nested if becomes preferable. (The nested if can be considered a compiled version of the BST/) I'm glad I said median, rather than mode. For the latter frequency statistics would be required. And in Shmuel mode, I would have said logarithmic rather than exponential. -- gil -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
I am beginning to have a suspicion that some readers at least do not know how branch tables. C switch statements, and the like work. Here, given a state-code of, say, 'NY', its two-byte code point, in EBCDIC x' d5e8', is used as an offset to address a unique byte in that table, one of bytes 0,1, 2,...,65635. A single properly framed TRTO instruction yields the value of that target byte. There is no faster way to do this, and this is the case even on, say, a RISC machine , where it must be programmed [trivially]. John McKown's suggestion that a three way branch should perhaps be implemented as if-then-else-if-then-else-if-then-elseerror sink is entirely correct but not interesting here. COBOL EVALUATE and its analogues in other languages are for many-way branches. John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
That's all well and good for two character values but obviously doesn't scale beyond a half-word. If all the keys are already known then a perfect hash optimized switch statement is one of the most efficient methods I've seen. Especially popular with compiler writers for the symbol table. On 19/07/2013 4:40 AM, John Gilmore wrote: I am beginning to have a suspicion that some readers at least do not know how branch tables. C switch statements, and the like work. Here, given a state-code of, say, 'NY', its two-byte code point, in EBCDIC x' d5e8', is used as an offset to address a unique byte in that table, one of bytes 0,1, 2,...,65635. A single properly framed TRTO instruction yields the value of that target byte. There is no faster way to do this, and this is the case even on, say, a RISC machine , where it must be programmed [trivially]. John McKown's suggestion that a three way branch should perhaps be implemented as if-then-else-if-then-else-if-then-elseerror sink is entirely correct but not interesting here. COBOL EVALUATE and its analogues in other languages are for many-way branches. John Gilmore, Ashland, MA 01721 - USA -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: COBOL was: COBOL Parser
I wouldn't take well structured too literally here. Or, at least, I think it's best to consider the compiler developer's definition of well structured which is at least somewhat different than the application developer's. Again, this is fairly easy stuff to measure -- in the statistical sense anyway. Recompile a few representative programs with the new compiler optimizations enabled then measure the effects. They should be good, sometimes very good. Timothy Sipples GMU VCT Architect Executive (Based in Singapore) E-Mail: sipp...@sg.ibm.com -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN