I'm trying to work out the meaning of TUS 8.0 Section 23.2. To do Thai word breaking properly, one needs to do a semantic analysis of the text, the equivalent of resolving 'humanevents' into 'human events' rather than 'humane vents'. One also needs to cope with unknown and misspelt words. (A lot of effort has been devoted to avoiding the extreme of doing semantic analysis.) However, it is possible to read Section 23.2 as prohibiting the use of certain information, and I would like to check whether this is the intended meaning.
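
To make the 'humanevents' ambiguity concrete, here is a minimal Python sketch of purely dictionary-driven segmentation. It is an illustration only, not how any particular Thai word breaker works: the function name `segmentations` and the four-word English lexicon are hypothetical, chosen just to reproduce the example above.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words drawn from `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this prefix as a word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical toy lexicon; a real breaker would use a large dictionary.
LEXICON = {"human", "humane", "events", "vents"}

candidates = segmentations("humanevents", LEXICON)
for words in candidates:
    print(" ".join(words))
```

The dictionary alone yields both readings, so something beyond it (frequency, context, or semantics) has to choose between them. A string containing an unknown or misspelt word, such as 'humanxevents', yields no segmentation at all with this naive approach, which is the other difficulty mentioned above.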

The opening paragraph seems clear enough on first reading:

"The effect of layout controls is specific to particular text processes. As much as possible, layout controls are transparent to those text processes for which they were not intended. In other words, their effects are mutually orthogonal."

However, my first question is, "Are paragraph boundaries directly admissible as evidence for or against word boundaries not adjacent to them?". For example, most Thai word breakers would not regard a paragraph boundary as any more significant than a phrase-delimiting space. However, a paragraph boundary often indicates a change of topic.

My second question is, "Are line breaks admissible as evidence for or against word boundaries not adjacent to them?". For example, if a phrase makes heavy use of U+200B ZERO WIDTH SPACE (ZWSP), one may deduce that it is likely that all word boundaries within it are marked explicitly. This example is more useful for Khmer than for Thai, for whereas Cambodians were once taught to mark word boundaries, Thais rarely use ZWSP to mark word boundaries.

My third question is, "Is the absence of a line break opportunity admissible as evidence for or against a word boundary?". Here I see conflicting signals. There is a character U+2060 WORD JOINER (WJ) which *used* to be regarded as the counterpart of ZWSP. The understanding was that ZWSP marked a word boundary and provided a line-break opportunity, while WJ denied both. This, however, is no longer the case. To quote the TUS section about WJ:

P1: (Ignored)
P2S1: The word joiner must not be confused with the zero width joiner or the combining grapheme joiner, which have very different functions.
P2S2: In particular, inserting a word joiner between two characters has no effect on their ligating and cursive joining behavior.
P2S3: The word joiner should be ignored in contexts other than line breaking.
P2S4: Note in particular that the word joiner is ignored for word segmentation.
P2S5: (See Unicode Standard Annex #29, “Unicode Text Segmentation.”)

Paragraph 2 Sentence 3 (P2S3) appears to rule out the use of WJ in word-breaking, but perhaps it does not if line-breaking is being used as evidence for word boundaries. P2S4 has three very different interpretations:

(i) This is an assertion of fact, and may therefore be incorrect.
(ii) The word 'is' is sloppy wording for 'should be'. Section 23.2 contains much sloppier wording, as I have already advised members of the UTC (4 July 2015).
(iii) This is a deduction from other parts of the specification.

Now, if P2S4 said 'is normally ignored for word segmentation', that would have made sense, for that applies to the default word boundary specification in UAX#29. However, just before Section 4.1, UAX#29 explains that it does not specify what happens for word boundary determination in Thai! (It does constrain what happens, though.) At the end of UAX#29 Section 6.2, there is the provision, "The Ignore rules should not be overridden by tailorings, with the possible exception of remapping some of the Format characters to other classes."

To accord with the user perceptions of Unicode-aware people who work with SE Asian scripts, I am tempted to ask for CLDR to tailor the word-breaking algorithms for the corresponding languages so that the word-breaking classes of WJ (and ZWNBSP) are changed from Format to MidLetter. That would match the widespread old *perception* that there should be no word break in a sequence <Thai letter, (Thai mark,)* WJ, Thai letter>. However, there are several objections:

(a) Perhaps P2S3 and P2S4 prohibit this.
(b) If the word-break property of Thai letters falls back to Other, there would still be a word break between them.
(c) If the word-break property of Thai letters fell back to ALetter, an old suggestion, WJ would have no effect on the presence of a word break.
(d) If Thai word breaking assigns word-break classes to each letter (gc=Lo), then word boundaries can be suppressed by choosing the classes appropriately. Non-spacing Thai vowels are very relevant to Thai word-breaking, but formally are 'ignored'. WJ could be 'ignored' in exactly the same way.

Richard.