Internationalization: Comments on Text Segmentation straw man

Norbert Lindenberg Thu, 18 Apr 2013 17:03:44 -0700

In preparation for tomorrow's internationalization ad-hoc meeting, I reviewed 
the Text Segmentation strawman:
http://wiki.ecmascript.org/doku.php?id=globalization:text_segmentation


Some issues go beyond internationalization and into general API design 
questions - I'd appreciate input from the es-discuss crowd on them.

1) Text segmentation often works on small portions of potentially large 
documents. For example, when the user clicks or taps on a word, an application 
may need to find the word being selected. Only the text under the click/tap 
location is of interest, but that text may be part of a large DOM tree (not 
necessarily all within one node!). Using String values as input means that the 
application has to create a string including enough context so that the 
complete desired segment and some text outside the segment are guaranteed to be 
included - typically a paragraph. Some libraries, including ICU, accept 
alternative input types: iterators that can move forward and (unlike ES6 
iterators) backward over strings, or generic interfaces that provide access to 
text. Should our text segmenters allow such alternative input types, or is it 
reasonable to expect callers to construct String values?

2) The strawman includes an extension to String.prototype.split, allowing it to 
accept a TextSegmenter as the separator argument. The only way to detect 
whether this argument will be understood would be indirect, by checking whether 
Intl.TextSegmenter exists. Is that acceptable?
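
A minimal sketch of the indirect check described in point 2. The helper name 
supportsSegmenterSplit is hypothetical, and Intl.TextSegmenter is the name used 
in the strawman, not a shipped API; the check only tells the caller that the 
constructor exists, not that split actually understands it.

```javascript
// Hypothetical helper: reports whether the given Intl-like object exposes
// the strawman's TextSegmenter constructor. This is the only (indirect)
// signal that String.prototype.split might accept a segmenter argument.
function supportsSegmenterSplit(intl) {
  return typeof intl === "object" && intl !== null &&
         typeof intl.TextSegmenter === "function";
}

// Pass the global Intl object if the environment has one.
const hasSegmenterSplit =
  supportsSegmenterSplit(typeof Intl !== "undefined" ? Intl : undefined);
```

Callers would have to fall back to their own splitting logic whenever the check 
returns false.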


Other comments are mostly for the ad-hoc team:

3) It would be useful to have more detail on the use cases, including how the 
text to be segmented might be represented, and whether segments would typically 
be accessed sequentially or randomly. Some indication of how common each use 
case is expected to be in JavaScript applications would also be useful.

4) I think we should drop paragraph breaking. Paragraphs are usually defined by 
the document type - in plain text possibly any CR/LF combination, or two 
consecutive such combinations, or U+2029; in HTML the text within a <p> element 
and various other entities; etc. ES text segmenters shouldn't have to know 
document types.

5) Do we expect tailored grapheme clusters to be supported and commonly used? 
It seems to me that default grapheme clusters should be handled in regular 
expressions, not in this special-purpose API.

6) Line breaks on the other hand should be provided.

7) Is numSegments() really needed? If so, it should be countSegments() or such 
to indicate that it's a rather expensive operation.

8) We need to define more precisely what is meant by the different segment 
types. E.g., the notes on segmentType indicate that whitespace and punctuation 
are returned as separate words - is that generally accepted? Should sequences 
of punctuation or whitespace be treated as one word or multiple? Will line 
breaking report U+00AD as a breaking opportunity? Is it safe to assume that in 
the absence of U+00AD the segments reported are appropriate as input to a 
separate hyphenation engine, or do such engines need more context? Normative 
references to UAX #14 and UAX #29 might help.

9) Why isn't segmentType just a property of the object returned by 
segmentContaining?

10) Should positions be based on UTF-16 code units or Unicode code points? Code 
points seem more logical, but make it more difficult to map to underlying 
strings.
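
To illustrate the mapping cost mentioned in point 10, here is a sketch of what 
a caller would need if positions were reported in code points. The function 
name codePointIndexToCodeUnitIndex is made up for this example and is not part 
of the strawman.

```javascript
// Convert a code point index into the corresponding UTF-16 code unit index
// by walking the string from the start. Indexes past the end are clamped
// to str.length.
function codePointIndexToCodeUnitIndex(str, cpIndex) {
  let unit = 0;
  for (let cp = 0; cp < cpIndex && unit < str.length; cp++) {
    const c = str.charCodeAt(unit);
    // A high surrogate followed by a low surrogate forms one code point
    // that occupies two code units.
    if (c >= 0xD800 && c <= 0xDBFF && unit + 1 < str.length) {
      const d = str.charCodeAt(unit + 1);
      if (d >= 0xDC00 && d <= 0xDFFF) {
        unit += 2;
        continue;
      }
    }
    unit += 1;
  }
  return unit;
}
```

The walk is linear in the position, which is the extra work that code 
unit-based positions would avoid.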

11) Should the end property of the object returned by segmentContaining be the 
index of the last character/code unit included in the segment, or the index 
after that character/code unit? The latter would be more compatible with 
String.prototype.substring.
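
The two conventions in point 11 can be compared with a small sketch. The 
segment objects below are hand-written stand-ins for whatever 
segmentContaining would return; their shape is an assumption.

```javascript
const text = "Hello world";

// Exclusive end: the index after the last code unit of the segment.
// The values can be passed to substring unchanged.
const exclusiveSegment = { start: 6, end: 11 };
const fromExclusive = text.substring(exclusiveSegment.start,
                                     exclusiveSegment.end);

// Inclusive end: the index of the last code unit of the segment.
// Every caller has to remember to add 1.
const inclusiveSegment = { start: 6, end: 10 };
const fromInclusive = text.substring(inclusiveSegment.start,
                                     inclusiveSegment.end + 1);
```

Both calls extract the same word; only the exclusive form matches substring 
without an adjustment.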

12) If the second edition of ECMA-402 is based on ES6, this API should provide 
iterators:
http://wiki.ecmascript.org/doku.php?id=harmony:iterators

13) "Anything else defaults to “word”." should be "Anything else results in a 
RangeError exception."
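
A sketch of the behavior proposed in point 13. The list of type names and the 
helper resolveSegmentType are assumptions for illustration, as is the choice to 
let only an absent value fall back to "word".

```javascript
// Assumed set of type names; the strawman's actual list may differ.
const SEGMENT_TYPES = ["grapheme", "word", "sentence", "line"];

// Hypothetical helper: validates the requested segment type.
function resolveSegmentType(type) {
  if (type === undefined) {
    // Assumption: only an absent value falls back to the default.
    return "word";
  }
  const name = String(type);
  if (SEGMENT_TYPES.indexOf(name) === -1) {
    // Unknown values are rejected instead of silently becoming "word".
    throw new RangeError("Invalid segment type: " + name);
  }
  return name;
}
```

With this, a misspelled type is reported to the caller rather than producing 
word segmentation by accident.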

Norbert

