Norbert--

Finally had a chance to read this in detail and respond to it. Sorry it took so long, and sorry I couldn't make it to the last ad-hoc meeting; let's just say things have been stressful here at work recently. I still haven't had a chance to look at the minutes from the ad-hoc meeting; I hope we haven't progressed so far that it's not worth responding to this email anymore.

> 1) Text segmentation often works on small portions of potentially large
> documents. For example, when the user clicks or taps on a word, an
> application may need to find the word being selected. Only the text under the
> click/tap location is of interest, but that text may be part of a large DOM
> tree (not necessarily all within one node!). Using String values as input
> means that the application has to create a string including enough context so
> that the complete desired segment and some text outside the segment is
> guaranteed to be included - typically a paragraph. Some libraries, including
> ICU, accept alternative input types: iterators that can move forward and
> (unlike ES6 iterators) backward over strings, or generic interfaces that
> provide access to text. Should our text segmenters allow such alternative
> input types, or is it reasonable to expect callers to construct String values?

I see your point here, and I think I agree with it. Do we have any kind of precedent in JS for what an abstract interface providing access to a potentially large body of text might look like? That might be a bigger design job than the actual segmenter interface.

> 2) The strawman includes an extension to String.prototype.split, allowing it
> to accept a TextSegmenter as the separator argument. The only way to detect
> whether this argument will be understood would be indirect, by checking
> whether Intl.TextSegmenter exists. Is that acceptable?

I think I'd be cool with that. Is there a consensus on this issue?

> 3) It would be useful to have more detail on the use cases, including how the
> text to be segmented might be represented, and whether segments would
> typically be accessed sequentially or randomly. Some indication of how common
> each use case is expected to be in JavaScript applications would also be
> useful.

I concur, and I don't think I'm qualified to provide this.

> 4) I think we should drop paragraph breaking. Paragraphs are usually defined
> by the document type - in plain text possibly any CR/LF combination, or two
> consecutive such combinations, or U+2029; in HTML the text within a <p>
> element and various other entities; etc. ES text segmenters shouldn't have to
> know document types.

That's fine with me.

> 5) Do we expect tailored grapheme clusters to be supported and commonly used?
> It seems to me that default grapheme clusters should be handled in regular
> expressions, not in this special-purpose API.

Yeah, maybe you're right.

> 6) Line breaks on the other hand should be provided.

They're in there: one of the segmentType values is lineBreak (or did someone after me add that?).

> 7) Is numSegments() really needed? If so, it should be countSegments() or
> such to indicate that it's a rather expensive operation.

The name change is reasonable. You may be right that we don't need it, but it was kind of a pain in the butt to do with segmentContaining().

> 8) We need to define more precisely what is meant by the different segment
> types. E.g., The notes on segmentType indicate that whitespace and
> punctuation are returned as separate words - is that generally accepted?

I was basing this on my experience with ICU. The ICU BreakIterator class didn't give you a way to have stuff between the segments, so the things between the boundaries the word-break iterator returned might or might not be "words." I think it treated individual punctuation marks as "words" unto themselves and runs of whitespace as single "words."

> Should sequences of punctuation or whitespace be treated as one word or
> multiple?

See above; I think ICU treats runs of whitespace as "words," but single punctuation marks and symbols as "words."

> Will line breaking report U+00AD as a breaking opportunity?

I would assume so; yes.

> Is it safe to assume that in the absence of U+00AD the segments reported are
> appropriate as input to a separate hyphenation engine, or do such engines
> need more context?

I don't know; I would assume this depends on the sophistication of the engine, and that more sophisticated engines would need more context, possibly the whole paragraph.

> Normative references to UTRs 14 and 29 might help.

Agreed.

> 9) Why isn't segmentType just a property of the object returned by
> segmentContaining?

Duh. That's definitely better than what I have.

> 10) Should positions be based on UTF-16 code units or Unicode code points?
> Code points seem more logical, but make it more difficult to map to
> underlying strings.

Aren't we still indexing strings based on UTF-16 code units? If so, I'd think that's what you'd have to use here. Regardless, I think the numbers have to be whatever we're normally indexing strings with.

> 11) Should the end property of the object returned by segmentContaining be
> the index of the last character/code unit included in the segment, or the
> index after that character/code unit? The latter would be more compatible
> with String.prototype.substring.

It should be the first index after the end of the segment so that it works with substring().

> 12) If the second edition of ECMA-402 is based on ES6, this API should
> provide iterators:
> http://wiki.ecmascript.org/doku.php?id=harmony:iterators

Okay.

> 13) "Anything else defaults to “word”." should be "Anything else results in a
> RangeError exception."

Okay.

--Rich
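
To make points 9 through 11 concrete, here is a minimal sketch of what a result object with UTF-16 code-unit positions and an exclusive end index could look like. Everything in it is an assumption for illustration: the standalone segmentContaining() function, the { start, end, segmentType } shape, and the "word"/"whitespace" values are hypothetical, and the whitespace-run logic is a crude stand-in, not the strawman's algorithm or UAX #29 word breaking.

```javascript
// Hypothetical stand-in for segmentContaining(). It treats each maximal run of
// whitespace, or of non-whitespace, as one segment. Real word breaking is far
// more involved; this only demonstrates the indexing conventions.
function segmentContaining(text, index) {
  if (index < 0 || index >= text.length) {
    throw new RangeError("index out of range");
  }
  const isSpace = (ch) => /\s/.test(ch);
  const kind = isSpace(text[index]);

  // Walk outward in UTF-16 code units, the same units String indexing uses.
  let start = index;
  while (start > 0 && isSpace(text[start - 1]) === kind) start--;
  let end = index + 1; // exclusive: first index after the segment
  while (end < text.length && isSpace(text[end]) === kind) end++;

  // segmentType is a property of the result object (point 9).
  return { start, end, segmentType: kind ? "whitespace" : "word" };
}

// U+1F600 occupies two UTF-16 code units, so the middle "word" spans 3..7.
const text = "hi \u{1F600}\u{1F600} there";
const seg = segmentContaining(text, 4); // index 4 is a low surrogate

// An exclusive end plugs straight into substring() with no +1 adjustment.
const word = text.substring(seg.start, seg.end);

console.log(seg.start, seg.end);  // 3 7
console.log(word.length);         // 4 (code units)
console.log([...word].length);    // 2 (code points)
```

With code-point positions instead, the same segment would be reported as 3..5, and the caller would have to convert back to code units before calling substring().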

