Re: Internationalization: Comments on Text Segmentation straw man

Gillam, Richard Mon, 29 Apr 2013 18:20:04 -0700

Norbert--

Finally had a chance to read this in detail and respond to it.  Sorry it took so 
long, and sorry I couldn't make it to the last ad-hoc meeting; let's just say 
things have been stressful here at work recently.  I still haven't had a chance 
to look at the minutes from the ad-hoc meeting; I hope we haven't progressed so 
far that it's not worth responding to this email anymore.


> 1) Text segmentation often works on small portions of potentially large 
> documents. For example, when the user clicks or taps on a word, an 
> application may need to find the word being selected. Only the text under the 
> click/tap location is of interest, but that text may be part of a large DOM 
> tree (not necessarily all within one node!). Using String values as input 
> means that the application has to create a string including enough context so 
> that the complete desired segment and some text outside the segment is 
> guaranteed to be included - typically a paragraph. Some libraries, including 
> ICU, accept alternative input types: iterators that can move forward and 
> (unlike ES6 iterators) backward over strings, or generic interfaces that 
> provide access to text. Should our text segmenters allow such alternative 
> input types, or is it reasonable to expect callers to construct String values?

I see your point here, and I think I agree with it.  Do we have any kind of 
precedent in JS for what an abstract interface providing access to a 
potentially large body of text might look like?  That might be a bigger design 
job than the actual segmenter interface.

> 2) The strawman includes an extension to String.prototype.split, allowing it 
> to accept a TextSegmenter as the separator argument. The only way to detect 
> whether this argument will be understood would be indirect, by checking 
> whether Intl.TextSegmenter exists. Is that acceptable?

I think I'd be cool with that.  Is there a consensus on this issue?
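
A minimal sketch of what that indirect check might look like in calling code. The constructor arguments and the `segmentType` option shape are assumptions based on the strawman, and `supportsSegmenterSplit` and `splitWords` are hypothetical helper names:

```javascript
// Hypothetical helper: the only (indirect) signal that split() understands a
// segmenter argument is the presence of Intl.TextSegmenter itself.
function supportsSegmenterSplit() {
  return typeof Intl === "object" && Intl !== null &&
         typeof Intl.TextSegmenter === "function";
}

function splitWords(text) {
  if (supportsSegmenterSplit()) {
    // Assumed strawman behaviour: split() accepts a segmenter as the separator.
    // The constructor signature here is a guess, not a settled API.
    return text.split(new Intl.TextSegmenter("en", { segmentType: "word" }));
  }
  // Fallback when the strawman API is absent: crude whitespace split.
  return text.split(/\s+/).filter(Boolean);
}
```

The weakness of this pattern is that the check and the feature are only loosely coupled: an implementation could ship one without the other.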

> 3) It would be useful to have more detail on the use cases, including how the 
> text to be segmented might be represented, and whether segments would 
> typically be accessed sequentially or randomly. Some indication of how common 
> each use case is expected to be in JavaScript applications would also be 
> useful.

I concur, and I don't think I'm qualified to provide this.

> 4) I think we should drop paragraph breaking. Paragraphs are usually defined 
> by the document type - in plain text possibly any CR/LF combination, or two 
> consecutive such combinations, or U+2029; in HTML the text within a <p> 
> element and various other entities; etc. ES text segmenters shouldn't have to 
> know document types.

That's fine with me.

> 5) Do we expect tailored grapheme clusters to be supported and commonly used? 
> It seems to me that default grapheme clusters should be handled in regular 
> expressions, not in this special-purpose API.

Yeah, maybe you're right.

> 6) Line breaks on the other hand should be provided.

They're in there: one of the segmentType values is lineBreak (or did someone 
after me add that?).

> 7) Is numSegments() really needed? If so, it should be countSegments() or 
> such to indicate that it's a rather expensive operation.

The name change is reasonable.  You may be right that we don't need it, but 
counting segments was kind of a pain in the butt to do with segmentContaining().

> 8) We need to define more precisely what is meant by the different segment 
> types. E.g., The notes on segmentType indicate that whitespace and 
> punctuation are returned as separate words - is that generally accepted?

I was basing this on my experience with ICU.  The ICU BreakIterator class 
didn't give you a way to have stuff between the segments, so the things between 
the boundaries the word-break iterator returned might or might not be "words."  
I think it treated individual punctuation marks as "words" unto themselves and 
runs of whitespace as single "words."
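
To make the behavior described above concrete, here is a rough sketch that mimics it with a regular expression. It is not ICU's algorithm or the UAX #29 rules; `roughWordSegments` is a hypothetical name, and the pattern only approximates "letters and digits form a word, a run of whitespace is one segment, each punctuation mark or symbol is its own segment":

```javascript
// Approximation only: real word breaking follows UAX #29 and is far richer.
function roughWordSegments(text) {
  // 1) a run of whitespace, 2) a run of letters/digits, 3) any single other
  // code point (punctuation, symbols). The "u" flag makes matching
  // code-point based, so a supplementary symbol is not split in half.
  const pattern = /\s+|[\p{L}\p{N}]+|[^\s\p{L}\p{N}]/gu;
  return text.match(pattern) || [];
}

console.log(roughWordSegments("Hi,  there!"));
// → [ 'Hi', ',', '  ', 'there', '!' ]
```

Note that every character of the input lands in exactly one segment, which is the "nothing between the segments" property mentioned above.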

> Should sequences of punctuation or whitespace be treated as one word or 
> multiple?

See above; I think ICU treats runs of whitespace as "words," but single 
punctuation marks and symbols as "words."

> Will line breaking report U+00AD as a breaking opportunity?

I would assume so; yes.

> Is it safe to assume that in the absence of U+00AD the segments reported are 
> appropriate as input to a separate hyphenation engine, or do such engines 
> need more context?

I don't know; I would assume this depends on the sophistication of the engine, 
and that more sophisticated engines would need more context, possibly the whole 
paragraph.

> Normative references to UTRs 14 and 29 might help.

Agreed.

> 9) Why isn't segmentType just a property of the object returned by 
> segmentContaining?

Duh.  That's definitely better than what I have.

> 10) Should positions be based on UTF-16 code units or Unicode code points? 
> Code points seem more logical, but make it more difficult to map to 
> underlying strings.

Aren't we still indexing strings based on UTF-16 code units?  If so, I'd think 
that's what you'd have to use here.  Regardless, I think the numbers have to be 
whatever we're normally indexing strings with.
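
A small illustration of why the two index spaces differ, and why code-unit positions map directly onto existing string methods. `codePointIndex` is a hypothetical helper, not part of any proposal:

```javascript
// U+1F600 lies outside the BMP, so it occupies two UTF-16 code units.
const sample = "a\u{1F600}b";

console.log(sample.length);             // → 4 (code units)
console.log(Array.from(sample).length); // → 3 (code points)
console.log(sample.indexOf("b"));       // → 3 (code-unit index)

// Hypothetical helper: convert a code-unit index into a code-point index by
// counting the code points that precede it. This is O(n) per call.
function codePointIndex(text, codeUnitIndex) {
  return Array.from(text.slice(0, codeUnitIndex)).length;
}

console.log(codePointIndex(sample, 3)); // → 2
```

If positions were reported in code points, every call to substring() or charAt() would need the reverse conversion first.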

> 11) Should the end property of the object returned by segmentContaining be 
> the index of the last character/code unit included in the segment, or the 
> index after that character/code unit? The latter would be more compatible 
> with String.prototype.substring.

It should be the first index after the end of the segment so that it works with 
substring().
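
For illustration, assuming a segment object with `start`, `end`, and `segmentType` properties (the exact shape is an assumption, not something the strawman has settled), an exclusive `end` plugs straight into substring():

```javascript
const text = "Hello world";

// Hypothetical result of segmentContaining(7): "end" is the index just past
// the last code unit of the segment.
const segment = { start: 6, end: 11, segmentType: "word" };

console.log(text.substring(segment.start, segment.end)); // → "world"
console.log(segment.end - segment.start);                // → 5 (segment length)

// With an inclusive end (10 here), callers would need "+ 1" everywhere:
const inclusiveEnd = 10;
console.log(text.substring(segment.start, inclusiveEnd + 1)); // → "world"
```

An exclusive end also means the `end` of one segment equals the `start` of the next, so adjacent segments tile the text with no arithmetic.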

> 12) If the second edition of ECMA-402 is based on ES6, this API should 
> provide iterators:
> http://wiki.ecmascript.org/doku.php?id=harmony:iterators

Okay.
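
One possible shape for this, sketched as an ES6 generator. The segmentation rule here (alternating runs of whitespace and non-whitespace) is a stand-in so the example runs on its own; `segmentsOf` and the yielded object shape are assumptions:

```javascript
// Hypothetical iterator over segments; a real segmenter would apply
// locale-sensitive rules instead of this whitespace/non-whitespace split.
function* segmentsOf(text) {
  for (const match of text.matchAll(/\s+|\S+/g)) {
    // "end" is exclusive, so it can be passed directly to substring().
    yield { start: match.index, end: match.index + match[0].length };
  }
}

for (const { start, end } of segmentsOf("ab  cd")) {
  console.log(start, end);
}
// → 0 2
// → 2 4
// → 4 6
```

An iterator like this only supports sequential, forward access; random access would still need something like segmentContaining().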

> 13) "Anything else defaults to “word”." should be "Anything else results in a 
> RangeError exception."

Okay.
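
A sketch of the validation this implies. The list of accepted values is illustrative only (just the two types named in this thread), `resolveSegmentType` is a hypothetical name, and treating `undefined` as "word" is an assumption:

```javascript
// Illustrative subset; the strawman defines the real list of segment types.
const KNOWN_SEGMENT_TYPES = ["word", "lineBreak"];

function resolveSegmentType(value) {
  if (value === undefined) {
    return "word"; // assumed default when the option is omitted
  }
  const name = String(value);
  if (!KNOWN_SEGMENT_TYPES.includes(name)) {
    // Reject unknown values instead of silently falling back to "word".
    throw new RangeError("Invalid segmentType: " + name);
  }
  return name;
}
```

Throwing here matches how ECMA-402 treats other invalid option values, so a typo such as "wrod" surfaces immediately.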

--Rich

