On 23/10/17 16:25, Rustom Mody wrote:
> On Monday, October 23, 2017 at 1:15:35 PM UTC+5:30, Steve D'Aprano wrote:
>>
>> and more. Many linguists also include digraphs (pairs of letters) like the
>> English "th", "sh", "qu", or "gh" as graphemes.
>>
>>
>> https://www.thoughtco.com/what-is-a-grapheme-1690916
>>
>> https://en.wikipedia.org/wiki/Grapheme
>
> Um… Ok So I am using the wrong word? Your first link says:
> | For example, the word 'ghost' contains five letters and four graphemes
> | ('gh,' 'o,' 's,' and 't')
>
> Whereas new regex findall does:
>
> >>> findall(r'\X', "ghost")
> ['g', 'h', 'o', 's', 't']
> >>> findall(r'\X', "church")
> ['c', 'h', 'u', 'r', 'c', 'h']
>
The definition of a "grapheme" in the Unicode standard does not necessarily
line up with the linguistic definition of a grapheme for any particular
language. Even if we assumed that there was a universally agreed definition
of the term for every written language (for English there certainly isn't),
you'd need dictionaries and information on which language you're dealing
with to pull this trick off.

As an example to illustrate why you'd need dictionaries:

In Dutch, "ij" (the "long IJ", as opposed to the "Greek Y") is generally
considered a single letter, or at the very least a single grapheme. There is
a Unicode codepoint for it (ĳ, U+0133), but it isn't widely used. So "vrij"
(free) has three graphemes (v r ij) and three or four letters.

However, in "bijectie" (bijection), "i" and "j" are two separate graphemes,
so this word has eight letters and seven or eight graphemes. ("ie" may or
may not be one single grapheme...)

-- Thomas

PS: This may not be obvious to you at first unless you're Dutch.
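
To make the dictionary point concrete, here is a minimal sketch of what a language-aware splitter might look like. Everything in it is an assumption for illustration: the names DUTCH_DIGRAPHS, DUTCH_EXCEPTIONS and dutch_graphemes are made up, and the two tiny tables stand in for the real word list you would need. It only uses the standard library, not the regex module.

```python
import unicodedata

# Hypothetical toy data, not a real lexicon:
# letter pairs that normally count as one grapheme in Dutch ...
DUTCH_DIGRAPHS = ("ij",)
# ... and words where those letters must stay separate. "ie" is left
# unmerged here, which is one of the two readings mentioned above.
DUTCH_EXCEPTIONS = {"bijectie": ["b", "i", "j", "e", "c", "t", "i", "e"]}


def dutch_graphemes(word):
    """Split a lowercase word into graphemes using the toy tables above."""
    if word in DUTCH_EXCEPTIONS:
        # Only a dictionary lookup can tell us this "ij" is really i + j.
        return list(DUTCH_EXCEPTIONS[word])
    result = []
    i = 0
    while i < len(word):
        for digraph in DUTCH_DIGRAPHS:
            if word.startswith(digraph, i):
                result.append(digraph)
                i += len(digraph)
                break
        else:
            # No digraph starts here, so take a single character.
            result.append(word[i])
            i += 1
    return result


print(dutch_graphemes("vrij"))      # ['v', 'r', 'ij']
print(dutch_graphemes("bijectie"))  # ['b', 'i', 'j', 'e', 'c', 't', 'i', 'e']

# The dedicated codepoint is a single character; the usual spelling is two.
print(unicodedata.name("\u0133"))   # LATIN SMALL LIGATURE IJ
print(len("\u0133"), len("ij"))     # 1 2
```

Without the exceptions table the splitter would happily report "b ij e c t ie"-style nonsense for "bijectie", which is exactly why a purely rule-based approach like \X cannot give you linguistic graphemes.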