On 2018-10-30 10:01 am, Alexander Kobel wrote:
> This makes me wonder whether the problem is in the backwards
> string-search in string-skip-right or in the substring routine used in
> the make-center-on-word-callback; the only reason I can imagine why
> this pops up is that some blind-to-Unicode slicing cuts some string in
> the middle of a multi-byte Unicode character.
It's a little of both.
As near as I can tell, proper Unicode support was only added in Guile
2.0, so 1.8 thinks of characters purely as 8-bit values. A UTF-8-encoded
string can be represented, but none of the built-in character- or
string-handling routines in 1.8 understand that encoding directly.
For instance, the code you posted contains a mistake in defining the
character set for punctuation symbols. string->list will convert a
string into individual characters, but remember that 1.8 doesn't
understand anything beyond ASCII. As such, the following string:
.?-;,:„“‚‘«»‹›『』「」“”‘’–— */()[]{}|<>!`~&…†‡
gets converted into the following list:
(#\. #\? #\- #\; #\, #\: #\342 #\200 #\236 #\342 #\200 #\234
#\342 #\200 #\232 #\342 #\200 #\230 #\302 #\253 #\302 #\273
#\342 #\200 #\271 #\342 #\200 #\272 #\343 #\200 #\216 #\343
#\200 #\217 #\343 #\200 #\214 #\343 #\200 #\215 #\342 #\200
#\234 #\342 #\200 #\235 #\342 #\200 #\230 #\342 #\200 #\231
#\342 #\200 #\223 #\342 #\200 #\224 #\space #\* #\/ #\( #\)
#\[ #\] #\{ #\} #\| #\< #\> #\! #\` #\~ #\& #\342 #\200
#\246 #\342 #\200 #\240 #\342 #\200 #\241)
The resulting character set is then just the unique individual bytes,
not the original characters, which may have been composed of two or more
bytes:
#<charset {#\space #\! #\& #\( #\) #\* #\, #\- #\. #\/ #\:
#\; #\< #\> #\? #\[ #\] #\` #\{ #\| #\} #\~ #\200
#\214 #\215 #\216 #\217 #\223 #\224 #\230 #\231
#\232 #\234 #\235 #\236 #\240 #\241 #\246 #\253
#\271 #\272 #\273 #\302 #\342 #\343}>
At first glance the result may appear to work, because the logic is
stripping individual bytes from the left and right ends of the string.
When a leading or trailing symbol is in the list, all of its bytes get
stripped and everything looks fine. But if the string contains a
character that merely happens to begin or end with one of those bytes,
that character will be split in the middle.
In your example, "à" is encoded as #\303 #\240. Note that #\240 is in
the character set; it was included because of "†", which is encoded as
#\342 #\200 #\240. If you were to remove "†" from the list of symbols,
you would find that the warning goes away, because #\240 is no longer
being stripped.
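That collision can be demonstrated the same way (again a Python sketch of the byte-wise stripping, not the original callback code): removing trailing "punctuation" bytes eats half of the "à".

```python
# Sketch of the failure mode: strip trailing punctuation byte by byte,
# the way Guile 1.8's string-skip-right effectively does.
punct_bytes = set("†".encode("utf-8"))   # {0o342, 0o200, 0o240}

word = "à".encode("utf-8")               # b'\xc3\xa0'
while word and word[-1] in punct_bytes:
    word = word[:-1]                     # strips the 0o240 half of "à"

print(word)                              # b'\xc3', no longer valid UTF-8
try:
    word.decode("utf-8")
except UnicodeDecodeError:
    print("mangled: the trailing byte of à was stripped")
```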
> Does anyone have a hint how to approach this one? (Or is the answer
> just: be patient and hope for Guile v2?)
The only hint here is to replace the built-in functions with ones that
understand UTF-8 encoding and can perform the work needed. Someone
online may well have done this work already, which would save you from
doing it yourself.
Otherwise, the basic strategy is to replace string->list with a version
that decodes UTF-8 and returns a list of integers (essentially UTF-32
code points). All of the string work is then done on these lists of
integers instead. (The character set would likewise be a set of
integers representing the unique Unicode code points.) Once you have
found the subsets of the list that are interesting to measure, convert
them back into strings, which means encoding back into UTF-8.
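As a rough sketch of that strategy (in Python rather than Scheme, and leaning on the language's built-in codec in place of the hand-written UTF-8 decode/encode a Guile 1.8 version would need; strip_punct is a hypothetical name):

```python
# Sketch: decode once to code points, do all the work on integers,
# then encode back. Python's str/bytes conversion stands in for the
# manual UTF-8 decoder/encoder described above.
SYMBOLS = ".?-;,:„“‚‘«»‹›『』「」“”‘’–— */()[]{}|<>!`~&…†‡"
punct = {ord(c) for c in SYMBOLS}        # set of code points, not bytes

def strip_punct(s: str) -> str:
    cps = [ord(c) for c in s]            # string->list, but on code points
    while cps and cps[0] in punct:
        cps.pop(0)                       # strip leading punctuation
    while cps and cps[-1] in punct:
        cps.pop()                        # strip trailing punctuation
    return "".join(map(chr, cps))        # encode back to a string

print(strip_punct("«Voilà»"))            # -> Voilà, the à survives intact
```

Because membership is tested on whole code points, #\240 by itself is never a candidate for stripping and "à" comes through unharmed.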
-- Aaron Hill
_______________________________________________
lilypond-user mailing list
lilypond-user@gnu.org
https://lists.gnu.org/mailman/listinfo/lilypond-user