On 2018-10-30 10:01 am, Alexander Kobel wrote:
> This makes me wonder whether the problem is in the backwards
> string-search in string-skip-right or in the substring routine used in
> the make-center-on-word-callback; the only reason I can imagine why
> this pops up is that some blind-to-Unicode slicing cuts some string in
> the middle of a multi-byte Unicode character.
It's a little of both.
As near as I can tell, proper Unicode support was only added in Guile
2.0, so 1.8 thinks of characters purely as 8-bit values. A UTF-8-encoded
string can be represented, but none of the built-in character- or
string-handling routines in 1.8 understand that encoding directly.
For instance, the code you posted contains a mistake in defining the
character set for punctuation symbols. string->list will convert a
string into individual characters, but remember that 1.8 doesn't
understand anything beyond ASCII. As such, the following string:
.?-;,:„“‚‘«»‹›『』「」“”‘’–— */()[]{}|<>!`~&…†‡
gets converted into the following list:
(#\. #\? #\- #\; #\, #\: #\342 #\200 #\236 #\342 #\200 #\234
#\342 #\200 #\232 #\342 #\200 #\230 #\302 #\253 #\302 #\273
#\342 #\200 #\271 #\342 #\200 #\272 #\343 #\200 #\216 #\343
#\200 #\217 #\343 #\200 #\214 #\343 #\200 #\215 #\342 #\200
#\234 #\342 #\200 #\235 #\342 #\200 #\230 #\342 #\200 #\231
#\342 #\200 #\223 #\342 #\200 #\224 #\space #\* #\/ #\( #\)
#\[ #\] #\{ #\} #\| #\< #\> #\! #\` #\~ #\& #\342 #\200
#\246 #\342 #\200 #\240 #\342 #\200 #\241)
The resulting character set is then just the unique individual bytes,
not the original characters, which may have been composed of two or more
bytes:
#<charset {#\space #\! #\& #\( #\) #\* #\, #\- #\. #\/ #\:
#\; #\< #\> #\? #\[ #\] #\` #\{ #\| #\} #\~ #\200
#\214 #\215 #\216 #\217 #\223 #\224 #\230 #\231
#\232 #\234 #\235 #\236 #\240 #\241 #\246 #\253
#\271 #\272 #\273 #\302 #\342 #\343}>
At first glance the result may appear to work, because the logic is
stripping individual bytes from the left and right ends of the string.
When a leading or trailing symbol is in the list, all of its bytes get
stripped and everything looks fine. But if the string contains a
character that merely happens to begin or end with one of those bytes,
that character will be split in the middle.
In your example, "à" is encoded as #\303 #\240. Note that #\240 is in
the character set; it was included because of "†", which is encoded as
#\342 #\200 #\240. If you were to remove "†" from the list of symbols,
you would find that the warning goes away, because #\240 is no longer
being stripped.
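That collision can be demonstrated the same way (again a Python sketch of the byte-wise stripping, not the original callback code): removing trailing "punctuation" bytes eats half of the "à".

```python
# Sketch of the failure mode: strip trailing punctuation byte by byte,
# the way Guile 1.8's string-skip-right effectively does.
punct_bytes = set("†".encode("utf-8"))   # {0o342, 0o200, 0o240}

word = "à".encode("utf-8")               # b'\xc3\xa0'
while word and word[-1] in punct_bytes:
    word = word[:-1]                     # strips the 0o240 half of "à"

print(word)                              # b'\xc3', no longer valid UTF-8
try:
    word.decode("utf-8")
except UnicodeDecodeError:
    print("mangled: the trailing byte of à was stripped")
```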
> Does anyone have a hint how to approach this one? (Or is the answer
> just: be patient and hope for Guile v2?)
The only hint here is to replace the built-in functions with ones that
understand UTF-8 encoding and can perform the work needed. Someone
online may well have done this work already, which would save you from
doing it yourself.
Otherwise, the basic strategy is to replace string->list with a version
that decodes UTF-8 and returns a list of integers (essentially UTF-32
code points). All of the string work is then done on these lists of
integers instead. (The character set would likewise be a set of
integers representing the unique Unicode code points.) Once you have
found the subsets of the list that are interesting to measure, convert
them back into strings, which means encoding back into UTF-8.
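As a rough sketch of that strategy (in Python rather than Scheme, and leaning on the language's built-in codec in place of the hand-written UTF-8 decode/encode a Guile 1.8 version would need; strip_punct is a hypothetical name):

```python
# Sketch: decode once to code points, do all the work on integers,
# then encode back. Python's str/bytes conversion stands in for the
# manual UTF-8 decoder/encoder described above.
SYMBOLS = ".?-;,:„“‚‘«»‹›『』「」“”‘’–— */()[]{}|<>!`~&…†‡"
punct = {ord(c) for c in SYMBOLS}        # set of code points, not bytes

def strip_punct(s: str) -> str:
    cps = [ord(c) for c in s]            # string->list, but on code points
    while cps and cps[0] in punct:
        cps.pop(0)                       # strip leading punctuation
    while cps and cps[-1] in punct:
        cps.pop()                        # strip trailing punctuation
    return "".join(map(chr, cps))        # encode back to a string

print(strip_punct("«Voilà»"))            # -> Voilà, the à survives intact
```

Because membership is tested on whole code points, #\240 by itself is never a candidate for stripping and "à" comes through unharmed.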
-- Aaron Hill
_______________________________________________
lilypond-user mailing list
lilypond-user@gnu.org
https://lists.gnu.org/mailman/listinfo/lilypond-user