On Fri, Jan 07, 2011 at 02:38:49PM +0100, Andreas Delmelle wrote:
> On 07 Jan 2011, at 14:17, Simon Pepping wrote:
>
> Hi Simon,
>
> > On Fri, Jan 07, 2011 at 07:31:07AM -0500, [email protected] wrote:
> >> So, if no one objects, I will apply the patch as proposed. FOP will no longer
> >> crash, but simply show a '#' for such unassigned codepoints in the output.
> >> Treating them as regular alphabetic characters seems to be safe enough for the
> >> time being.
> >
> > Would it not be better to use character FFFD, 'Replacement Character',
> > �, for this?
>
> Interesting. In the context of linebreaking, that comes down to basically the
> same thing.
>
> U+FFFD has linebreak class 'AI' or 'Ambiguous', which is currently also
> converted to 'Alphabetic' as part of the initial conversions.
>
> Are you suggesting that we substitute the codepoint in the actual text
> content (rather than leave it there, and further rely on the default
> treatment of 'missing glyphs')?

I had not thought that far yet. I was reflecting on the use of '#' as the
replacement character for missing glyphs. Is that not particular to FOP, and
should we not conform to Unicode and use the Unicode replacement character in
such situations?

Actually replacing the character in the text would be going very far. A
missing glyph usually depends on the chosen font, while the character itself
is quite valid. In this case, however, the character itself is invalid, in the
sense that the code point has not been assigned to a character in Unicode.
(The bug report calls 1F7E a Greek extended character, but the Unicode chart
for Greek extended characters, http://www.unicode.org/charts/PDF/U1F00.pdf,
shows no character assignment for this code point.) That means that it does
not even have properties, such as a linebreaking class. Using class
'Ambiguous' seems the right solution for that problem.

Simon
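
To make the distinction concrete, here is a minimal sketch in Python rather than FOP's own Java. It is not FOP code: the names is_unassigned and linebreak_class, and the known_classes table, are hypothetical and only illustrate the idea. An unassigned code point such as U+1F7E has general category 'Cn', so it can be detected independently of any font. It is then given class 'AI', which the initial conversion folds into 'AL' just as it does for U+FFFD.

```python
import unicodedata

AMBIGUOUS = "AI"    # linebreak class 'Ambiguous'
ALPHABETIC = "AL"   # linebreak class 'Alphabetic'


def is_unassigned(ch: str) -> bool:
    """True if the code point has no character assigned in the Unicode
    version shipped with this Python (general category 'Cn')."""
    return unicodedata.category(ch) == "Cn"


def linebreak_class(ch: str, known_classes: dict) -> str:
    """Return the linebreak class used for breaking decisions.

    known_classes is a hypothetical lookup table (character -> class);
    a real implementation would derive it from Unicode's LineBreak.txt.
    """
    if is_unassigned(ch):
        # Assumption: an unassigned code point is treated as 'Ambiguous'.
        cls = AMBIGUOUS
    else:
        # Assumption: characters missing from the table fall back the same way.
        cls = known_classes.get(ch, AMBIGUOUS)
    # Initial conversion described in the thread: 'AI' is handled as 'AL'.
    return ALPHABETIC if cls == AMBIGUOUS else cls
```

The text content itself is left untouched here. Only the class used for linebreaking changes, and which glyph ends up in the output ('#' or U+FFFD) remains a separate decision.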
