On Fri, Sep 14, 2012 at 5:46 AM, Richard Hipp <d...@sqlite.org> wrote:
>
> Detection of embedded non-printing characters, especially U+0000, would be
> nice.
>
> Should we insist on a BOM at the beginning of the file?
>

I don't think a BOM should be mandatory, as it is not required by Unicode.

Another thought: Unicode characters have a General Category. The categories Letter, Mark, Number, Punctuation, Symbol & Separator are obviously useful in the context of Unicode encoded text files. That leaves the "Other" categories:

1. "Other, control": Some of these are useful in the context of text files (tab, CR, LF, FF among others which are used to format text). Some not so much. I'd think that however fossil currently classifies such characters in 8-bit text is how they should be treated in Unicode text.

2. "Other, format": I believe these should be included, even though I think they'd be rare in the types of files fossil would need to diff.

3. "Other, surrogate": If these appear in an invalid context in a document, it is not well formed UTF-16.

4. "Other, not assigned [Noncharacter]": If these appear anywhere in a document, it is not well formed UTF-16.

5. "Other, private use": Should we allow these in a file that fossil might diff? Since they are private-use there is no 'standard' way of displaying them, but people may legitimately want or need them in their private documents.

6. "Other, not assigned [Reserved]": These are perfectly valid code points, and over time more of them will be allocated to other categories. We can either allow them, knowing they won't be displayable now but might be later, or restrict them until some future time. Either way the code has to be kept up to date with future Unicode revisions, as a new code point could be allocated to a printable character or (theoretically) to a non-printable control code.

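To make the six "Other" cases concrete, here is a rough sketch in Python (not fossil code, and not a proposal for how fossil should implement it). It leans on the standard `unicodedata` module for the General Category; the label strings, the `classify` and `is_noncharacter` helpers, and the set of "acceptable" control characters are all assumptions made up for illustration. The result for currently unassigned code points depends on the Unicode version bundled with the interpreter, which is the same maintenance concern as in point 6.

```python
import unicodedata

# Control characters treated as ordinary text formatting (tab, LF, FF, CR).
# This set is an assumption for the sketch; fossil's 8-bit rules may differ.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0C, 0x0D}


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp: int) -> str:
    """Return a coarse, hypothetical label for one code point."""
    cat = unicodedata.category(chr(cp))
    if not cat.startswith("C"):
        return "text"            # Letter, Mark, Number, Punctuation, Symbol, Separator
    if cat == "Cc":
        return "control-ok" if cp in ALLOWED_CONTROLS else "control-suspect"
    if cat == "Cf":
        return "format"
    if cat == "Cs":
        return "surrogate"
    if cat == "Co":
        return "private-use"
    # Remaining category is "Cn": either a noncharacter or a reserved code point.
    return "noncharacter" if is_noncharacter(cp) else "reserved"
```

For example, `classify(0x08)` lands in the "control-suspect" bucket while `classify(0x09)` does not, which mirrors the tab/CR/LF/FF distinction in point 1.
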
I'm assuming the desire for identifying UTF-16 and converting to UTF-8 is *only* for the purpose of recognizing such files as diffable text vs binary files (though you might also want to use this to identify a file as Unicode text vs 8-bit text vs binary when adding it to the repository).

If that assumption is correct, it would be possible to "expand" normally non-printable characters into printable sequences in otherwise valid UTF-16 buffers. For example: let's say a line of text in a file has an embedded backspace character (U+0008). We could say it is not a text file for that reason, or we could expand the normally non-printing backspace into the literal text "U+0008". That might be more valuable in the context of diffing files, and it causes no harm, in that the file as stored in the file system or the repository remains unchanged.

Hopefully I'm not rambling and these thoughts are sufficiently coherent as to make sense...

SDR
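
A rough sketch of the expansion idea, in Python for brevity rather than as fossil code. The function name `expand_nonprinting` and the set of control characters left untouched are assumptions for illustration. It only rewrites the in-memory text handed to the diff; the stored bytes are never modified.

```python
import unicodedata

# Control characters left as-is because they format text (assumed set).
KEEP = {"\t", "\n", "\r", "\f"}


def expand_nonprinting(raw: bytes) -> str:
    """Decode a UTF-16 buffer and spell out other control characters as 'U+XXXX'.

    The "utf-16" codec honours a leading BOM and falls back to the platform's
    native byte order without one. It raises UnicodeDecodeError on ill-formed
    input such as an unpaired surrogate, which a caller could treat as binary.
    """
    text = raw.decode("utf-16")
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cc" and ch not in KEEP:
            out.append(f"U+{ord(ch):04X}")   # e.g. backspace becomes "U+0008"
        else:
            out.append(ch)
    return "".join(out)
```

The returned string can then be encoded as UTF-8 with `.encode("utf-8")` before being passed to an 8-bit diff routine.
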