On Fri, Sep 14, 2012 at 5:46 AM, Richard Hipp <d...@sqlite.org> wrote:

>
> Detection of embedded non-printing characters, especially U+0000, would be
> nice.
>
> Should we insist on a BOM at the beginning of the file?
>

I don't think a BOM should be mandatory, as it is not required by Unicode.
Another thought:

Unicode characters have a General Category. The categories Letter, Mark,
Number, Punctuation, Symbol & Separator are obviously useful in the context
of Unicode encoded text files. That leaves the "Other" categories:

1. "Other, control": Some of these are useful in the context of text files
(tab, CR, LF, FF among others which are used to format text). Some not so
much. I'd think that however fossil currently classifies such characters
in 8-bit text is how they should be treated for Unicode as well.

2. "Other, format": I believe these should be included even though I think
they'd be rare in the types of files fossil would need to diff.

3. "Other, surrogate": If these appear in an invalid context in a document
(unpaired, or a low surrogate ahead of its high surrogate) it is not well
formed UTF-16.

4. "Other, not assigned [Noncharacter]": These are not intended for open
interchange, so finding one suggests the file is not ordinary text, although
strictly speaking they do not make the UTF-16 ill-formed.

5. "Other, private use": Should we allow these in a file that fossil might
diff? Since they are private-use there is no 'standard' way of displaying
them, but people may legitimately want or need them in their private
documents.

6. "Other, not assigned [Reserved]": These are perfectly valid code points,
and over time more of them will be allocated to other categories. We can
either allow them knowing they won't be displayable now but might be later
or restrict them until some future time. Either way fossil's code has to be
kept up to date with future Unicode revisions, as a reserved code point could
be assigned to a printable character or (theoretically) to a non-printable
control code.
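
To make the six cases above concrete, here is a rough sketch in Python (not
fossil's actual code, which is C) that sorts a character into one of those
buckets using the standard unicodedata module. The function name, the set of
"text" controls and the bucket labels are all my own assumptions, and the
results depend on the Unicode version bundled with the Python in use.

```python
import unicodedata

# Assumption: the control characters treated as ordinary text formatting.
TEXT_CONTROLS = {"\t", "\n", "\r", "\f"}


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(ch):
    """Return a hypothetical bucket label for a single character."""
    cat = unicodedata.category(ch)  # two-letter General Category, e.g. "Lu"
    if cat == "Cc":
        return "control-text" if ch in TEXT_CONTROLS else "control-other"
    if cat == "Cf":
        return "format"
    if cat == "Cs":
        return "surrogate"      # only seen when a surrogate is unpaired
    if cat == "Co":
        return "private-use"
    if cat == "Cn":
        return "noncharacter" if is_noncharacter(ord(ch)) else "reserved"
    return "graphic"            # Letter, Mark, Number, Punctuation, ...
```

What to do with each bucket (accept, reject, or escape) is the policy
question raised in the list; the sketch only does the sorting.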

I'm assuming the desire for identifying UTF-16 and converting to UTF-8 is
*only* for the purposes of recognizing them as diffable text vs binary
files (though you might also want to use this to identify a file as Unicode
text vs 8-bit text vs binary when adding them to the repository). If that
assumption is correct, it would be possible to "expand" normally
non-printable characters into printable sequences in otherwise valid UTF-16
buffers. For example:

Let's say a line of text in a file has an embedded backspace character
(U+0008). We could say it is not a text file for that reason or we could
expand the normally non-printing backspace into literal text "U+0008". That
might be more valuable in the context of diffing files, and it causes no
harm in that the file as stored in the file system or the repository
remains unchanged.
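
A minimal sketch of that expansion, again in Python and only as an
illustration: it assumes the buffer has already been decoded from UTF-16,
and the function name and the set of controls left untouched are my own
choices rather than anything fossil does today.

```python
import unicodedata

# Assumption: controls that are normal text formatting and stay as they are.
KEEP = {"\t", "\n", "\r", "\f"}


def expand_nonprinting(text):
    """Replace other control characters with literal "U+XXXX" text."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cc" and ch not in KEEP:
            out.append("U+%04X" % ord(ch))  # e.g. backspace -> "U+0008"
        else:
            out.append(ch)
    return "".join(out)


# Decode a UTF-16 buffer, then expand only the copy handed to the diff.
raw = "ab\x08c\n".encode("utf-16-le")
shown = expand_nonprinting(raw.decode("utf-16-le"))
```

Only the string passed to the diff display is altered; the bytes in `raw`
are never modified.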

Hopefully I'm not rambling and these thoughts are sufficiently coherent as
to make sense...

SDR