From: Trausti Thor Johannsson <[EMAIL PROTECTED]>
Date: Fri, 21 Jul 2006 15:53:41 +0200

> Is there any way for me to check and see if a text file is "safe to
> display"? That is, it would not be a picture inside the text file?
> Not encrypted and pretty much just a plain text file?
>
> To complicate matters, the file would be Unicode and so forth.

Actually, that could simplify matters :)

Unicode has many code points that are not allowed, and some serialisations that are not allowed either. UTF-8 in particular is very easy to check for validity. UTF-16 much less so.

My ElfData plugin has a function, .Scan_Verify, which returns an integer giving the position of the first bad byte in a UTF-8 string. If the entire string is good, .Scan_Verify returns 0.

I'd imagine that almost no picture or other media file will validate as UTF-8.
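To illustrate the idea outside of ElfData, here is a minimal Python sketch of the same kind of check. It is not how .Scan_Verify is implemented; the function name first_bad_utf8_byte is made up for this example, and the 1-based position (so that 0 can mean "all good") is my assumption about the convention.

```python
def first_bad_utf8_byte(data: bytes) -> int:
    """Return the 1-based position of the first byte that breaks UTF-8
    validity, or 0 if the whole buffer is valid UTF-8.

    The 1-based convention is an assumption, chosen so that 0 can
    signal "no bad byte", as described above.
    """
    try:
        # Python's strict decoder rejects stray continuation bytes,
        # overlong forms, encoded surrogates and truncated sequences.
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the 0-based offset of the offending byte.
        return err.start + 1
    return 0


if __name__ == "__main__":
    print(first_bad_utf8_byte("plain text".encode("utf-8")))
    print(first_bad_utf8_byte(b"\x89PNG\r\n\x1a\n"))  # start of a PNG file
```

A PNG file fails on its very first byte, since 0x89 is a continuation byte with nothing before it. Most other binary formats should fail within the first few bytes for similar reasons.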

ElfData.Scan_Verify doesn't check for byte 0, however. Character 0 is actually a valid Unicode character, although it is a non-textual character. Unicode doesn't say that control codes can't be used in a Unicode string.

.Scan_Verify only validates text according to the Unicode standard's definition of a Unicode string. It doesn't validate it according to what we think a piece of text should be. Text doesn't usually contain control codes, except for LF, CR and TAB.
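If you want the stricter "what we think text should be" test, you have to add it yourself on top of the UTF-8 check. A rough Python sketch, assuming that "control code" means Unicode general category Cc (which also covers byte 0 and the C1 range U+0080 to U+009F); the name looks_like_plain_text is invented for this example and has nothing to do with ElfData:

```python
import unicodedata

# Control characters that ordinary text files are expected to contain.
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def looks_like_plain_text(data: bytes) -> bool:
    """Return True if data is valid UTF-8 and contains no control
    characters other than LF, CR and TAB."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8 at all
    # Category "Cc" is an assumption about what counts as a control code.
    return not any(
        unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS
        for ch in text
    )


if __name__ == "__main__":
    print(looks_like_plain_text(b"line one\r\n\tline two\n"))
    print(looks_like_plain_text(b"abc\x00def"))
```

Whether to allow other controls such as form feed is a policy decision. Add them to ALLOWED_CONTROLS if your files use them.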

--
http://elfdata.com/plugin/


