On Mon, Sep 19, 2016 at 03:30:09AM -0700, Colomban Wendling wrote:
> It's pretty messy but fair enough.  However, we probably won't do
> that, because being able to have a fixed encoding in the data we load
> means that we have to handle encoding conversion in a single place,
> instead of everywhere something touches the data -- and there is a
> lot of code that does that, it's an editor after all.

> Also, as UTF-8 can represent virtually any textual data (anything
> inside Unicode), it would only help with invalid input (like here) or
> binary data (which probably would better be handled with a hex
> filter).  So I'm afraid it won't happen.

> If someone has a nice solution though, I'd love to be proven wrong.

Well, thinking about it, if this were a wanted feature, I would do it as
follows:

- have the raw valid text as a UTF-8 (of course) "linear array"
  (might be a window onto disk for large files, etc)

- indexing layers above this, to quickly identify graphemes, word
  boundaries, line boundaries and any other points of interest, such as:

  - "invalid byte insertion points" along with the corresponding invalid
    byte sequences
    - this way, those parts of the program (most of them) that don't want
      or need to handle invalid bytes don't have to
    - and you have an easy index to re-insert the invalid sequences on
      saving, or for some display/view onto the file that can represent
      invalid bytes
    - and you can offer easy options to the user such as "save without
      invalid bytes", "encode invalid bytes according to some format", etc.
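
To make the idea concrete, here is a rough Python sketch of the split and
of the re-insertion on save. The names split_invalid and reassemble are
made up for illustration and this is not how Geany does anything today; it
only assumes the standard UTF-8 decoder and the start/end positions it
reports on UnicodeDecodeError. Offsets in the index are byte offsets into
the clean buffer.

```python
from typing import List, Tuple

# Hypothetical index entry: (byte offset into the clean buffer, invalid bytes)
InvalidIndex = List[Tuple[int, bytes]]


def split_invalid(raw: bytes) -> Tuple[bytes, InvalidIndex]:
    """Split raw input into valid UTF-8 plus a side index of invalid runs."""
    clean = bytearray()
    invalid: InvalidIndex = []
    pos = 0
    while pos < len(raw):
        try:
            # Slicing copies the tail each time; fine for a sketch only.
            raw[pos:].decode("utf-8")
            clean += raw[pos:]
            break
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice being decoded.
            clean += raw[pos:pos + err.start]
            bad = raw[pos + err.start:pos + err.end]
            offset = len(clean)
            if invalid and invalid[-1][0] == offset:
                # No valid bytes since the previous run: merge the two runs.
                invalid[-1] = (offset, invalid[-1][1] + bad)
            else:
                invalid.append((offset, bad))
            pos += err.end
    return bytes(clean), invalid


def reassemble(clean: bytes, invalid: InvalidIndex) -> bytes:
    """Re-insert the invalid runs at their recorded insertion points."""
    out = bytearray()
    prev = 0
    for offset, bad in invalid:
        out += clean[prev:offset]
        out += bad
        prev = offset
    out += clean[prev:]
    return bytes(out)


if __name__ == "__main__":
    raw = b"caf\xc3\xa9 \xff\xfe ok \xe2\x82"
    clean, invalid = split_invalid(raw)
    print(clean.decode("utf-8"), invalid)
    print(reassemble(clean, invalid) == raw)
```

"Save without invalid bytes" is then just writing out the clean buffer,
and a real editor would also have to shift the recorded offsets whenever
the text before them is edited, which this sketch does not attempt.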

Should be easy, and should also be how the program is implemented.

At least, that's how a superior programmer would implement it ;)

See for reference:

Regards, and thanks again,
