On 27 September 2017 at 16:16,  <p...@cpan.org> wrote:
> On Tuesday 26 September 2017 14:20:33 Night Light wrote:
>> That's a nifty function. Good to know that it can be reversed.
>
> UTF-8 encode is a function which, for any number in the range
> 0..1114111, assigns a unique sequence of numbers in the range 0..255.
>
> Therefore this function has a well-defined inverse: the UTF-8 decode
> function.
>
> Since the UTF-8 encode function turns a sequence of numbers from the
> range 0..1114111 into a (possibly longer) sequence of numbers in the
> range 0..255, its output can again be used as input for the UTF-8
> encode function.
>
> And because UTF-8 encode has a well-defined inverse, you can just as
> easily construct the inverse of a composition of several UTF-8 encode
> calls.
>
> Take any string $str; the following holds:
>
> decode('UTF-8', decode('UTF-8', encode('UTF-8', encode('UTF-8', $str)))) eq $str;
>
> To get exactly the correct result, you just need to know how many
> times the UTF-8 encode function was applied.

And in practice you don't need to know this at all, because once you
encounter a byte sequence that is not valid UTF-8 you know you are
done. The "seekable" nature of the octets in UTF-8 means this type of
heuristic has a relatively low error rate. To be invalid, all your data
needs is two bytes in a row with the top two bits set, or a byte with
the top two bits set followed by a byte without the top bit set, or
various other combinations. Ending up with valid UTF-8 data that is not
actually UTF-8 is IMO extremely unlikely in sane coding scenarios.
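
To make those byte patterns concrete, here is a minimal sketch that
checks a few short byte strings with Encode's strict 'UTF-8' decoder.
The helper name is_valid_utf8 is hypothetical (it does not come from
this thread), and the sample strings are chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if $octets is a well-formed UTF-8 byte sequence.
sub is_valid_utf8 {
    my ($octets) = @_;
    # With FB_CROAK, decode() dies on malformed input and may modify its
    # argument, so only the local copy in $octets is passed in.
    return defined eval { decode('UTF-8', $octets, Encode::FB_CROAK); 1 };
}

my @samples = (
    [ "two lead bytes in a row",  "\xC3\xC3" ],
    [ "lead byte, then ASCII",    "\xC3\x41" ],
    [ "lead byte + continuation", "\xC3\xA9" ],
);

for my $sample (@samples) {
    my ($label, $octets) = @$sample;
    printf "%-26s %s\n", $label, is_valid_utf8($octets) ? "valid" : "invalid";
}
```

The first two samples are the patterns described above and are
rejected; only the last one is a well-formed sequence.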

I have used that function to clean up multiply encoded data a number
of times on DBs that have been affected by this kind of encoding bug,
and I have never encountered a scenario where I needed to know how
many times the data was encoded, nor one where data was corrupted.
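
A minimal sketch of the decode-until-invalid idea, not the function
referred to above: decode_repeatedly is a hypothetical name, and the
sketch assumes its input is a byte string that was UTF-8 encoded one
or more times. It stops when the data no longer decodes, stops
changing, or contains characters above 255.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: peel off UTF-8 layers until none are left.
sub decode_repeatedly {
    my ($octets) = @_;
    my $text = $octets;
    # Only strings whose characters all fit in a byte can be decoded again.
    while ($text !~ /[^\x00-\xFF]/) {
        my $copy = $text;    # decode() with a CHECK value may modify its input
        my $next = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
        last if !defined $next;    # not valid UTF-8: $text is the result
        last if $next eq $text;    # pure ASCII: nothing left to peel off
        $text = $next;
    }
    return $text;
}

# Usage: encode a sample string three times, then recover it.
my $original = "caf\x{e9} \x{263A}";
my $mangled  = $original;
$mangled = encode('UTF-8', $mangled) for 1 .. 3;
print decode_repeatedly($mangled) eq $original ? "recovered\n" : "mismatch\n";
```

Note that the caller never passes in the number of encoding passes;
the loop discovers it from the data, which is exactly what makes this
a heuristic.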

Of course I recognize that this is a *heuristic*, but unless you are
doing crazy things, or are *mega* unlucky, you are not going to corrupt
data with a function like that. UTF-8 is too well designed for that.

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
