On 27 September 2017 at 16:16, <p...@cpan.org> wrote:
> On Tuesday 26 September 2017 14:20:33 Night Light wrote:
>> That's a nifty function. Good to know that it can be reversed.
>
> UTF-8 encode is a function which, for any number from the range
> 0..1114111, assigns a unique sequence of numbers from 0..255.
>
> Therefore this function has a well-defined inverse: the UTF-8 decode
> function.
>
> As a sequence of numbers from the range 0..1114111 passed through the
> UTF-8 encode function produces a sequence of numbers in the range
> 0..255 (the sequence would be longer), it can again be used as input
> for the UTF-8 encode function.
>
> And because the output of UTF-8 encode has a well-defined inverse, you
> can easily reconstruct the inverse of a composition of several UTF-8
> encode calls as well.
>
> Take a string $str; the following holds:
>
> decode('UTF-8', decode('UTF-8', encode('UTF-8', encode('UTF-8', $str)))) eq
> $str;
>
> To get exactly the correct result, you just need to know how many
> times the UTF-8 encode function was applied.
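
The identity quoted above can be checked with a short script. This is
only a sketch: round_trip() is a hypothetical helper name, not something
from the thread, and it assumes the number of encode passes ($n) is
known in advance.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: apply UTF-8 encode $n times, then UTF-8 decode
# $n times, and return the result.
sub round_trip {
    my ($str, $n) = @_;
    my $tmp = $str;
    $tmp = encode('UTF-8', $tmp) for 1 .. $n;   # each pass yields octets
    $tmp = decode('UTF-8', $tmp) for 1 .. $n;   # each pass undoes one layer
    return $tmp;
}

# A character string with one Latin-1 and one wide character.
my $str = "caf\x{e9} \x{263A}";
print round_trip($str, 2) eq $str ? "ok\n" : "mismatch\n";
```
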

And in practice you don't need to know this at all: once you encounter
a byte sequence that is not valid UTF-8, you know you are done.

The "seekable" nature of the octets in UTF-8 means this type of
heuristic has a relatively low error rate. For the data to be rejected
as invalid, all it has to contain is two bytes in a row with the top
two bits set, or a byte with the top two bits set immediately followed
by a byte without the top bit set, or various other combinations.
Ending up with valid UTF-8 data without it being actual UTF-8 data is
IMO extremely unlikely in sane coding scenarios.

I have used that function to clean up multiply-encoded data a number of
times on DBs that have been affected by this kind of encoding bug, and
I have never encountered a scenario where I needed to know how many
times the data was encoded, nor one where data was corrupted.

Of course I recognize that this is a *heuristic*, but unless you are
doing crazy things, or are *mega* unlucky, you are not going to corrupt
data with a function like that. UTF-8 is too well designed for that.

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
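
A minimal sketch of the decode-until-invalid heuristic described above.
This is not the function discussed earlier in the thread; the name
undo_multiple_encoding() and the two early-exit checks are assumptions
made for illustration. It relies on decode() with FB_CROAK dying on
malformed input.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_CROAK);

# Hypothetical helper: peel off layers of UTF-8 encoding until the data
# is no longer valid UTF-8, and return the last successful result.
sub undo_multiple_encoding {
    my ($octets) = @_;
    my $current = $octets;
    while (1) {
        # Pure ASCII decodes to itself, so nothing more can change.
        last if $current !~ /[^\x00-\x7F]/;
        # A code point above 0xFF cannot be a byte: this is real text.
        last if $current =~ /[^\x00-\xFF]/;
        # decode() with a CHECK value may modify its argument, so
        # work on a copy and keep $current intact on failure.
        my $work = $current;
        my $decoded;
        last unless eval { $decoded = decode('UTF-8', $work, FB_CROAK); 1 };
        $current = $decoded;
    }
    return $current;
}

# Usage: damage a string by encoding it three times, then repair it
# without telling the function how many layers there are.
my $text    = "caf\x{e9}";
my $damaged = $text;
$damaged = encode('UTF-8', $damaged) for 1 .. 3;
print undo_multiple_encoding($damaged) eq $text ? "recovered\n" : "failed\n";
```

The loop always terminates, because every successful decode of
non-ASCII data yields a strictly shorter string. It remains a
heuristic: text whose characters happen to form valid UTF-8 when read
as bytes would be decoded one layer too far.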