Hi,

So far, detecting the encoding with the universal character encoding plugin and using the result to convert to UTF-8 seems to give me a valid UTF-8 string, even though the contents look a bit garbled.
Even though the string looks like

104-06_8P ��〡蜃貳_130111 test G7.svf104-06_8P 正印海德堡104-06_8P 正印

(note the ��; hopefully they come out as a question mark in a diamond shape), I am able to store that string in a PostgreSQL database. Originally I thought those question marks were invalid UTF-8 characters, but since PostgreSQL does not complain about an invalid byte sequence, I assume they must be OK.

I was trying to implement this PHP regex script:

http://stackoverflow.com/questions/1401317/remove-non-utf8-characters-from-string

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]                 # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]      # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    )+                              # ...one or more times
  )
| .                                 # anything else
/x
END;
preg_replace($regex, '$1', $text);

But I got stuck on the hex matching part. I found that setting re.CompileOptionUTF8=false got it to match a hex character, but I could not get a match with multiple characters.

Regards
Lee

On 12 Feb 2013, at 12:51, Christian Schmitz wrote:

>
> On 12.02.2013 at 12:14, Lee Badham wrote:
>
>> More info…
>>
>> To be clear, I want to keep as much of the original string as possible, so
>> if most of the string is in Chinese (but valid UTF-8) that is OK.
>>
>> Some of the files I get have various Chinese encodings, and sometimes the
>> universal parser guesses the encoding wrong.
>>
>
> Well, it should be possible to develop an algorithm which checks UTF-8
> validity and replaces invalid sequences with a filler character.
>
> Greetings
> Christian

Lee Badham
www.bodoni.co.uk | www.presssign.com
_______________________________________________
Mbsplugins_monkeybreadsoftware.info mailing list
https://ml01.ispgateway.de/mailman/listinfo/mbsplugins_monkeybreadsoftware.info
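
A rough sketch of the check-and-replace idea from the thread, offered as an illustration rather than a definitive answer. The question mark in a diamond is U+FFFD (REPLACEMENT CHARACTER), an ordinary code point whose UTF-8 encoding is the bytes EF BF BD, which would explain why PostgreSQL accepts the string. It probably means the conversion step had already substituted bytes it could not map. Python is used purely for illustration; the names is_valid_utf8 and sanitize_utf8 are hypothetical, and the equivalent calls in the plugin environment are not shown.

```python
# Illustrative sketch only: function names are hypothetical, and Python's
# built-in UTF-8 decoder stands in for a hand-written validator.

REPLACEMENT = "\ufffd"  # U+FFFD, shown as a question mark in a diamond


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict mode raises on any invalid sequence
        return True
    except UnicodeDecodeError:
        return False


def sanitize_utf8(data: bytes) -> bytes:
    """Keep every valid sequence and replace each invalid one with U+FFFD."""
    text = data.decode("utf-8", errors="replace")
    return text.encode("utf-8")


if __name__ == "__main__":
    good = "104-06_8P 正印".encode("utf-8")
    damaged = good[:-1]  # drop the final byte, truncating the last character

    print(is_valid_utf8(good))
    print(is_valid_utf8(damaged))

    cleaned = sanitize_utf8(damaged)
    print(is_valid_utf8(cleaned))
    print(cleaned.decode("utf-8"))

    # U+FFFD is itself valid UTF-8, so a database will store it without error.
    print(REPLACEMENT.encode("utf-8") == b"\xef\xbf\xbd")
```

Note that a strict decoder also rejects overlong forms such as C0 80, which the byte-range regex quoted above would let through, so the two approaches are not exactly equivalent.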
