Hi,

So far, detecting the encoding with the universal character encoding plugin 
and using the result to convert to UTF-8 seems to give me a valid UTF-8 
string, even though the contents look a bit garbled.

For example, the string looks like this: 104-06_8P ��〡蜃貳_130111 test G7.svf104-06_8P 
正印海德堡104-06_8P 正印

Note the �� characters; hopefully they come out as question marks in a 
diamond shape.

I am able to store that string in a PostgreSQL database.

Originally I thought those question marks were invalid UTF-8 characters, but 
since PostgreSQL does not complain about an invalid byte sequence, I assume 
the string must be OK.
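
One likely explanation, sketched below in Python rather than Real Studio 
(the byte values in the sample are made up for illustration): the diamond 
question mark is U+FFFD, the Unicode replacement character. It is a perfectly 
valid character that encodes to the three bytes EF BF BD, so a database will 
accept it. It usually appears because a converter has already substituted it 
for bytes it could not decode.

```python
# U+FFFD REPLACEMENT CHARACTER is itself valid UTF-8.
REPLACEMENT = "\ufffd"


def decode_with_replacement(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for undecodable bytes."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical sample: 0xFF and 0xFE can never occur in UTF-8.
    sample = b"104-06_8P \xff\xfe test"
    text = decode_with_replacement(sample)
    # The result re-encodes cleanly, so it is a valid UTF-8 string
    # even though the original bytes were not.
    assert text.encode("utf-8").decode("utf-8") == text
    print(REPLACEMENT.encode("utf-8"))  # b'\xef\xbf\xbd'
```

So the stored string can be valid UTF-8 and still have lost information at 
the positions where the replacement character appears.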

I was trying to implement this regex php script

http://stackoverflow.com/questions/1401317/remove-non-utf8-characters-from-string

$regex = <<<'END'
/
(
(?: [\x00-\x7F]                 # single-byte sequences   0xxxxxxx
|   [\xC0-\xDF][\x80-\xBF]      # double-byte sequences   110xxxxx 10xxxxxx
|   [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences   1110xxxx 10xxxxxx * 2
|   [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequence 11110xxx 10xxxxxx * 3 
)+                              # ...one or more times
)
| .                                 # anything else
/x
END;

$text = preg_replace($regex, '$1', $text);

But I got stuck on the hex matching part. I found that setting 
re.CompileOptionUTF8=false got it to match a single hex character, but I 
could not get a match with multiple characters.
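
For comparison, here is a minimal sketch of the same pattern in Python, 
applied to raw bytes so that each \xNN escape matches exactly one byte. This 
is not the plugin's API; the names VALID_RUNS and strip_invalid are my own. 
Note that the pattern only checks the shape of each sequence, so it still 
lets through overlong forms such as C0 80 and lead bytes F5 to F7, which 
strict UTF-8 forbids.

```python
import re

# Byte-oriented version of the PHP pattern: group 1 captures runs of
# well-shaped sequences, the final "." consumes any other single byte.
VALID_RUNS = re.compile(
    rb"""
    (
      (?: [\x00-\x7F]                 # single-byte sequences
      |   [\xC0-\xDF][\x80-\xBF]      # double-byte sequences
      |   [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences
      |   [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequences
      )+
    )
    | .                               # anything else
    """,
    re.VERBOSE | re.DOTALL,
)


def strip_invalid(data: bytes) -> bytes:
    """Keep the runs captured by group 1 and drop every other byte."""
    return VALID_RUNS.sub(lambda m: m.group(1) or b"", data)


if __name__ == "__main__":
    print(strip_invalid(b"abc\xffdef"))  # b'abcdef'
```

The important part is that the subject is treated as bytes, not characters; 
in UTF-8 mode a class like [\x80-\xBF] refers to code points instead.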

Regards

Lee

On 12 Feb 2013, at 12:51, Christian Schmitz wrote:

> 
> On 12.02.2013 at 12:14, Lee Badham <[email protected]> wrote:
> 
>> More Info…
>> 
>> To be clear, I want to keep as much of the original string as possible, so 
>> if most of the string is in Chinese (but valid UTF-8) that is ok.
>> 
>> Some of the files I get have various Chinese encodings, and sometimes the 
>> universal parser guesses the encoding wrong.
>> 
> 
> Well, it should be possible to develop an algorithm which checks UTF-8 
> validity and replaces invalid sequences with a filler character.
> 
> Greetings
> Christian
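
A rough sketch of such an algorithm, written in Python only for illustration 
(the function name and the "?" filler are my assumptions, not anything from 
the plugin). It walks the bytes, accepts only well-formed sequences following 
the ranges in RFC 3629, and writes one filler for every byte it rejects.

```python
def replace_invalid_utf8(data: bytes, filler: bytes = b"?") -> bytes:
    """Copy well-formed UTF-8 sequences, replace each bad byte with filler."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        lo, hi = 0x80, 0xBF              # allowed range for the second byte
        if b0 < 0x80:
            length = 1
        elif 0xC2 <= b0 <= 0xDF:         # C0/C1 would be overlong forms
            length = 2
        elif 0xE0 <= b0 <= 0xEF:
            length = 3
            if b0 == 0xE0:
                lo = 0xA0                # excludes overlong forms
            elif b0 == 0xED:
                hi = 0x9F                # excludes UTF-16 surrogates
        elif 0xF0 <= b0 <= 0xF4:
            length = 4
            if b0 == 0xF0:
                lo = 0x90                # excludes overlong forms
            elif b0 == 0xF4:
                hi = 0x8F                # excludes values above U+10FFFF
        else:
            length = 0                   # stray continuation or bad lead byte
        seq = data[i:i + length]
        ok = (
            length > 0
            and len(seq) == length       # not truncated at end of input
            and (length == 1 or lo <= seq[1] <= hi)
            and all(0x80 <= c <= 0xBF for c in seq[2:])
        )
        if ok:
            out += seq
            i += length
        else:
            out += filler
            i += 1
    return bytes(out)


if __name__ == "__main__":
    print(replace_invalid_utf8(b"abc\xffdef"))  # b'abc?def'
```

Valid Chinese text passes through untouched, so as much of the original 
string as possible is kept. In Python itself, 
data.decode("utf-8", errors="replace") gives a similar result with U+FFFD as 
the filler.
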
> 
> -- 
> Read our blog about news on our plugins:
> 
> http://www.mbsplugins.de/
> 
> _______________________________________________
> Mbsplugins_monkeybreadsoftware.info mailing list
> [email protected]
> https://ml01.ispgateway.de/mailman/listinfo/mbsplugins_monkeybreadsoftware.info

Lee Badham

www.bodoni.co.uk | www.presssign.com

