Mike Lewis <[email protected]> writes:
> I've run into a fair amount of unicode errors when trying to copy in log
> files. Would you recommend using bytea or another data type instead of text
> or varchar... or at least copying to a staging table with bytea's and
> filtering out invalid rows when moving it to the main table?
My guess is that you're working with data that was originally
represented in UTF16, and you've used a tool that doesn't really know
what it's doing to convert to UTF8. A correct conversion has to reunite
surrogate pairs into wider-than-16-bit Unicode characters and then
encode those as single UTF8 sequences. Dunno if you can easily identify
the culprit, but fixing that conversion is the long-term solution.
(BTW, I should think that iconv or some related tool would have a
solution for fixing this miscoding; it's not an uncommon problem.)
regards, tom lane
--
Sent via pgsql-bugs mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs