Gregory S. Williamson wrote:
Dear list,

I have been banging my head against a problem for a few days now, and although 
I am making progress it is painfully slow, and I am hoping that someone out 
there can steer me in a better direction.

I've got a medium-sized database (7.4), about 64 gigs of data, about a third of 
which is in one table with ~32 million rows (22 gigs when dumped). This 
largish table has about 20 different columns of varchar or text.

There are some records that have illegal characters in them, according to 
postgres 8.1.5, which imposes stricter standards on UTF-8 encoding.

I've been using copy to dump the big table to disk, then trying to load it into 
my new table. When it fails, I use split to break the file into manageable 
chunks, then use "vi" to find the offending line, then figure out the column. 
Then I use something like:

create table bad_char_gids as
  select gid from parcels
  where position('Ñ' in s_street) > 0;

And so I create a table with the ids of the bad records, and then use replace() 
to either replace or eliminate the offending characters in that column. This 
example caught 5001 records, but often a single record in the whole DB has 
some other offending character. I fix the problem in the load data as well, and 
continue.
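
For concreteness, each fix looks something like this (a rough sketch; the 
column, character, and replacement vary from case to case):

update parcels
   set s_street = replace(s_street, 'Ñ', 'N')  -- or '' to eliminate the character
 where gid in (select gid from bad_char_gids);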

The problem is that there are errors in quite a few of the columns (but only a 
few tens of thousands of records), and the offending characters are all quite 
different (weird diacritics and characters, upper and lower case). And so this 
is a very slow process.

Is there any way to get a list of records, even if done repeatedly for each 
column, that would let me find the offending records in 7.4 which have any 
invalid UTF-8 characters? I am feeling stupid for not seeing one ... I can find 
any individual bad character, but I want to find them all at once, if possible.
Try converting the dump file's encoding to UTF-8. Before 8.1 you could insert 
invalid characters into the DB because it accepted other encodings, and it will 
dump those other encodings back out as well. For example, converting something 
with Windows characters in it:

iconv -f "WINDOWS-1251" -t "UTF-8" dump_file > converted_dump_file

Then import the converted file. You may need to try a couple of different input 
encodings if you aren't sure what encoding was used when inserting data into 
the DB.
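
If you aren't sure where the bad bytes are, you can also locate them first. A 
sketch, assuming GNU iconv is available (the candidate encodings below are just 
guesses):

# iconv reports the position of the first sequence that is not valid UTF-8
iconv -f "UTF-8" -t "UTF-8" dump_file > /dev/null

# probe a few likely source encodings; a failure rules one out (note that
# single-byte encodings like LATIN1 accept any byte, so a clean conversion
# is only a hint, not proof)
for enc in WINDOWS-1251 WINDOWS-1252 LATIN1; do
    iconv -f "$enc" -t "UTF-8" dump_file > /dev/null 2>&1 && echo "$enc converts cleanly"
done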

Russell.

TIA,

Greg Williamson
DBA
GlobeXplorer LLC

