The source database is PostgreSQL 7.4.3; the target is 7.4.6. The source database was originally created with 'SQL_ASCII' encoding, but all of its data is actually UTF-8. The target database uses 'UNICODE'. The table giving me problems contains over 400,000 rows. (It holds category data from the dmoz project.) In both the source and target databases there is a unique constraint on the topic key. Upon restore I get duplicate key violations on 12 rows, all of which contain non-latin1 data.
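For reference, here is one way to confirm what each server actually thinks its encoding is (a minimal sketch; 'dbname' is a placeholder):

-- run on both source and target; pg_encoding_to_char maps the numeric id to a name
select datname, encoding, pg_encoding_to_char(encoding)
from pg_database
where datname = 'dbname';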
What I've tried:
==========
1) SQL Dump / Restore
source machine: pg_dump -a -t category -U postgres dbname > /tmp/category.dump.sql
target machine, from within psql: \i /tmp/category.dump.sql
which outputs:
\i /tmp/category.dump.sql
SET
SET
SET
SET
SET
UPDATE 1
SET
psql:/tmp/category.dump.sql:24: ERROR: duplicate key violates unique constraint "category_topic_key"
CONTEXT: COPY category, line 133302: "1227568 503988 Top/Kids_and_Teens/International/Korean/�� 0 0"
SET
UPDATE 1
(I also tried it using INSERTs instead of COPY FROM, but with similar results.)
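With a plain SQL dump it may also be worth checking which client_encoding, if any, pg_dump wrote into the header, since that is what psql will use during the restore (a sketch, using the dump file from above):

head -20 /tmp/category.dump.sql | grep -i client_encoding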
2) Binary (custom) Dump / Restore
source machine: pg_dump -F c -Z 8 -t category -U postgres dbname > category_dump.custom.gz
target machine: pg_restore -d dbname -a category_dump.custom.gz
which churns for a while and then gives me this error:
pg_restore: ERROR: duplicate key violates unique constraint "category_topic_key"
CONTEXT: COPY category, line 133302: "1227568 503988 Top/Kids_and_Teens/International/Korean/�� 0 0"
pg_restore: [archiver (db)] error returned by PQendcopy
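With the custom-format archive, the table of contents can at least be inspected before restoring, to confirm exactly which entries pg_restore is about to run (a sketch, using the archive from above):

pg_restore -l category_dump.custom.gz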
3) Remove the unique constraint.
I removed the unique constraint in the target database so I could at least import the data. After that I was able to see exactly which rows had been duplicated:
select sub.topic, sub.cnt from (select topic, count(*) as cnt from category group by topic) sub where cnt > 1;
topic | cnt
-----------------------------------------------------------------------+-----
Top/Adult/World/Japanese/���������/��� | 2
Top/Adult/World/Japanese/����������/��� /���/�� | 10
Top/Adult/World/Korean/�거 | 4
Top/Adult/World/Korean/��_����/�� | 2
Top/Adult/World/Korean/�� /문� | 2
Top/Adult/World/Korean/���� | 2
Top/Adult/World/Korean/������_X/�� | 2
Top/Kids_and_Teens/International/Japanese/��� | 2
Top/Kids_and_Teens/International/Japanese/������ | 2
Top/Kids_and_Teens/International/Japanese/����������/��� | 4
Top/Kids_and_Teens/International/Japanese/趣������/���� | 2
Top/Kids_and_Teens/International/Korean/�� | 5
(12 rows)
Performing this same query on the source database:
select sub.topic, sub.cnt from (select topic, count(*) as cnt from category group by topic) sub where cnt > 1;
topic | cnt
-------+-----
(0 rows)
4) Attempted to identify the category_id values of the duplicate rows:
select category_id from category where topic = 'Top/Adult/World/Korean/�거';
category_id
-------------
(0 rows)
I believe this failed due to some sort of encoding or font mismatch somewhere between xterm, psql, the database, or even the X clipboard. Note that the data does display correctly when viewed in Mozilla. Still, it would be nice to be able to copy/paste a psql result string, use it as input, and actually find a match!
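One workaround is to let the database do the matching itself: join the table back against the duplicated topics so the multibyte strings never have to be typed or pasted (a minimal sketch):

-- list the category_id of every row whose topic is duplicated,
-- without ever entering the non-ascii string by hand
select c.category_id, c.topic
from category c
join (select topic
      from category
      group by topic
      having count(*) > 1) d on c.topic = d.topic
order by c.topic, c.category_id;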
5) Manual inspection of one of the rows.
I chose the topic 'Top/Adult/World/Korean/�거' to pursue further and executed the following query, looking for multiple instances of that string. There should be 4 according to the duplicates query above.
select category_id, topic from category where topic like 'Top/Adult/World/Korean%';
category_id | topic
-------------+----------------------------------------------
328048 | Top/Adult/World/Korean/���,CD
381025 | Top/Adult/World/Korean/���,CD/백�CD
400131 | Top/Adult/World/Korean/������_X
400136 | Top/Adult/World/Korean/������_X/���
400133 | Top/Adult/World/Korean/������_X/���
5830581 | Top/Adult/World/Korean/미��
5830906 | Top/Adult/World/Korean/����
589823 | Top/Adult/World/Korean/��_����/기�,�체
324253 | Top/Adult/World/Korean
367742 | Top/Adult/World/Korean/�거
378503 | Top/Adult/World/Korean/��
590650 | Top/Adult/World/Korean/�� /��
378504 | Top/Adult/World/Korean/�� /문�
590649 | Top/Adult/World/Korean/�� /� ����
5828700 | Top/Adult/World/Korean/������_X/��
5812536 | Top/Adult/World/Korean/��
5832542 | Top/Adult/World/Korean/����/취�,��
364360 | Top/Adult/World/Korean/��
324254 | Top/Adult/World/Korean/��_����
592044 | Top/Adult/World/Korean/��_����/��
5852704 | Top/Adult/World/Korean/��_����/��
406487 | Top/Adult/World/Korean/��_����/����
365940 | Top/Adult/World/Korean/����
(23 rows)
Yet I see only one row that matches the string exactly: the one with category_id = 367742.
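Two strings that render identically (or identically garbled) in the terminal can still differ at the byte level, so comparing character length against byte length, or dumping the raw bytes as hex, may show whether these rows really are byte-identical. A sketch, assuming the topics contain no backslashes (decode's 'escape' format would misinterpret them):

-- length() counts characters under the database encoding; octet_length() counts bytes
select category_id, length(topic) as chars, octet_length(topic) as bytes,
       encode(decode(topic, 'escape'), 'hex') as topic_hex
from category
where topic like 'Top/Adult/World/Korean/%'
order by topic;

Alternatively, piping unaligned psql output (psql -A -t) through od -c at the shell shows the raw bytes directly.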
6) Attempt to import same data back into source database (7.4.3)
Acting on the theory that this is possibly a new problem in 7.4.6, I tried the following in the source DB (still with SQL_ASCII encoding), which worked just fine:
create table category_tmp as select * from category;
alter table category_tmp add constraint category_tmp_topic_key unique (topic);
NOTICE: ALTER TABLE / ADD UNIQUE will create implicit index "category_tmp_topic_key" for table "category_tmp"
ALTER TABLE
pg_dump -a -t category_tmp -U postgres dbname > category_tmp_dump.sql
psql
delete from category_tmp;
DELETE 461153
\i category_tmp_dump.sql
SET
SET
SET
SET
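A quick way to quantify how differently the two servers compare these strings would be to run the same distinct-count on each side over identical data (a sketch):

-- run on both source and target; identical data should yield identical
-- counts if string comparison behaves the same on both servers
select count(*) as total_rows, count(distinct topic) as distinct_topics
from category;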
7) Experiments with encoding
a) Used GNU recode to recode the SQL dump file to UTF-8. The resulting file was unchanged, meaning the data was already valid UTF-8. (An independent check is sketched after this list.)
b) Updated the encoding in source database to 'UNICODE' to match target database.
update pg_database set encoding = 6 where datname = 'dbname';
c) Re-imported the data back into the source database again (as in 6). Worked fine again.
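A couple of sanity checks for (a) and (b): iconv exits non-zero on invalid input, so it can confirm the dump really is valid UTF-8 without writing a recoded copy, and the pg_database query from earlier verifies that the encoding change took effect (a sketch; paths and names are placeholders):

# confirm the dump is valid UTF-8; iconv fails on the first bad sequence
iconv -f UTF-8 -t UTF-8 /tmp/category.dump.sql > /dev/null && echo "valid UTF-8"
# verify the source database now reports UNICODE (encoding id 6)
psql -d dbname -c "select datname, pg_encoding_to_char(encoding) from pg_database;"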
So at this point I am mostly at a loss. I would have thought that after changing the source DB to UNICODE encoding, it would exhibit the same behavior as the target. I can think of two explanations:
1: initdb does something with the encoding beyond setting pg_database(encoding).
2: there is a bug in 7.4.6 that does not exist in 7.4.3.
I suppose the next step is to create a new DB under 7.4.3 using UNICODE and attempt to import the data in the same manner; a sketch of that follows. But right now I need a break.
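For the record, that next step might look something like this under 7.4.3 ('testdb' is a hypothetical name, and since the dump is data-only, the table and its unique constraint must be recreated first):

# create an empty database with UNICODE encoding
createdb -E UNICODE -U postgres testdb
# recreate the category table and its unique constraint, then load the data-only dump
psql -d testdb -U postgres -f /tmp/category.dump.sql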
Dan Libby
