Re: [GENERAL] finding rows with invalid characters

Sim Zacks Tue, 23 Nov 2010 00:41:51 -0800

Thanks for your help.

I think a trigger will actually be the easiest. The way i can tell ifthere is invalid data is simply to do an Upper(text) and if it hasinvalid data it fails.

I dumped the fixed database. Now I have a years worth of backups that Ican't restore a specific table from. It most probably will never meananything. Every once in a while, I get asked to check what was in the dbagainst what is in there, but this table will probably never be audited.



Sim



On 11/23/2010 10:33 AM, Dmitriy Igrishin wrote:

Hey Sim,

2010/11/23 Sim Zacks <[email protected] <mailto:[email protected]>>

    On 11/21/2010 05:55 PM, Dmitriy Igrishin wrote:

    Hey Sim,

    Maybe this helps:
    http://blog.tapoueh.org/articles/blog/_Getting_out_of_SQL_ASCII,_part_2.html

    That worked to find some of them. I still needed to find a bunch
    of others manually, such as 0xa0 and 0xd725 which weren't found
    with that function. I finally figured out that
    select * from emaildetails where emailbody like '%\xa0%' and
    select * from emaildetails where emailbody like '%\xd7\x25%' would
    show me all those rows.

    My 2 big problems now are:

    A) how to make sure that these chars are not inserted in the
    future. The database should prevent them from being inserted.

Consider to use domains -- generic-based types with constraints --instead of generic types.

Use regular expressions in constraints.

As alternative, you can use triggers for more complex validation. Butdomains in you case

IMO will work good.


    B) How to fix the backups that I have so that I can restore them.
    As I mentioned, they are being taken with -Fc

Oops. Why not dump fixed database ?


    Sim

    2010/11/21 Sim Zacks <[email protected] <mailto:[email protected]>>

        I am using PG 8.2.17 with UTF8 encoding.
        "PostgreSQL 8.2.17 on i686-pc-linux-gnu, compiled by GCC gcc
        (GCC) 4.1.1 (Gentoo 4.1.1)"

        One of my tables somehow has invalid characters in it:

            ERROR:  invalid byte sequence for encoding "UTF8": 0xa9
            HINT:  This error can also happen if the byte sequence
            does not match the encoding expected by the server, which
            is controlled by "client_encoding".

        I have already manually found a number of the bad rows by
        running queries with text functions (upper) between groups of
        IDs until I found the specific bad row.

        1) Is there a quicker way to get a list of all rows with
        invalid characters
        2) Shouldn't the database prevent these rows from being
        entered in the first place?
        3) I have backups of this database (using -Fc) and I noticed
        that on restore, this table is not restored because of this
        error. Is there a way to fix the existing backups, or tell
        the restore to ignore bad rows instead of erroring out the
        whole table?

        Thanks
        Sim

--Sent via pgsql-general mailing list

        ([email protected]
        <mailto:[email protected]>)
        To make changes to your subscription:
        http://www.postgresql.org/mailpref/pgsql-general

--// Dmitriy.





--
// Dmitriy.

Re: [GENERAL] finding rows with invalid characters

Reply via email to