Gah! Accidentally hit Send. Let me finish that last message before sending this time!
G'day list.

I've been messing around with the unaccent extension and I've noticed that some of the characters listed in the unaccent.rules file aren't actually being unaccented on my system. Here are the system details and whatnot.

- OS X 10.7.2
- the server is compiled via macports. Tried using both the gcc and llvm 4.2.1 compilers that come with the latest version of Xcode.
- the same symptoms show up in both 9.0.5 and 9.1.1. I've also tried building manually from the latest REL9_1_STABLE branch from git to make sure macports wasn't the problem, but I'm getting the same results with both compilers.

When I first do a CREATE EXTENSION for unaccent, I'm seeing the following warnings in the log file:

===
WARNING: duplicate TO argument, use first one
CONTEXT: line 8 of configuration file "/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules": "à a "
WARNING: duplicate TO argument, use first one
CONTEXT: line 57 of configuration file "/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules": "Ġ G "
WARNING: duplicate TO argument, use first one
CONTEXT: line 144 of configuration file "/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules": "Š S "
===

I've dug around through the unaccent.c code a bit and I've noticed that the sscanf it does when reading the file is producing some odd output. I've tried a minimal example using the same sort of sscanf code reading from the same unaccent.rules file, but the minimal example doesn't produce the same output.

I put some elog debugging lines into unaccent.c and found that sscanf sometimes splits the scanned line so that the source character gets only one byte rather than the two required for the complete UTF-8 code point.
It appears that the following characters are causing the problem, listed with their UTF-8 bytes and the byte of the target character:

'Å' => 'A' | c3,85 => 41
'à' => 'a' | c3,a0 => 61
'ą' => 'a' | c4,85 => 61
'Ġ' => 'G' | c4,a0 => 47
'Ņ' => 'N' | c5,85 => 4e
'Š' => 'S' | c5,a0 => 53

In each case, only one byte was being read into the source string rather than two, leading to the "duplicate TO" warnings above. As a result, the characters that produced the warnings are ignored when unaccent is called and are left as-is in the output.

I haven't been able to reproduce this in a smaller example, and haven't been able to reproduce it on a CentOS server, so at this point I'm at a loss as to the problem.

Anybody got any ideas?

Cheers