Gah! Accidentally hit Send. Let me finish that last message before
sending this time!


G'day list.

I've been messing around with the unaccent extension and I've noticed
that some of the characters listed in the unaccent.rules file aren't
actually being unaccented on my system.

Here are the system details and whatnot.

- OS X 10.7.2

- the server is compiled via MacPorts. Tried using both the gcc and llvm
4.2.1 compilers that come with the latest version of Xcode.

- the same symptoms show up in both 9.0.5 and 9.1.1. I've also tried
building manually from the latest REL9_1_STABLE branch from git to
make sure MacPorts wasn't the problem, but I'm getting the same
results with both compilers.

When I first do a CREATE EXTENSION for unaccent, I'm seeing the
following warnings in the log file:

===
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 8 of configuration file
"/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules":
"à      a
       "
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 57 of configuration file
"/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules":
"Ġ      G
       "
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 144 of configuration file
"/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules":
"Š      S
       "
===

I've dug around through the unaccent.c code a bit and I've noticed
that the sscanf it does when reading the file is producing some odd
output. I've tried with a minimal example using the same sort of
sscanf code reading from the same unaccent.rules file, but the minimal
example doesn't produce the same output.
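
A minimal test of that kind might look roughly like the sketch below. It
is not the code from unaccent.c: scan_rule, dump_bytes, the 256-byte
buffers and the "%255s %255s" format are all assumptions made for
illustration, and the real format string in unaccent.c may differ. The
idea is only to split a rule line into two whitespace-delimited fields
and print how many bytes ended up in each.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not taken from unaccent.c): split one rule line
 * into a source and a target string using whitespace-delimited %s
 * conversions. Returns the number of fields sscanf converted, which is
 * 2 when both fields were found.
 */
int
scan_rule(const char *line, char src[256], char trg[256])
{
    src[0] = '\0';
    trg[0] = '\0';
    return sscanf(line, "%255s %255s", src, trg);
}

/*
 * Print every byte of s in hex, so that a source string truncated to a
 * single byte is easy to spot.
 */
void
dump_bytes(const char *label, const char *s)
{
    const unsigned char *p = (const unsigned char *) s;

    printf("%s (%lu bytes):", label, (unsigned long) strlen(s));
    for (; *p != '\0'; p++)
        printf(" %02x", *p);
    printf("\n");
}
```

Calling scan_rule on each line read from unaccent.rules and then
dump_bytes on src and trg should show a two-byte source for every
accented character in the list when things work as expected. Note that
a standalone program starts in the "C" locale unless it calls
setlocale() itself, so it may not behave the same way as code running
inside the server.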

I put some elog debugging lines into unaccent.c and found that sscanf
sometimes stops after only one byte of the source character rather
than reading the two bytes required for the complete UTF-8 sequence.
The following characters appear to be causing the problem, shown with
the UTF-8 bytes of the source and the byte of the replacement:

'Å' => 'A' | c3,85 => 41
'à' => 'a' | c3,a0 => 61
'ą' => 'a' | c4,85 => 61
'Ġ' => 'G' | c4,a0 => 47
'Ņ' => 'N' | c5,85 => 4e
'Š' => 'S' | c5,a0 => 53

In each case, only one byte of the source string was read rather than
two, which leads to the "duplicate TO" warnings above. As a result, the
characters that produced the warnings are skipped when unaccent is
called and come through unchanged in the output.

I haven't been able to reproduce in a smaller example, and haven't
been able to reproduce on a CentOS server, so at this point I'm at a
loss as to the problem.

Anybody got any ideas?

Cheers

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
