Hi.

Introduction
============

PostgreSQL's full-text search uses dictionaries from various open-source spell checkers to perform word normalization.

Currently, Ispell, MySpell and Hunspell dictionaries are supported.

A dictionary requires two files: a dictionary file and an affix file. The dictionary file contains a list of words. Each word may be followed by one or more affix flags. The affix file contains parameters, definitions, and the prefix and suffix classes used in the dictionary file.
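To make the file layout concrete, here is a minimal sketch of splitting one dictionary-file entry into the word and its flag field. The function name and the sample entries are hypothetical, and the sketch ignores details such as escaped slashes and the word-count line at the top of a real Hunspell dictionary file.

```python
def split_dict_entry(line):
    """Split one dictionary-file entry into (word, flag_field).

    Simplified on purpose: the flag field is whatever follows the
    first '/' on the line; an entry without '/' has no flags.
    """
    entry = line.strip()
    word, _sep, flag_field = entry.partition("/")
    return word, flag_field


# Hypothetical sample entries, not taken from any real dictionary.
for sample in ["hello/AB", "world"]:
    print(split_dict_entry(sample))
```

How the flag field is then broken into individual flags depends on the flag type declared in the affix file.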

The most complete and actively developed are the Hunspell dictionaries (http://hunspell.sourceforge.net/). The OpenOffice and LibreOffice projects recently switched from MySpell to Hunspell dictionaries.

However, PostgreSQL is unable to load recent versions of the Hunspell dictionaries for several languages.

This is because the affix files of these dictionaries have grown too big. Traditionally, each affix rule is named by a single extended ASCII (8-bit) symbol, so if there are more than 192 rules, a syntax extension is needed.

To handle these dictionaries, Hunspell has a FLAG parameter with the following values:
* FLAG long - sets the double extended ASCII character flag type
* FLAG num - sets the decimal number flag type (from 1 to 65000)
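As an illustration of the difference between the flag types, the following sketch splits a flag field into individual flags for each mode. It is not the PostgreSQL patch; decode_flags and its mode names are hypothetical. It relies on the Hunspell convention that numeric flags in a flag field are separated by commas.

```python
def decode_flags(field, mode="char"):
    """Split a flag field into individual flags.

    mode "char": one character per flag (the traditional default).
    mode "long": two characters per flag (FLAG long).
    mode "num":  comma-separated decimal numbers, 1..65000 (FLAG num).
    """
    if not field:
        return []
    if mode == "char":
        return list(field)
    if mode == "long":
        if len(field) % 2 != 0:
            raise ValueError("FLAG long field must have an even length")
        return [field[i:i + 2] for i in range(0, len(field), 2)]
    if mode == "num":
        flags = []
        for part in field.split(","):
            value = int(part)  # raises ValueError on non-numeric input
            if not 1 <= value <= 65000:
                raise ValueError("numeric flag out of range: %d" % value)
            flags.append(value)
        return flags
    raise ValueError("unknown flag mode: %r" % mode)


print(decode_flags("AB"))                  # one flag per character
print(decode_flags("AaBb", "long"))        # one flag per character pair
print(decode_flags("1,214,65000", "num"))  # one flag per number
```

The same field text can therefore mean different flag sets depending on the FLAG line, which is why a loader that ignores FLAG misreads these dictionaries.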

These flag types are used in the affix files of dictionaries such as ar, br_fr, ca, ca_valencia, da_dk, en_ca, en_gb, en_us, fr, gl_es, is, ne_np, nl_nl, si_lk (from http://cgit.freedesktop.org/libreoffice/dictionaries/tree/). But PostgreSQL does not support the FLAG parameter and cannot load these dictionaries.

There is also an AF parameter, which allows affix flag sets to be replaced with ordinal numbers in affix and dictionary files.
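A rough sketch of how AF substitution could be resolved is shown below. It assumes the Hunspell layout in which the first AF line gives the number of aliases and each following AF line defines one flag set, referenced by its 1-based position. The function names and sample lines are hypothetical.

```python
def parse_af_aliases(affix_lines):
    """Collect the flag sets defined by AF lines, in file order."""
    aliases = []
    seen_count_line = False
    for line in affix_lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] != "AF":
            continue
        if not seen_count_line:
            # The first AF line only announces how many aliases follow.
            seen_count_line = True
        else:
            aliases.append(fields[1])
    return aliases


def resolve_alias(field, aliases):
    """Replace an ordinal number with the flag set it stands for."""
    if aliases and field.isdigit():
        return aliases[int(field) - 1]  # aliases are numbered from 1
    return field


# Hypothetical affix-file fragment.
sample = ["SET UTF-8", "AF 2", "AF AB", "AF ACD"]
aliases = parse_af_aliases(sample)
print(resolve_alias("2", aliases))
```

With this in place, a dictionary entry such as word/2 would be expanded to the second flag set before the flags are decoded.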

Neither the FLAG nor the AF parameter is supported by PostgreSQL. Supporting them would allow the dictionaries listed above to be loaded into a PostgreSQL database and used in full-text search.

Proposed Changes
================

The internal representation of a dictionary in PostgreSQL doesn't impose overly strict limits on the number of affix rules. There is a flagval array whose size must be increased from 256 to 65000.

All other changes are in the affix file parsing code, so that it properly parses long and numeric flags.

I've already implemented support for FLAG long; it requires a relatively small patch (60 lines). Support for FLAG num would require a comparable amount of code.

These changes would allow recent versions of the following Hunspell dictionaries to be used:
br_fr, ca, ca_valencia, da_dk, gl_es, is, ne_np, nl_nl, si_lk.

Implementing the AF parameter would also allow the following dictionaries to be supported:
ar, en_ca, en_gb, en_us, fr, hu_hu.

Expected Results
================

These changes would allow more recent and complete spelling dictionaries to be used for word stemming during full-text indexing.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

