Re: [GENERAL] [to_tsvector] German Compound Words
I actually wanted to minimize the installation effort, so I used the hunspell-de-de package from Debian/Ubuntu. Give me a second for ispell.

Below is the hunspell variant for Produktionsintervall/Produktionintervall:

=# select * from ts_debug('public.german_compound', 'Produktionsintervall');
   alias   |   description   |        token         |         dictionaries          | dictionary  |        lexemes
-----------+-----------------+----------------------+-------------------------------+-------------+------------------------
 asciiword | Word, all ASCII | Produktionsintervall | {german_hunspell,german_stem} | german_stem | {produktionsintervall}
(1 row)

=# select * from ts_debug('public.german_compound', 'Produktionintervall');
   alias   |   description   |        token        |         dictionaries          | dictionary  |        lexemes
-----------+-----------------+---------------------+-------------------------------+-------------+-----------------------
 asciiword | Word, all ASCII | Produktionintervall | {german_hunspell,german_stem} | german_stem | {produktionintervall}

PS: I am posting your answer to the list as well.

On 28.05.2015 19:42, Oleg Bartunov wrote:

For readability it's better to use select * from ts_debug.

I remember there is a problem with correct support of hunspell files. Did you try ispell files?

Also, I found this message: http://www.postgresql.org/message-id/dm1ece$2gb5$1...@news.hub.org

Try this word: Produktionintervall
Re: [GENERAL] [to_tsvector] German Compound Words
Alright. I got it running and used http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ ; specifically: http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz

I am not sure where to find up-to-date/authorized ispell dictionaries.

I figured that I need to change this particular dictionary in order to avoid "ion" being split away from words like produktION/konstruktION etc.:

=# select * from ts_debug('public.german_compound_ispell', 'konstruktion');
   alias   |   description   |    token     |        dictionaries         |  dictionary   |           lexemes
-----------+-----------------+--------------+-----------------------------+---------------+------------------------------
 asciiword | Word, all ASCII | konstruktion | {german_ispell,german_stem} | german_ispell | {konstruktion,konstrukt,ion}

The splitting of compound words is unfortunately not consistent (wasserkraft vs. konstruktionsplan):

=# select * from ts_debug('public.german_compound_ispell', 'wasserkraft');
   alias   |   description   |    token    |        dictionaries         |  dictionary   |          lexemes
-----------+-----------------+-------------+-----------------------------+---------------+----------------------------
 asciiword | Word, all ASCII | wasserkraft | {german_ispell,german_stem} | german_ispell | {wasserkraft,wasser,kraft}

=# select * from ts_debug('public.german_compound_ispell', 'konstruktionsplan');
   alias   |   description   |       token       |        dictionaries         |  dictionary   |       lexemes
-----------+-----------------+-------------------+-----------------------------+---------------+---------------------
 asciiword | Word, all ASCII | konstruktionsplan | {german_ispell,german_stem} | german_ispell | {konstruktion,plan}

I am not sure how the 'sch' comes about:

=# select * from ts_debug('public.german_compound_ispell', 'rundflansch');
   alias   |   description   |    token    |        dictionaries         |  dictionary   |           lexemes
-----------+-----------------+-------------+-----------------------------+---------------+------------------------------
 asciiword | Word, all ASCII | rundflansch | {german_ispell,german_stem} | german_ispell | {rund,flansch,rund,flan,sch}

This is another funny example:

=# select * from ts_debug('public.german_compound_ispell', 'datenbanken');
   alias   |   description   |    token    |        dictionaries         |  dictionary   | lexemes
-----------+-----------------+-------------+-----------------------------+---------------+---------
 asciiword | Word, all ASCII | datenbanken | {german_ispell,german_stem} | german_ispell | {datenbank,daten,date,banken,daten,date,bank,daten,date,banken,daten,date,bank}
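For readers who want to reproduce the ispell setup, a minimal sketch follows. It is not taken from the thread: the names german_ispell and public.german_compound_ispell appear in the ts_debug output above, but the file base name "german" and the list of remapped token types are assumptions. PostgreSQL looks for the dictionary, affix, and stop-word files in $SHAREDIR/tsearch_data with the extensions .dict, .affix, and .stop.

```sql
-- Sketch only. Assumption: the files from the tarball were copied to
-- $SHAREDIR/tsearch_data as german.dict and german.affix (hypothetical
-- base name "german"), and converted to UTF-8 if they were shipped in
-- another encoding.
CREATE TEXT SEARCH DICTIONARY german_ispell (
    TEMPLATE  = ispell,
    DictFile  = german,
    AffFile   = german,
    StopWords = german
);

-- Start from the built-in german configuration.
CREATE TEXT SEARCH CONFIGURATION public.german_compound_ispell (
    COPY = pg_catalog.german
);

-- Assumption: the same word-like token types are remapped as in the
-- hunspell test configuration; german_stem remains the fallback.
ALTER TEXT SEARCH CONFIGURATION public.german_compound_ispell
    ALTER MAPPING FOR asciihword, asciiword, hword,
                      hword_asciipart, hword_part, word
    WITH german_ispell, german_stem;
```

Compound splitting with the ispell template depends on the affix file declaring a compound flag (a "compoundwords controlled" statement) and on the dictionary entries carrying that flag, so editing the .dict or .affix file is probably where unwanted splits such as "ion" would have to be addressed.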
[GENERAL] [to_tsvector] German Compound Words
Hi everybody,

what do I need to do in order to enable compound word handling in the PostgreSQL tsvector implementation?

I run an Ubuntu 14.04 machine with PostgreSQL 9.3, have installed the package hunspell-de-de, and have already created a new dictionary as described here: http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY

CREATE TEXT SEARCH DICTIONARY german_hunspell (
    TEMPLATE = ispell,
    DictFile = de_de,
    AffFile = de_de,
    StopWords = german
);

Furthermore, I created a new test text search configuration (copied from german) and updated all parser parts where the german_stem dictionary is used so that they use german_hunspell first and then german_stem.

However, to_tsvector still does not work for compound words such as:

wasserkraft - wasserkraft, kraft
schifffahrt - schifffahrt, fahrt
blindflansch - blindflansch, flansch

etc.

What have I done wrong here?

-- 
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srku...@tbz-pariv.de
web: www.tbz-pariv.de

Managing director: Dr. Reiner Wohlgemuth
Registered office: Chemnitz
Register court: Chemnitz HRB 8543

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
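The configuration step described in this message could look roughly like the sketch below. It is an illustration, not the poster's actual commands: the configuration name public.german_compound and the six token types are taken from the \dF+ output posted elsewhere in this thread, and the final query is only a suggested check.

```sql
-- Sketch only: copy the built-in german configuration under a new name.
CREATE TEXT SEARCH CONFIGURATION public.german_compound (
    COPY = pg_catalog.german
);

-- Send every word-like token type to german_hunspell first; tokens the
-- dictionary does not recognize fall through to the german_stem stemmer.
ALTER TEXT SEARCH CONFIGURATION public.german_compound
    ALTER MAPPING FOR asciihword, asciiword, hword,
                      hword_asciipart, hword_part, word
    WITH german_hunspell, german_stem;

-- Check what ends up in the tsvector for one of the test words.
SELECT to_tsvector('public.german_compound', 'wasserkraft');
```

The dictionary order matters: the first dictionary that recognizes a token decides its lexemes, so the stemmer only sees words that german_hunspell rejects.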
Re: [GENERAL] [to_tsvector] German Compound Words
ts_debug() ?

=# select * from ts_debug('english', 'messages');
   alias   |   description   |  token   |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+----------+----------------+--------------+----------
 asciiword | Word, all ASCII | messages | {english_stem} | english_stem | {messag}
Re: [GENERAL] [to_tsvector] German Compound Words
Sure. Here you are:

=# select ts_debug('public.german_compound', 'wasserkraft');
                                             ts_debug
--------------------------------------------------------------------------------------------------
 (asciiword,Word, all ASCII,wasserkraft,{german_hunspell,german_stem},german_stem,{wasserkraft})

=# select ts_debug('public.german_compound', 'schifffahrt');
                                               ts_debug
------------------------------------------------------------------------------------------------------
 (asciiword,Word, all ASCII,schifffahrt,{german_hunspell,german_stem},german_hunspell,{schifffahrt})

=# select ts_debug('public.german_compound', 'blindflansch');
                                              ts_debug
----------------------------------------------------------------------------------------------------
 (asciiword,Word, all ASCII,blindflansch,{german_hunspell,german_stem},german_stem,{blindflansch})

That is my testing configuration:

=# \dF+ german_compound
Text search configuration public.german_compound
Parser: pg_catalog.default
      Token      |        Dictionaries
-----------------+-----------------------------
 asciihword      | german_hunspell,german_stem
 asciiword       | german_hunspell,german_stem
 email           | simple
 file            | simple
 float           | simple
 host            | simple
 hword           | german_hunspell,german_stem
 hword_asciipart | german_hunspell,german_stem
 hword_numpart   | simple
 hword_part      | german_hunspell,german_stem
 int             | simple
 numhword        | simple
 numword         | simple
 sfloat          | simple
 uint            | simple
 url             | simple
 url_path        | simple
 version         | simple
 word            | german_hunspell,german_stem
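In the ts_debug output for this configuration, the dictionary column shows german_stem for wasserkraft and blindflansch, which suggests german_hunspell did not recognize those tokens at all. One way to confirm this per dictionary is ts_lexize, which bypasses the configuration. The queries below are a sketch using the dictionary names from this thread; no output is shown because it depends on the installed dictionary files.

```sql
-- ts_lexize returns the lexemes if the dictionary knows the token,
-- an empty array for a stop word, and NULL if the token is unknown.
SELECT ts_lexize('german_hunspell', 'wasserkraft');
SELECT ts_lexize('german_hunspell', 'schifffahrt');
SELECT ts_lexize('german_hunspell', 'blindflansch');

-- The fallback stemmer accepts any word, so it always returns a result.
SELECT ts_lexize('german_stem', 'wasserkraft');
```

A NULL from german_hunspell for a compound means the problem lies in the dictionary or affix files rather than in the configuration mapping.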