Re: [GENERAL] [to_tsvector] German Compound Words

2015-06-01 Thread Sven R. Kunze
I actually wanted to minimize the installation effort. Thus, I used the 
hunspell-de-de package of Debian/Ubuntu.


Give me a second for ispell.

Below, see the hunspell variant for 
Produktionsintervall/Produktionintervall:


=# select * from ts_debug('public.german_compound', 'Produktionsintervall');
   alias   |   description   |        token         |         dictionaries          | dictionary  |        lexemes
-----------+-----------------+----------------------+-------------------------------+-------------+------------------------
 asciiword | Word, all ASCII | Produktionsintervall | {german_hunspell,german_stem} | german_stem | {produktionsintervall}
(1 row)

=# select * from ts_debug('public.german_compound', 'Produktionintervall');
   alias   |   description   |        token        |         dictionaries          | dictionary  |        lexemes
-----------+-----------------+---------------------+-------------------------------+-------------+-----------------------
 asciiword | Word, all ASCII | Produktionintervall | {german_hunspell,german_stem} | german_stem | {produktionintervall}




PS: I am posting your answer to the list as well.

On 28.05.2015 19:42, Oleg Bartunov wrote:

For readability it's better to use

select * from ts_debug

I remember there is a problem with correct support of hunspell files. 
Did you try ispell files?

Also, I found this message: 
http://www.postgresql.org/message-id/dm1ece$2gb5$1...@news.hub.org

Try this word - Produktionintervall


Re: [GENERAL] [to_tsvector] German Compound Words

2015-06-01 Thread Sven R. Kunze
Alright. I got it running and used 
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ ; specifically: 
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz
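
For readers following along, a minimal sketch of how such an ispell dictionary might be registered. It assumes the archive's files were converted to UTF-8 and copied into $SHAREDIR/tsearch_data as german.dict and german.affix; those file names are hypothetical. The names german_ispell and german_compound_ispell are the ones that appear in the ts_debug output in this message, and the mapping mirrors the hunspell test configuration from earlier in the thread.

```sql
-- Assumption: german.dict and german.affix (hypothetical file names) are
-- in $SHAREDIR/tsearch_data and encoded in UTF-8.
CREATE TEXT SEARCH DICTIONARY german_ispell (
    TEMPLATE = ispell,
    DictFile = german,
    AffFile = german,
    StopWords = german
);

-- Copy the stock German configuration, then put the ispell dictionary
-- in front of the Snowball stemmer for all word-like token types.
CREATE TEXT SEARCH CONFIGURATION public.german_compound_ispell (COPY = pg_catalog.german);

ALTER TEXT SEARCH CONFIGURATION public.german_compound_ispell
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH german_ispell, german_stem;
```

According to the PostgreSQL documentation, compound splitting with the ispell template depends on the affix file declaring a compound flag (a line of the form "compoundwords controlled z") and on the dictionary entries carrying that flag.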


Not sure where to find up-to-date/authorized ispell dictionaries. I 
figured that I need to change this particular dictionary in order to 
avoid "ion" being split away from words like produktION/konstruktION etc.:


=# select * from ts_debug('public.german_compound_ispell', 'konstruktion');
   alias   |   description   |    token     |        dictionaries         |  dictionary   |           lexemes
-----------+-----------------+--------------+-----------------------------+---------------+------------------------------
 asciiword | Word, all ASCII | konstruktion | {german_ispell,german_stem} | german_ispell | {konstruktion,konstrukt,ion}



The splitting of compound words is unfortunately not consistent 
(wasserkraft vs konstruktionsplan):


=# select * from ts_debug('public.german_compound_ispell', 'wasserkraft');
   alias   |   description   |    token    |        dictionaries         |  dictionary   |          lexemes
-----------+-----------------+-------------+-----------------------------+---------------+----------------------------
 asciiword | Word, all ASCII | wasserkraft | {german_ispell,german_stem} | german_ispell | {wasserkraft,wasser,kraft}


=# select * from ts_debug('public.german_compound_ispell', 'konstruktionsplan');
   alias   |   description   |       token       |        dictionaries         |  dictionary   |       lexemes
-----------+-----------------+-------------------+-----------------------------+---------------+---------------------
 asciiword | Word, all ASCII | konstruktionsplan | {german_ispell,german_stem} | german_ispell | {konstruktion,plan}
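
To narrow down where such inconsistencies come from, one option is to query the dictionary on its own and then test an actual match. This is only a sketch; it assumes the dictionary german_ispell and the configuration public.german_compound_ispell exist under the names shown in the output above.

```sql
-- Ask the ispell dictionary alone, bypassing the parser and the
-- german_stem fallback.
SELECT ts_lexize('german_ispell', 'konstruktionsplan');

-- Check whether a search for one compound part finds the full compound.
SELECT to_tsvector('public.german_compound_ispell', 'wasserkraft')
       @@ to_tsquery('public.german_compound_ispell', 'kraft');
```

ts_lexize shows what the dictionary itself returns, while the second query shows whether the indexed lexemes are sufficient for the searches one cares about.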



Not sure how the 'sch' comes about:

=# select * from ts_debug('public.german_compound_ispell', 'rundflansch');
   alias   |   description   |    token    |        dictionaries         |  dictionary   |           lexemes
-----------+-----------------+-------------+-----------------------------+---------------+------------------------------
 asciiword | Word, all ASCII | rundflansch | {german_ispell,german_stem} | german_ispell | {rund,flansch,rund,flan,sch}



This is another funny example:

=# select * from ts_debug('public.german_compound_ispell', 'datenbanken');
   alias   |   description   |    token    |        dictionaries         |  dictionary   |                                     lexemes
-----------+-----------------+-------------+-----------------------------+---------------+---------------------------------------------------------------------------------
 asciiword | Word, all ASCII | datenbanken | {german_ispell,german_stem} | german_ispell | {datenbank,daten,date,banken,daten,date,bank,daten,date,banken,daten,date,bank}





[GENERAL] [to_tsvector] German Compound Words

2015-05-28 Thread Sven R. Kunze

Hi everybody,

What do I need to do in order to enable compound word handling in 
PostgreSQL's tsvector implementation?


I run an Ubuntu 14.04 machine with PostgreSQL 9.3, have installed the package 
hunspell-de-de, and have already created a new dictionary as described here: 
http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY


CREATE TEXT SEARCH DICTIONARY german_hunspell (
    TEMPLATE = ispell,
    DictFile = de_de,
    AffFile = de_de,
    StopWords = german
);

Furthermore, I created a new test text search configuration (copied from german) 
and updated all parser token types where the german_stem dictionary is used so 
that they use german_hunspell first and then german_stem.
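
The setup described above can be sketched roughly as follows. The original statements were not posted, so this is a reconstruction under assumptions: the configuration name public.german_compound and the list of token types are taken from the \dF+ output posted later in this thread.

```sql
-- Sketch only: copy the stock German configuration under a new name.
CREATE TEXT SEARCH CONFIGURATION public.german_compound (COPY = pg_catalog.german);

-- For every token type that used german_stem, try german_hunspell first
-- and fall back to german_stem when the dictionary does not know the word.
ALTER TEXT SEARCH CONFIGURATION public.german_compound
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH german_hunspell, german_stem;
```

Dictionaries in a mapping are consulted in order, and the first one that recognizes the token determines the lexemes, which is why the stemmer goes last.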

However, to_tsvector still does not work for compound words such as:

wasserkraft - wasserkraft, kraft
schifffahrt - schifffahrt, fahrt
blindflansch - blindflansch, flansch

etc.


What have I done wrong here?

--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srku...@tbz-pariv.de
web: www.tbz-pariv.de

Geschäftsführer: Dr. Reiner Wohlgemuth
Sitz der Gesellschaft: Chemnitz
Registergericht: Chemnitz HRB 8543



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [to_tsvector] German Compound Words

2015-05-28 Thread Oleg Bartunov
ts_debug() ?

=# select * from ts_debug('english', 'messages');
   alias   |   description   |  token   |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+----------+----------------+--------------+----------
 asciiword | Word, all ASCII | messages | {english_stem} | english_stem | {messag}



Re: [GENERAL] [to_tsvector] German Compound Words

2015-05-28 Thread Sven R. Kunze

Sure. Here you are:

=# select ts_debug('public.german_compound', 'wasserkraft');
                                            ts_debug
-------------------------------------------------------------------------------------------------
 (asciiword,Word, all ASCII,wasserkraft,{german_hunspell,german_stem},german_stem,{wasserkraft})


=# select ts_debug('public.german_compound', 'schifffahrt');
                                              ts_debug
-----------------------------------------------------------------------------------------------------
 (asciiword,Word, all ASCII,schifffahrt,{german_hunspell,german_stem},german_hunspell,{schifffahrt})


=# select ts_debug('public.german_compound', 'blindflansch');
                                             ts_debug
---------------------------------------------------------------------------------------------------
 (asciiword,Word, all ASCII,blindflansch,{german_hunspell,german_stem},german_stem,{blindflansch})


That is my testing configuration:

=# \dF+ german_compound
Text search configuration public.german_compound
Parser: pg_catalog.default
      Token      |        Dictionaries
-----------------+-----------------------------
 asciihword      | german_hunspell,german_stem
 asciiword       | german_hunspell,german_stem
 email           | simple
 file            | simple
 float           | simple
 host            | simple
 hword           | german_hunspell,german_stem
 hword_asciipart | german_hunspell,german_stem
 hword_numpart   | simple
 hword_part      | german_hunspell,german_stem
 int             | simple
 numhword        | simple
 numword         | simple
 sfloat          | simple
 uint            | simple
 url             | simple
 url_path        | simple
 version         | simple
 word            | german_hunspell,german_stem
