Re: [GENERAL] [tsvector] to_tsvector called multiple times

2015-05-26 Thread Oleg Bartunov
You can ask http://snowball.tartarus.org/ for stemmer. Meanwhile,
you can have small personal dictionary (before stemmer) with such
exceptions, for example, use synonym template

system system

Oleg


On Tue, May 26, 2015 at 11:18 AM, Sven R. Kunze srku...@tbz-pariv.de
wrote:

 Hi everybody,

 the following stemming results made me curious:

 select to_tsvector('german', 'systeme');  'system':1
 select to_tsvector('german', 'systemes');  'system':1
 select to_tsvector('german', 'systems');  'system':1
 select to_tsvector('german', 'systemen');  'system':1
 select to_tsvector('german', 'system');   'syst':1


 First of all, this seems to be a bug in the German stemmer. Where can I
 fix it?

 Second, and more importantly, as I understand it, the stemmed version of a
 word should be considered normalized. That is, all other versions of that
 stem should be mapped to it as well. The interesting problem here is that
 PostgreSQL maps the stem itself ('system') to a completely different stem
 ('syst').

 Should a stem not remain stable even when to_tsvector is called on it
 multiple times?

 --
 Sven R. Kunze
 TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
 Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
 e-mail: srku...@tbz-pariv.de
 web: www.tbz-pariv.de

 Geschäftsführer: Dr. Reiner Wohlgemuth
 Sitz der Gesellschaft: Chemnitz
 Registergericht: Chemnitz HRB 8543



 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general



[GENERAL] [tsvector] to_tsvector called multiple times

2015-05-26 Thread Sven R. Kunze

Hi everybody,

the following stemming results made me curious:

select to_tsvector('german', 'systeme');  'system':1
select to_tsvector('german', 'systemes');  'system':1
select to_tsvector('german', 'systems');  'system':1
select to_tsvector('german', 'systemen');  'system':1
select to_tsvector('german', 'system');   'syst':1


First of all, this seems to be a bug in the German stemmer. Where can I 
fix it?


Second, and more importantly, as I understand it, the stemmed version of 
a word should be considered normalized. That is, all other versions of 
that stem should be mapped to it as well. The interesting problem here 
is that PostgreSQL maps the stem itself ('system') to a completely 
different stem ('syst').


Should a stem not remain stable even when to_tsvector is called on it 
multiple times?


--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srku...@tbz-pariv.de
web: www.tbz-pariv.de

Geschäftsführer: Dr. Reiner Wohlgemuth
Sitz der Gesellschaft: Chemnitz
Registergericht: Chemnitz HRB 8543



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [tsvector] to_tsvector called multiple times

2015-05-26 Thread Albe Laurenz
Sven R. Kunze wrote:
 the following stemming results made me curious:
 
 select to_tsvector('german', 'systeme');  'system':1
 select to_tsvector('german', 'systemes');  'system':1
 select to_tsvector('german', 'systems');  'system':1
 select to_tsvector('german', 'systemen');  'system':1
 select to_tsvector('german', 'system');   'syst':1
 
 
 First of all, this seems to be a bug in the German stemmer. Where can I
 fix it?

As far as I understand, the stemmer is not perfect, it is just a best
effort at German stemming.  It does not have a dictionary of valid German
words, but uses an algorithm based on only the occurring letters.

This web page describes the algorithm:
http://snowball.tartarus.org/algorithms/german/stemmer.html
I guess that the Snowball folks (and PostgreSQL) would be interested
if you could come up with a better algorithm.

In this specific case, the stemmer goes wrong because System is a
foreign word whose ending is atypical for German.  The algorithm cannot
distinguish between System and, say, lautem or bestem.

 Second, and more importantly, as I understand it, the stemmed version of
 a word should be considered normalized. That is, all other versions of
 that stem should be mapped to it as well. The interesting problem here
 is that PostgreSQL maps the stem itself ('system') to a completely
 different stem ('syst').
 
 Should a stem not remain stable even when to_tsvector is called on it
 multiple times?

That's a possible position, but consider that a stem is not necessarily
a valid German word.  If you treat it as a German word (by stemming it),
the results might not be what you desire.

For example:

test= select to_tsvector('german', 'linsen');
 to_tsvector
-
 'lins':1
(1 row)

test= select to_tsvector('german', 'lins');
 to_tsvector
-
 'lin':1
(1 row)

I guess that your real problem here is that a search for system
will not find systeme, which is indeed unfortunate.
But until somebody can come up with a better stemming algorithm, cases
like that can always occur.

Yours,
Laurenz Albe

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [tsvector] to_tsvector called multiple times

2015-05-26 Thread Sven R. Kunze
Thanks. It seems as if I have use snowball. So, I go ahead and post my 
issue at github.



Maybe, I have difficulties to understand the relationship/dependencies 
between all these 'maybe' available dictionary/parser/stemmer packages.


What happens if I install all packages for a single language? (hunspell, 
myspell, ispell, snowball)


Are they complementary? Do they replace each other?


 \dFd
 List of text search dictionaries
   Schema   |  Name   | Description
+-+---
 pg_catalog | danish_stem | snowball stemmer for danish language
 pg_catalog | dutch_stem  | snowball stemmer for dutch language
 pg_catalog | english_stem| snowball stemmer for english language
 pg_catalog | finnish_stem| snowball stemmer for finnish language
 pg_catalog | french_stem | snowball stemmer for french language
 pg_catalog | german_stem | snowball stemmer for german language
 pg_catalog | hungarian_stem  | snowball stemmer for hungarian language
 pg_catalog | italian_stem| snowball stemmer for italian language
 pg_catalog | norwegian_stem  | snowball stemmer for norwegian language
 pg_catalog | portuguese_stem | snowball stemmer for portuguese language
 pg_catalog | romanian_stem   | snowball stemmer for romanian language
 pg_catalog | russian_stem| snowball stemmer for russian language
 pg_catalog | simple  | simple dictionary: just lower case and 
check for stopword

 pg_catalog | spanish_stem| snowball stemmer for spanish language
 pg_catalog | swedish_stem| snowball stemmer for swedish language
 pg_catalog | turkish_stem| snowball stemmer for turkish language
(16 rows)


On 26.05.2015 12:09, Albe Laurenz wrote:

Sven R. Kunze wrote:

However, are you sure, I am using snowball? Maybe, I am reading the
documenation wrong:

test= SELECT * FROM ts_debug('german', 'system');
alias   |   description   | token  | dictionaries  | dictionary  | lexemes
---+-++---+-+-
  asciiword | Word, all ASCII | system | {german_stem} | german_stem | {syst}
(1 row)

test= \dFd german_stem
 List of text search dictionaries
Schema   |Name | Description
+-+--
  pg_catalog | german_stem | snowball stemmer for german language
(1 row)


http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html
but it seems as it depends on which packages (ispell, hunspell, myspell,
snowball + corresponding languages) my system has installed.

Is there an easy way to determine which of these packages PostgreSQL
uses AND what for?

If you use a standard PostgreSQL distribution, you will have no ispell
dictionary (as the documentation you quote says).
You can always list all dictionaries with \dFd in psql.


Sure. That might be the problem. It occurs to me that stems (if detected
as such) should be left alone.
In case a stem is real German word, it should be stemmed to itself anyway
If not, it might help not to stem in order to avoid errors.

Yes, but that would mean that you have a way to determine from a string
whether it is a word or a stem or both, and the software does not do that.

Yours,
Laurenz Albe



Regards,

--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srku...@tbz-pariv.de
web: www.tbz-pariv.de

Geschäftsführer: Dr. Reiner Wohlgemuth
Sitz der Gesellschaft: Chemnitz
Registergericht: Chemnitz HRB 8543



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [tsvector] to_tsvector called multiple times

2015-05-26 Thread Sven R. Kunze

For future reference: https://github.com/snowballstem/snowball/issues/19


On 26.05.2015 12:29, Sven R. Kunze wrote:
Thanks. It seems as if I have use snowball. So, I go ahead and post my 
issue at github.



Maybe, I have difficulties to understand the relationship/dependencies 
between all these 'maybe' available dictionary/parser/stemmer packages.


What happens if I install all packages for a single language? 
(hunspell, myspell, ispell, snowball)


Are they complementary? Do they replace each other?


 \dFd
 List of text search dictionaries
   Schema   |  Name   | Description
+-+--- 


 pg_catalog | danish_stem | snowball stemmer for danish language
 pg_catalog | dutch_stem  | snowball stemmer for dutch language
 pg_catalog | english_stem| snowball stemmer for english language
 pg_catalog | finnish_stem| snowball stemmer for finnish language
 pg_catalog | french_stem | snowball stemmer for french language
 pg_catalog | german_stem | snowball stemmer for german language
 pg_catalog | hungarian_stem  | snowball stemmer for hungarian language
 pg_catalog | italian_stem| snowball stemmer for italian language
 pg_catalog | norwegian_stem  | snowball stemmer for norwegian language
 pg_catalog | portuguese_stem | snowball stemmer for portuguese language
 pg_catalog | romanian_stem   | snowball stemmer for romanian language
 pg_catalog | russian_stem| snowball stemmer for russian language
 pg_catalog | simple  | simple dictionary: just lower case and 
check for stopword

 pg_catalog | spanish_stem| snowball stemmer for spanish language
 pg_catalog | swedish_stem| snowball stemmer for swedish language
 pg_catalog | turkish_stem| snowball stemmer for turkish language
(16 rows)


On 26.05.2015 12:09, Albe Laurenz wrote:

Sven R. Kunze wrote:

However, are you sure, I am using snowball? Maybe, I am reading the
documenation wrong:

test= SELECT * FROM ts_debug('german', 'system');
alias   |   description   | token  | dictionaries  | dictionary  
| lexemes
---+-++---+-+- 

  asciiword | Word, all ASCII | system | {german_stem} | german_stem 
| {syst}

(1 row)

test= \dFd german_stem
 List of text search dictionaries
Schema   |Name | Description
+-+--
  pg_catalog | german_stem | snowball stemmer for german language
(1 row)


http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html
but it seems as it depends on which packages (ispell, hunspell, 
myspell,

snowball + corresponding languages) my system has installed.

Is there an easy way to determine which of these packages PostgreSQL
uses AND what for?

If you use a standard PostgreSQL distribution, you will have no ispell
dictionary (as the documentation you quote says).
You can always list all dictionaries with \dFd in psql.

Sure. That might be the problem. It occurs to me that stems (if 
detected

as such) should be left alone.
In case a stem is real German word, it should be stemmed to itself 
anyway

If not, it might help not to stem in order to avoid errors.

Yes, but that would mean that you have a way to determine from a string
whether it is a word or a stem or both, and the software does not do 
that.


Yours,
Laurenz Albe



Regards,




--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srku...@tbz-pariv.de
web: www.tbz-pariv.de

Geschäftsführer: Dr. Reiner Wohlgemuth
Sitz der Gesellschaft: Chemnitz
Registergericht: Chemnitz HRB 8543



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [tsvector] to_tsvector called multiple times

2015-05-26 Thread Sven R. Kunze
Thanks, Oleg. Unfortunately, that does not work quite well as German is 
comprised of many compound nouns.


In fact, I discovered that anomaly by searching through a 
domain-specific word table. For example: Waferhandlingsystem. There are 
many '*system' but the PostgreSQL does not allow me to have a suffix; 
only a prefix and only for to_tsquery 
(http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html#TEXTSEARCH-SYNONYM-DICTIONARY).


Is there another possibility?


On 26.05.2015 11:05, Oleg Bartunov wrote:

You can ask http://snowball.tartarus.org/ for stemmer. Meanwhile,
you can have small personal dictionary (before stemmer) with such 
exceptions, for example, use synonym template


system system

Oleg


On Tue, May 26, 2015 at 11:18 AM, Sven R. Kunze srku...@tbz-pariv.de 
mailto:srku...@tbz-pariv.de wrote:


Hi everybody,

the following stemming results made me curious:

select to_tsvector('german', 'systeme');  'system':1
select to_tsvector('german', 'systemes');  'system':1
select to_tsvector('german', 'systems');  'system':1
select to_tsvector('german', 'systemen');  'system':1
select to_tsvector('german', 'system');   'syst':1


First of all, this seems to be a bug in the German stemmer. Where
can I fix it?

Second, and more importantly, as I understand it, the stemmed
version of a word should be considered normalized. That is, all
other versions of that stem should be mapped to it as well. The
interesting problem here is that PostgreSQL maps the stem itself
('system') to a completely different stem ('syst').

Should a stem not remain stable even when to_tsvector is called on
it multiple times?

-- 
Sven R. Kunze

TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srku...@tbz-pariv.de mailto:srku...@tbz-pariv.de
web: www.tbz-pariv.de http://www.tbz-pariv.de

Geschäftsführer: Dr. Reiner Wohlgemuth
Sitz der Gesellschaft: Chemnitz
Registergericht: Chemnitz HRB 8543



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org

mailto:pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general





--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srku...@tbz-pariv.de
web: www.tbz-pariv.de

Geschäftsführer: Dr. Reiner Wohlgemuth
Sitz der Gesellschaft: Chemnitz
Registergericht: Chemnitz HRB 8543



Re: [GENERAL] [tsvector] to_tsvector called multiple times

2015-05-26 Thread Sven R. Kunze

Thanks Albe for that detailed answer.

On 26.05.2015 11:01, Albe Laurenz wrote:

Sven R. Kunze wrote:

the following stemming results made me curious:

select to_tsvector('german', 'systeme');  'system':1
select to_tsvector('german', 'systemes');  'system':1
select to_tsvector('german', 'systems');  'system':1
select to_tsvector('german', 'systemen');  'system':1
select to_tsvector('german', 'system');   'syst':1


First of all, this seems to be a bug in the German stemmer. Where can I
fix it?

As far as I understand, the stemmer is not perfect, it is just a best
effort at German stemming.  It does not have a dictionary of valid German
words, but uses an algorithm based on only the occurring letters.

This web page describes the algorithm:
http://snowball.tartarus.org/algorithms/german/stemmer.html
I guess that the Snowball folks (and PostgreSQL) would be interested
if you could come up with a better algorithm.


Thanks for that hint. I will go to 
https://github.com/snowballstem/snowball/issues and try to explain my 
problem there.


However, are you sure, I am using snowball? Maybe, I am reading the 
documenation wrong: 
http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html 
but it seems as it depends on which packages (ispell, hunspell, myspell, 
snowball + corresponding languages) my system has installed.


Is there an easy way to determine which of these packages PostgreSQL 
uses AND what for?



In this specific case, the stemmer goes wrong because System is a
foreign word whose ending is atypical for German.  The algorithm cannot
distinguish between System and, say, lautem or bestem.


Second, and more importantly, as I understand it, the stemmed version of
a word should be considered normalized. That is, all other versions of
that stem should be mapped to it as well. The interesting problem here
is that PostgreSQL maps the stem itself ('system') to a completely
different stem ('syst').

Should a stem not remain stable even when to_tsvector is called on it
multiple times?

That's a possible position, but consider that a stem is not necessarily
a valid German word.  If you treat it as a German word (by stemming it),
the results might not be what you desire.

For example:

test= select to_tsvector('german', 'linsen');
  to_tsvector
-
  'lins':1
(1 row)

test= select to_tsvector('german', 'lins');
  to_tsvector
-
  'lin':1
(1 row)


Sure. That might be the problem. It occurs to me that stems (if detected 
as such) should be left alone.

In case a stem is real German word, it should be stemmed to itself anyway
If not, it might help not to stem in order to avoid errors.


I guess that your real problem here is that a search for system
will not find systeme, which is indeed unfortunate.
But until somebody can come up with a better stemming algorithm, cases
like that can always occur.

Yours,
Laurenz Albe
This might pose a problem in the future of course. Thanks for pointing 
this out as well.


Regards,

--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srku...@tbz-pariv.de
web: www.tbz-pariv.de

Geschäftsführer: Dr. Reiner Wohlgemuth
Sitz der Gesellschaft: Chemnitz
Registergericht: Chemnitz HRB 8543



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [tsvector] to_tsvector called multiple times

2015-05-26 Thread Albe Laurenz
Sven R. Kunze wrote:
 However, are you sure, I am using snowball? Maybe, I am reading the
 documenation wrong:

test= SELECT * FROM ts_debug('german', 'system');
   alias   |   description   | token  | dictionaries  | dictionary  | lexemes
---+-++---+-+-
 asciiword | Word, all ASCII | system | {german_stem} | german_stem | {syst}
(1 row)

test= \dFd german_stem
List of text search dictionaries
   Schema   |Name | Description
+-+--
 pg_catalog | german_stem | snowball stemmer for german language
(1 row)

 http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html
 but it seems as it depends on which packages (ispell, hunspell, myspell,
 snowball + corresponding languages) my system has installed.
 
 Is there an easy way to determine which of these packages PostgreSQL
 uses AND what for?

If you use a standard PostgreSQL distribution, you will have no ispell
dictionary (as the documentation you quote says).
You can always list all dictionaries with \dFd in psql.

 Sure. That might be the problem. It occurs to me that stems (if detected
 as such) should be left alone.
 In case a stem is real German word, it should be stemmed to itself anyway
 If not, it might help not to stem in order to avoid errors.

Yes, but that would mean that you have a way to determine from a string
whether it is a word or a stem or both, and the software does not do that.

Yours,
Laurenz Albe

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [tsvector] to_tsvector called multiple times

2015-05-26 Thread Albe Laurenz
Sven R. Kunze wrote:
 Maybe, I have difficulties to understand the relationship/dependencies
 between all these 'maybe' available dictionary/parser/stemmer packages.
 
 What happens if I install all packages for a single language? (hunspell,
 myspell, ispell, snowball)
 
 Are they complementary? Do they replace each other?

They are all dictionaries.

The text search configuration determines which dictionaries get
applied to which kinds of words in which order.

So if you introduce a new dictionary, you either have to modify
an existing configuration or efine a new one to use it.

Yours,
Laurenz Albe

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [tsvector] to_tsvector called multiple times

2015-05-26 Thread Sven R. Kunze

I think I understand now.

Thus, the issue at hand could (maybe) be solved by passing words first 
to one of those more elaborate dictionaries (myspell, hunspell or 
ispell) and if still necessary then to snowball.


Did I get this right?


On 26.05.2015 13:38, Albe Laurenz wrote:

Sven R. Kunze wrote:

Maybe, I have difficulties to understand the relationship/dependencies
between all these 'maybe' available dictionary/parser/stemmer packages.

What happens if I install all packages for a single language? (hunspell,
myspell, ispell, snowball)

Are they complementary? Do they replace each other?

They are all dictionaries.

The text search configuration determines which dictionaries get
applied to which kinds of words in which order.

So if you introduce a new dictionary, you either have to modify
an existing configuration or efine a new one to use it.

Yours,
Laurenz Albe


Regards,

--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srku...@tbz-pariv.de
web: www.tbz-pariv.de

Geschäftsführer: Dr. Reiner Wohlgemuth
Sitz der Gesellschaft: Chemnitz
Registergericht: Chemnitz HRB 8543



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [tsvector] to_tsvector called multiple times

2015-05-26 Thread Albe Laurenz
Sven R. Kunze wrote:
 I think I understand now.
 
 Thus, the issue at hand could (maybe) be solved by passing words first
 to one of those more elaborate dictionaries (myspell, hunspell or
 ispell) and if still necessary then to snowball.
 
 Did I get this right?

I have never experimented with ispell dictionaries, so I don't know
if they replace a snowball dictionary or are used in addition to it.

Yours,
Laurenz Albe

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general