On Tue, 28 Aug 2018 12:40:32 +0700 Aleksandr Parfenov <a.parfe...@postgrespro.ru> wrote:
>On Fri, 24 Aug 2018 18:50:38 +0300
>Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
>>Agreed, backward compatibility is important here. Probably we should
>>leave old dictionaries for that. But I just meant that if we
>>introduce new (better) way of stop words handling and encourage users
>>to use it, then it would look strange if default configurations work
>>the old way...
>
>I agree with Alexander. The only drawback I see is that after addition
>of new dictionaries, there will be 3 dictionaries for each language:
>old one, stop-word filter for the language, and stemmer dictionary.

While working on the new version of the patch, I found an issue in the
proposed syntax. At the beginning of the conversation, there was a
suggestion to split stop word filtering and word normalization. At
this stage of development, we can use a different dictionary for stop
word detection, but if we drop the word, the word counter won't
increase and the stop word will be processed as an unknown word.

Currently, I see two solutions:

1) Keep the old way of stop word filtering. The drawback of this
approach is that word normalization and stop word detection logic stay
mixed inside a single dictionary. This can be solved by using the
'simple' dictionary in accept=false mode as a stop word filter.

2) Add a STOPWORD action alongside KEEP and DROP (DROP is not
implemented in the previous patch, but I think it is good to have both
of them), meaning "increase the word counter but don't add the lexeme
to the vector".

Any suggestions on the issue?

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
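
For illustration only, here is a rough sketch of how the two behaviours discussed above would differ in the resulting positions. It is a toy Python model, not code from the patch or from PostgreSQL: the function name, the stop word set, and the reading of DROP as "no counter increment" are all assumptions made for this example.

```python
# Toy model of the word counter, NOT PostgreSQL code.
# STOP_WORDS and build_vector are hypothetical names used for illustration.
STOP_WORDS = {"the", "a", "of"}


def build_vector(tokens, stop_action):
    """Return {lexeme: [positions]} for a list of tokens.

    stop_action="STOPWORD": a stop word increases the word counter
                            but adds no lexeme to the vector.
    stop_action="DROP":     a stop word is discarded and the counter
                            is not increased (assumed semantics).
    """
    if stop_action not in ("STOPWORD", "DROP"):
        raise ValueError("unknown action: %r" % (stop_action,))

    vector = {}
    position = 0
    for token in tokens:
        word = token.lower()
        if word in STOP_WORDS:
            if stop_action == "STOPWORD":
                position += 1  # count the word, but emit nothing
            continue
        position += 1
        vector.setdefault(word, []).append(position)
    return vector


if __name__ == "__main__":
    tokens = ["the", "quick", "fox"]
    print(build_vector(tokens, "STOPWORD"))
    print(build_vector(tokens, "DROP"))
```

Under these assumptions, STOPWORD leaves a gap where the stop word stood, so the remaining lexemes keep their original positions, while DROP shifts every later lexeme one position to the left.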