Re: Using a char_filter in combination with a lowercase filter

Ivan Brusic Tue, 19 Aug 2014 09:37:37 -0700

The plugin uses collation to identify characters which are equivalent. It
does far more than simple replacement/folding, so sometimes the sort order
matters.


http://en.wikipedia.org/wiki/Collation
http://userguide.icu-project.org/transforms/normalization
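
To make the difference between folding and normalization a bit more concrete, here is a minimal sketch using Python's standard `unicodedata` module. It is not the ICU plugin and does no collation; it only illustrates Unicode normalization, and the helper name `strip_accents` is made up for this example.

```python
import unicodedata


def strip_accents(text):
    """Decompose to NFD and drop combining marks: a crude form of folding."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Accent removal: 'ä' decomposes into 'a' plus a combining diaeresis,
# and the combining mark is then discarded.
folded = strip_accents("ä")

# Compatibility normalization (NFKC) expands the single code point
# U+0133 (the Dutch 'ij' ligature) into the two letters 'i' and 'j'.
ligature_expanded = unicodedata.normalize("NFKC", "\u0133")

# A plain 'y' is a different letter, so normalization leaves it alone.
y_unchanged = unicodedata.normalize("NFKC", "y")

print(folded, ligature_expanded, y_unchanged)
```

Normalization alone would therefore not turn 'y' into 'ij'; that still needs an explicit mapping.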

Take a look at the plugin's tests to figure out how it is used. I only work
with English/Mandarin, so I do not know how useful it is with Dutch.

https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/master/src/test/java/org/elasticsearch/index/analysis
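
As a rough illustration of the ordering described in the quoted messages below (char filters -> tokenizer -> filters), here is a minimal Python sketch. It is not Elasticsearch or Lucene code: `char_filter` and `analyze` are hypothetical stand-ins, whitespace splitting stands in for the standard tokenizer, and only the lowercase filter is modelled.

```python
def char_filter(text, mappings):
    """Apply case-sensitive string replacements to the raw text."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def analyze(text, mappings):
    """Run the three stages in their fixed order."""
    text = char_filter(text, mappings)          # 1. char filter on raw text
    tokens = text.split()                       # 2. stand-in for the tokenizer
    return [token.lower() for token in tokens]  # 3. lowercase token filter


original = {"y": "ij"}               # mapping from the original config
workaround = {"y": "ij", "Y": "ij"}  # mapping with the extra "Y => ij" rule

# With the original mapping, 'Y' reaches the lowercase filter untouched,
# so the indexed term and the lowercase query term differ.
indexed_original = analyze("Yerseke", original)
query_original = analyze("yerseke", original)

# With the workaround, both spellings produce the same term.
indexed_workaround = analyze("Yerseke", workaround)
query_workaround = analyze("yerseke", workaround)

print(indexed_original, query_original)
print(indexed_workaround, query_workaround)
```

Because the char filter sees the text before any token filter runs, the lowercase filter cannot be moved in front of it; mapping both cases explicitly is the simplest way to get the same term on both sides.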

Cheers,

Ivan


On Tue, Aug 19, 2014 at 12:46 AM, Matthias Hogerheijde <
[email protected]> wrote:

> Thanks for your reply. I see that I didn't fully understand that
> CharFilters are run first, which makes it logical to special-case the
> different cases. I was originally thrown off the scent because searching
> with an uppercase 'Y' worked, and I thought that the lowercase filter was
> not applied to the 'Y', but now I see that searching for a 'y' will cause
> the mapper to search for 'ij' instead.
>
> I don't understand the full extent of the ICU analysers, but it seems to
> me that in our case this is semantically different, since we regard 'Y' and
> 'IJ' as different letters? (Note that we actually regard 'ij' to be a
> single character.) It's not like removing the accents from 'ä', or
> transcribing a Cyrillic number into its Roman equivalent, or am I wrong in
> that regard?
>
> Regards,
> Matthias
>
> On Tuesday, August 19, 2014 6:37:29 AM UTC+2, Ivan Brusic wrote:
>
>> Char filters are applied before the text is tokenized, and therefore they
>> are applied before the "normal" filters are used, which is why they are a
>> separate class of filter. With Lucene, the order is:
>>
>> char filters -> tokenizer -> filters
>>
>> Have you looked into the ICU analyzer?
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-icu-plugin.html
>>
>> I have no idea how well it works with Dutch.
>>
>> Cheers,
>>
>> Ivan
>>
>>
>> On Mon, Aug 18, 2014 at 2:14 AM, Matthias Hogerheijde <
>> [email protected]> wrote:
>>
>>> Hi,
>>>
>>> We're using Elasticsearch with an Analyzer to map the `y` character to
>>> `ij` (a *char_filter* named "char_mapper"), since in Dutch these two are
>>> "somewhat" interchangeable. We're also using a *lowercase filter*.
>>>
>>> This is the configuration:
>>>
>>> {
>>>   "analysis": {
>>>     "analyzer": {
>>>       "index": {
>>>         "type": "custom",
>>>         "tokenizer": "standard",
>>>         "filter": [
>>>           "lowercase",
>>>           "synonym_twoway",
>>>           "standard",
>>>           "asciifolding"
>>>         ],
>>>         "char_filter": [
>>>           "char_mapper"
>>>         ]
>>>       },
>>>       "index_prefix": {
>>>         "type": "custom",
>>>         "tokenizer": "standard",
>>>         "filter": [
>>>           "lowercase",
>>>           "synonym_twoway",
>>>           "standard",
>>>           "asciifolding",
>>>           "prefixes"
>>>         ],
>>>         "char_filter": [
>>>           "char_mapper"
>>>         ]
>>>       },
>>>       "search": {
>>>         "alias": [
>>>           "default"
>>>         ],
>>>         "type": "custom",
>>>         "tokenizer": "standard",
>>>         "filter": [
>>>           "lowercase",
>>>           "synonym",
>>>           "synonym_twoway",
>>>           "standard",
>>>           "asciifolding"
>>>         ],
>>>         "char_filter": [
>>>           "char_mapper"
>>>         ]
>>>       },
>>>       "postal_code": {
>>>         "tokenizer": "keyword",
>>>         "filter": [
>>>           "lowercase"
>>>         ]
>>>       }
>>>     },
>>>     "tokenizer": {
>>>       "standard": {
>>>         "stopwords": [
>>>
>>>
>>>         ]
>>>       }
>>>     },
>>>     "filter": {
>>>       "synonym": {
>>>         "type": "synonym",
>>>         "synonyms": [
>>>           "st => sint",
>>>           "jp => jan pieterszoon",
>>>           "mh => maarten harpertszoon"
>>>         ]
>>>       },
>>>       "synonym_twoway": {
>>>         "type": "synonym",
>>>         "synonyms": [
>>>           "den haag, s gravenhage",
>>>           "den bosch, s hertogenbosch"
>>>         ]
>>>       },
>>>       "prefixes": {
>>>         "type": "edgeNGram",
>>>         "side": "front",
>>>         "min_gram": 1,
>>>         "max_gram": 30
>>>       }
>>>     },
>>>     "char_filter": {
>>>       "char_mapper": {
>>>         "type": "mapping",
>>>         "mappings": [
>>>           "y => ij"
>>>         ]
>>>       }
>>>     }
>>>   }
>>> }
>>>
>>> When indexing cities, we're using this mapping:
>>>
>>> {
>>>   "properties": {
>>>     "city": {
>>>       "type": "multi_field",
>>>       "fields": {
>>>         "city": {
>>>           "type": "string"
>>>         },
>>>         "prefix": {
>>>           "type": "string",
>>>           "boost": 0.5,
>>>           "index_analyzer": "index_prefix"
>>>         }
>>>       }
>>>     },
>>>     "province_code": {
>>>       "type": "string"
>>>     },
>>>     "unique_name": {
>>>       "type": "boolean"
>>>     },
>>>     "point": {
>>>       "type": "geo_point"
>>>     },
>>>     "search_terms": {
>>>       "type": "multi_field",
>>>       "fields": {
>>>         "search_terms": {
>>>           "type": "string"
>>>         },
>>>         "prefix": {
>>>           "boost": 0.5,
>>>           "index_analyzer": "index_prefix",
>>>           "type": "string"
>>>         }
>>>       }
>>>     }
>>>   },
>>>   "search_analyzer": "search",
>>>   "index_analyzer": "index"
>>> }
>>>
>>> When we index all the (Dutch) cities from our data source, there are
>>> cities starting with both `IJ` and `Y`. (For example, these city names
>>> exist: *IJssel*, *IJsselstein*, *Yerseke* and *Ysselsteyn*.) It seems
>>> that these characters are not lowercased before the char_mapping is
>>> applied.
>>>
>>> Querying the index results in:
>>>
>>> /top/city/_search?q=ijsselstein -> works, returns the document for
>>> IJsselstein
>>> /top/city/_search?q=Ijsselstein -> works, returns the document for
>>> IJsselstein
>>> /top/city/_search?q=yerseke -> *doesn't* work, returns nothing
>>> /top/city/_search?q=Yerseke -> *does* work, returns the document for
>>> Yerseke
>>> /top/city/_search?q=YsselsteYn -> *doesn't* work, returns nothing
>>> /top/city/_search?q=Ysselsteyn -> *does* work, returns the document for
>>> Ysselsteyn
>>>
>>> Changing the case of any other letter doesn't affect the results.
>>>
>>> I've worked around this issue by adding the mapping "Y => ij", i.e.:
>>>
>>> "char_filter": {
>>>   "char_mapper": {
>>>     "type": "mapping",
>>>     "mappings": [
>>>       "y => ij",
>>>       "Y => ij"
>>>     ]
>>>   }
>>> }
>>>
>>> This solves the problem, but I'd rather have the lowercase filter
>>> applied before the mapping, or be able to make the order explicit. Is
>>> there any stance on this issue? Or is this intended behaviour?
>>>
>>> Regards,
>>> Matthias Hogerheijde
>>>
>>>
>>>
>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>>
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/elasticsearch/c60de452-2a3f-42f7-a677-956f81ecec17%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
