The plugin uses collation to identify characters which are equivalent. It does far more than simple replacement/folding, so sometimes the sort order matters.

http://en.wikipedia.org/wiki/Collation
http://userguide.icu-project.org/transforms/normalization

Take a look at the plugin's tests to figure out how it is used. I only work
with English/Mandarin, so I do not know how useful it is with Dutch.

https://github.com/elasticsearch/elasticsearch-analysis-icu/tree/master/src/test/java/org/elasticsearch/index/analysis

Cheers,

Ivan

On Tue, Aug 19, 2014 at 12:46 AM, Matthias Hogerheijde <[email protected]> wrote:

> Thanks for your reply. I see that I didn't fully understand that
> CharFilters are run first, which makes it logical to special-case the
> different cases. I was originally thrown off the scent because searching
> with an uppercase 'Y' worked, and I thought that the lowercase filter was
> not applied to the 'Y', but now I see that searching for a 'y' will cause
> the mapper to search for 'ij' instead.
>
> I don't understand the full extent of the ICU analysers, but it seems to
> me that in our case this is semantically different, since we regard 'Y' and
> 'IJ' as different letters? (Note that we actually regard 'ij' to be a
> single character.) It's not like removing the accents from 'ä', or
> transcribing a Cyrillic number into its Roman equivalent, or am I wrong in
> that regard?
>
> Regards,
> Matthias
>
> On Tuesday, August 19, 2014 6:37:29 AM UTC+2, Ivan Brusic wrote:
>
>> Char filters are applied before the text is tokenized, and therefore they
>> are applied before the "normal" filters are used, which is why they are a
>> separate class of filter. With Lucene, the order is:
>>
>> char filters -> tokenizer -> filters
>>
>> Have you looked into the ICU analyzer?
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-icu-plugin.html
>>
>> I have no idea how well it works with Dutch.
>>
>> Cheers,
>>
>> Ivan
>>
>> On Mon, Aug 18, 2014 at 2:14 AM, Matthias Hogerheijde <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> We're using Elasticsearch with an Analyzer to map the `y` character to
>>> `ij` (*char_filter* named "char_mapper"), since in Dutch these two are
>>> "somewhat" interchangeable. We're also using a *lowercase filter*.
>>>
>>> This is the configuration:
>>>
>>> {
>>>   "analysis": {
>>>     "analyzer": {
>>>       "index": {
>>>         "type": "custom",
>>>         "tokenizer": "standard",
>>>         "filter": [
>>>           "lowercase",
>>>           "synonym_twoway",
>>>           "standard",
>>>           "asciifolding"
>>>         ],
>>>         "char_filter": [
>>>           "char_mapper"
>>>         ]
>>>       },
>>>       "index_prefix": {
>>>         "type": "custom",
>>>         "tokenizer": "standard",
>>>         "filter": [
>>>           "lowercase",
>>>           "synonym_twoway",
>>>           "standard",
>>>           "asciifolding",
>>>           "prefixes"
>>>         ],
>>>         "char_filter": [
>>>           "char_mapper"
>>>         ]
>>>       },
>>>       "search": {
>>>         "alias": [
>>>           "default"
>>>         ],
>>>         "type": "custom",
>>>         "tokenizer": "standard",
>>>         "filter": [
>>>           "lowercase",
>>>           "synonym",
>>>           "synonym_twoway",
>>>           "standard",
>>>           "asciifolding"
>>>         ],
>>>         "char_filter": [
>>>           "char_mapper"
>>>         ]
>>>       },
>>>       "postal_code": {
>>>         "tokenizer": "keyword",
>>>         "filter": [
>>>           "lowercase"
>>>         ]
>>>       }
>>>     },
>>>     "tokenizer": {
>>>       "standard": {
>>>         "stopwords": []
>>>       }
>>>     },
>>>     "filter": {
>>>       "synonym": {
>>>         "type": "synonym",
>>>         "synonyms": [
>>>           "st => sint",
>>>           "jp => jan pieterszoon",
>>>           "mh => maarten harpertszoon"
>>>         ]
>>>       },
>>>       "synonym_twoway": {
>>>         "type": "synonym",
>>>         "synonyms": [
>>>           "den haag, s gravenhage",
>>>           "den bosch, s hertogenbosch"
>>>         ]
>>>       },
>>>       "prefixes": {
>>>         "type": "edgeNGram",
>>>         "side": "front",
>>>         "min_gram": 1,
>>>         "max_gram": 30
>>>       }
>>>     },
>>>     "char_filter": {
>>>       "char_mapper": {
>>>         "type": "mapping",
>>>         "mappings": [
>>>           "y => ij"
>>>         ]
>>>       }
>>>     }
>>>   }
>>> }
>>>
>>> When indexing cities, we're using this mapping:
>>>
>>> {
>>>   "properties": {
>>>     "city": {
>>>       "type": "multi_field",
>>>       "fields": {
>>>         "city": {
>>>           "type": "string"
>>>         },
>>>         "prefix": {
>>>           "type": "string",
>>>           "boost": 0.5,
>>>           "index_analyzer": "index_prefix"
>>>         }
>>>       }
>>>     },
>>>     "province_code": {
>>>       "type": "string"
>>>     },
>>>     "unique_name": {
>>>       "type": "boolean"
>>>     },
>>>     "point": {
>>>       "type": "geo_point"
>>>     },
>>>     "search_terms": {
>>>       "type": "multi_field",
>>>       "fields": {
>>>         "search_terms": {
>>>           "type": "string"
>>>         },
>>>         "prefix": {
>>>           "boost": 0.5,
>>>           "index_analyzer": "index_prefix",
>>>           "type": "string"
>>>         }
>>>       }
>>>     }
>>>   },
>>>   "search_analyzer": "search",
>>>   "index_analyzer": "index"
>>> }
>>>
>>> When we index all the (Dutch) cities from our data source, there are
>>> cities starting with both `IJ` and `Y`. (For example, these city names
>>> exist: *IJssel*, *IJsselstein*, *Yerseke* and *Ysselsteyn*.) It seems
>>> that these characters are not lowercased before the char_mapping is
>>> applied.
>>>
>>> Querying the index results in:
>>>
>>> /top/city/_search?q=ijsselstein -> works, returns the document for IJsselstein
>>> /top/city/_search?q=Ijsselstein -> works, returns the document for IJsselstein
>>> /top/city/_search?q=yerseke     -> *doesn't* work, returns nothing
>>> /top/city/_search?q=Yerseke     -> *does* work, returns the document for Yerseke
>>> /top/city/_search?q=YsselsteYn  -> *doesn't* work, returns nothing
>>> /top/city/_search?q=Ysselsteyn  -> *does* work, returns the document for Ysselsteyn
>>>
>>> Changing the case of any other letter doesn't affect the results.
>>>
>>> I've worked around this issue by adding the mapping "Y => ij", i.e.:
>>>
>>> "char_filter": {
>>>   "char_mapper": {
>>>     "type": "mapping",
>>>     "mappings": [
>>>       "y => ij",
>>>       "Y => ij"
>>>     ]
>>>   }
>>> }
>>>
>>> This solves the problem, but I'd rather see the lowercase filter
>>> applied before the mapping, or be able to make the order explicit. Is
>>> there any stance on this issue? Or is this intended behaviour?
>>>
>>> Regards,
>>> Matthias Hogerheijde
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/elasticsearch/c60de452-2a3f-42f7-a677-956f81ecec17%40googlegroups.com
>>> For more options, visit https://groups.google.com/d/optout.
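
For anyone reading this thread later, the ordering discussed above (char filters -> tokenizer -> filters) can be imitated with a rough Python sketch. It is not how Lucene or Elasticsearch implement analysis. The function names are made up for illustration, the tokenizer is a crude regex stand-in for the standard tokenizer, the replacement is a simple sequential string replace, and the synonym and asciifolding filters are left out. It only aims to show why a case-sensitive "y => ij" mapping that runs before lowercasing gives the search results reported in the original question.

```python
import re

# Hypothetical stand-in for the "char_mapper" mapping char filter in the thread.
CHAR_MAPPINGS = {"y": "ij"}


def char_filter(text, mappings):
    """Apply literal, case-sensitive replacements to the raw text."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def tokenize(text):
    """Crude stand-in for the standard tokenizer: split on non-word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Stand-in for the lowercase token filter."""
    return [token.lower() for token in tokens]


def analyze(text, mappings=CHAR_MAPPINGS):
    """Run the stages in the order from the thread:
    char filters -> tokenizer -> filters."""
    return lowercase(tokenize(char_filter(text, mappings)))


def matches(indexed_text, query_text, mappings=CHAR_MAPPINGS):
    """A query matches here if both sides analyze to the same tokens."""
    return analyze(indexed_text, mappings) == analyze(query_text, mappings)


if __name__ == "__main__":
    for query in ["yerseke", "Yerseke"]:
        print(query, analyze(query), matches("Yerseke", query))

    # The workaround from the thread: map both cases explicitly.
    fixed = {"y": "ij", "Y": "ij"}
    print(matches("Yerseke", "yerseke", fixed))
```

In this sketch the uppercase 'Y' in the indexed name is never mapped, so it is only lowercased to 'y', while a lowercase 'y' in the query is rewritten to 'ij' before lowercasing happens. Adding "Y => ij" makes both sides go through the same rewrite, which is the workaround described in the original question.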
