[PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

via GitHub Mon, 11 Dec 2023 06:45:41 -0800


daixque opened a new pull request, #12915:
URL: https://github.com/apache/lucene/pull/12915

### Description

Sutegana (捨て仮名) are the small forms of hiragana and katakana letters in
Japanese. Unlike modern text, old Japanese text does not use sutegana. For
example:

- "ストップウォッチ" is written as "ストツプウオツチ"
- "ちょっとまって" is written as "ちよつとまつて"

So it is useful to normalize sutegana to normal (uppercase) characters when
searching a corpus that includes old Japanese text, such as patents, legal
documents, and contract policies.

This pull request introduces 2 token filters:

- JapaneseHiraganaUppercaseFilter for hiragana
- JapaneseKatakanaUppercaseFilter for katakana

so that users can apply either one independently. Each filter converts every
sutegana (small character) in a token to its normal kana (uppercase character)
to normalize the token.
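
The per-token conversion described above can be sketched in plain Java. This is only an illustration of the idea, not the code in this pull request: the class name `SuteganaSketch`, its method names, and the exact set of mapped characters are assumptions made for the example, and it works on strings rather than on a Lucene `TokenStream`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not the PR's implementation): maps small kana to
// their normal-size counterparts, one character at a time.
final class SuteganaSketch {
  // Illustrative mapping tables; the real filters may cover more characters.
  private static final Map<Character, Character> HIRAGANA =
      pairs("ぁぃぅぇぉっゃゅょゎゕゖ", "あいうえおつやゆよわかけ");
  private static final Map<Character, Character> KATAKANA =
      pairs("ァィゥェォッャュョヮヵヶ", "アイウエオツヤユヨワカケ");

  // Builds a lookup table from two equal-length strings (small -> normal).
  private static Map<Character, Character> pairs(String small, String normal) {
    Map<Character, Character> map = new HashMap<>();
    for (int i = 0; i < small.length(); i++) {
      map.put(small.charAt(i), normal.charAt(i));
    }
    return map;
  }

  // Replaces small hiragana only; all other characters pass through.
  static String uppercaseHiragana(String token) {
    return apply(HIRAGANA, token);
  }

  // Replaces small katakana only; all other characters pass through.
  static String uppercaseKatakana(String token) {
    return apply(KATAKANA, token);
  }

  private static String apply(Map<Character, Character> table, String token) {
    StringBuilder out = new StringBuilder(token.length());
    for (int i = 0; i < token.length(); i++) {
      char c = token.charAt(i);
      out.append(table.getOrDefault(c, c));
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(uppercaseKatakana("ストップウォッチ"));
    System.out.println(uppercaseHiragana("ちょっとまって"));
  }
}
```

Keeping the hiragana and katakana tables separate mirrors the split into two filters: a token passed through only the hiragana conversion keeps its small katakana unchanged, and vice versa.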

### Why it is needed

This transformation must be done as a token filter.
[MappingCharFilter](https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html)
already exists, but a character filter runs before the tokenizer, so using it
to normalize sutegana would change the tokenization itself, which is not the
intended behavior.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

