daixque opened a new pull request, #12915:
URL: https://github.com/apache/lucene/pull/12915

   ### Description
   
   Sutegana (捨て仮名) are the small letters of hiragana and katakana in Japanese. 
Unlike modern text, old Japanese text does not use sutegana. For 
example:
   
   - "ストップウォッチ" is written as "ストツプウオツチ"
   - "ちょっとまって" is written as "ちよつとまつて"
   
   So it is useful to normalize sutegana to normal (uppercase) characters when 
searching a corpus that includes old Japanese text, such as patents, 
legal documents, contract policies, etc.
   
   This pull request introduces 2 token filters:
   
   - JapaneseHiraganaUppercaseFilter for hiragana
   - JapaneseKatakanaUppercaseFilter for katakana
   
   so that users can apply either one separately. Each filter converts all 
sutegana (small characters) into normal kana (uppercase characters) to normalize 
the token.
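
The mapping described above can be sketched in plain Java, without any Lucene dependency. This is only an illustration of the character substitution, not the code in this pull request: the class name `SuteganaSketch`, its method names, and the exact set of characters covered are assumptions made for the example, and the real filters may cover more characters.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: illustrates the sutegana mapping only.
class SuteganaSketch {
    // Small hiragana and their normal-size counterparts (assumed subset).
    private static final Map<Character, Character> HIRAGANA =
        pairs("ぁぃぅぇぉっゃゅょゎゕゖ", "あいうえおつやゆよわかけ");
    // Small katakana and their normal-size counterparts (assumed subset).
    private static final Map<Character, Character> KATAKANA =
        pairs("ァィゥェォッャュョヮヵヶ", "アイウエオツヤユヨワカケ");

    // Build a lookup table from two strings of equal length, position by position.
    private static Map<Character, Character> pairs(String small, String normal) {
        Map<Character, Character> table = new HashMap<>();
        for (int i = 0; i < small.length(); i++) {
            table.put(small.charAt(i), normal.charAt(i));
        }
        return table;
    }

    // Replace every character found in the table; leave all others unchanged.
    private static String apply(Map<Character, Character> table, String term) {
        StringBuilder sb = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            char mapped = table.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    static String uppercaseHiragana(String term) {
        return apply(HIRAGANA, term);
    }

    static String uppercaseKatakana(String term) {
        return apply(KATAKANA, term);
    }

    public static void main(String[] args) {
        System.out.println(uppercaseKatakana("ストップウォッチ"));
        System.out.println(uppercaseHiragana("ちょっとまって"));
    }
}
```

Keeping the two tables separate mirrors the split into two filters: each method touches only its own script, so applying the katakana mapping leaves hiragana sutegana untouched and vice versa.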
   
   ### Why it is needed
   
   This transformation must be done as a token filter. There is already a 
[MappingCharFilter](https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html),
 but a character filter runs before the tokenizer, so using it to normalize 
sutegana would change the tokenization itself, which is not the desired behavior.



