gritmind commented on a change in pull request #576: LUCENE-8631:
Longest-Matching for User words in Nori Tokenizer
URL: https://github.com/apache/lucene-solr/pull/576#discussion_r264109081
##########
File path:
lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/KoreanTokenizer.java
##########
@@ -651,20 +662,37 @@ private void parse() throws IOException {
if (userFST != null) {
userFST.getFirstArc(arc);
int output = 0;
+ int posAheadMax = 0;
+ int output_posAheadMax = 0;
+ int arcFinalOut_posAheadMax = 0;
+
for(int posAhead=pos;;posAhead++) {
final int ch = buffer.get(posAhead);
- if (ch == -1) {
- break;
- }
- if (userFST.findTargetArc(ch, arc, arc, posAhead == pos, userFSTReader) == null) {
- break;
+ if (ch == -1 || userFST.findTargetArc(ch, arc, arc, posAhead == pos, userFSTReader) == null) {
+ if (anyMatches
+ && posAheadMax > global_posAheadMax) {
Review comment:
First, I really appreciate your feedback (@jimczi). I updated my code based
on your comments.
[Here](https://github.com/apache/lucene-solr/pull/576/commits/0e8dd66a521eb479dca79a98c1c8d321e24addbe)
is the new version.
## 1.
Why do we need the variable `global_posAheadMax` and the condition
`posAheadMax > global_posAheadMax`?
Without them, longest matching may fail when the first characters of the
query are alphabetic and/or numeric and the rest can be decomposed into two or
more shorter (subset) user words.
Let me explain with a concrete example.
Let's assume we have user words such as {'대한민국', '날씨', '대한민국날씨', '21세기대한민국',
'세기'}. When the query is '대한민국날씨', the analysis result is '대한민국날씨', so longest
matching is applied successfully. However, when the query is '21세기대한민국', the
analysis result is ['21', '세기', '대한민국']; longest matching does not work in this
case.
Why?
Numeric and alphabetic characters (unlike hangul characters) always become
unknown-word tokens (which can be part of a Viterbi search path) even when they
also appear in user words and/or known words. (This might be related to the
condition `characterDefinition.isInvoke(firstCharacter)`.)
Given this fact, '21' in '21세기대한민국' can become a token, so the token combination
['21', '세기', '대한민국'] is generated. Since that combination contains two user
words, '세기' and '대한민국', its total cost is far lower than the cost of the longest
token ['21세기대한민국'] (each user word costs -100,000). So ['21', '세기', '대한민국'] wins
against ['21세기대한민국'].
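Roughly, the arithmetic looks like the following minimal sketch. Only the
-100,000 per user word comes from the user dictionary; the unknown-word cost for
'21' is a made-up placeholder and connection costs are ignored.
```java
// Sketch of the cost comparison; the unknown-word cost is hypothetical.
public class CostSketch {
  public static void main(String[] args) {
    int userWordCost = -100_000;  // cost of every user-dictionary word
    int unknownCost = 2_000;      // hypothetical cost of the unknown token '21'

    // ['21', '세기', '대한민국']: one unknown token + two user words
    int splitPath = unknownCost + 2 * userWordCost;  // about -198,000

    // ['21세기대한민국']: a single user word
    int longestPath = userWordCost;                  // -100,000

    // The split path has the lower (better) total cost, so the Viterbi
    // search prefers it unless the subset user words are blocked.
    System.out.println(splitPath < longestPath);     // true
  }
}
```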
To avoid this, we ensure that once a user word (e.g. '21세기대한민국') has been added,
its subset words (e.g. '세기', '대한민국') cannot be added. To implement this, we
always remember the maximum end position of the user words added so far in
`global_posAheadMax` and use the condition `posAheadMax > global_posAheadMax` to
avoid adding subset words.
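Conceptually, the guard works like the following standalone sketch. It is not
the actual `KoreanTokenizer` code: the user-word lookup is done with a
brute-force dictionary scan instead of the user FST, and the printing stands in
for adding a token to the lattice.
```java
import java.util.List;

// Standalone sketch of the longest-matching guard: once a user word ending at
// position 'end' has been added, shorter user words ending at or before 'end'
// are skipped.
public class LongestMatchSketch {
  public static void main(String[] args) {
    List<String> userWords = List.of("대한민국", "날씨", "대한민국날씨", "21세기대한민국", "세기");
    String query = "21세기대한민국";

    int globalPosAheadMax = -1;  // plays the role of global_posAheadMax
    for (int pos = 0; pos < query.length(); pos++) {
      // Longest user word starting at 'pos' (the patch does this via the user FST).
      int posAheadMax = -1;
      for (String w : userWords) {
        if (query.startsWith(w, pos)) {
          posAheadMax = Math.max(posAheadMax, pos + w.length() - 1);
        }
      }
      // Guard: only add the match if it reaches beyond every user word added so far.
      if (posAheadMax >= 0 && posAheadMax > globalPosAheadMax) {
        System.out.println("add user word: " + query.substring(pos, posAheadMax + 1));
        globalPosAheadMax = posAheadMax;
      }
    }
    // Prints only "add user word: 21세기대한민국"; '세기' and '대한민국' are skipped.
  }
}
```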
## 2.
As you mentioned, I updated some code for readability:
* renamed variables to follow Java style
* moved the code after the for loop, with a comment about longest matching
* replaced the 'max_int' function with 'Math.max'
To check the updated code, we tested it again with 100,000 Korean word cases
that include subset words, and all test cases passed.
## 3.
I added two unit test cases for longest matching in `TestKoreanTokenizer`.
(I also added some user words to `userdict.txt`.)
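For reference, such a test could look roughly like the sketch below. The method
name and the `userDictAnalyzer` field are assumptions rather than the actual
committed test; the analyzer is assumed to live in `TestKoreanTokenizer` (which
inherits `assertAnalyzesTo` from `BaseTokenStreamTestCase`) and to use a user
dictionary containing '대한민국', '날씨', '대한민국날씨', '세기' and '21세기대한민국'.
```java
// Rough sketch of longest-matching tests, not the actual committed code.
public void testLongestMatchingUserWords() throws IOException {
  assertAnalyzesTo(userDictAnalyzer, "대한민국날씨",
      new String[]{"대한민국날씨"});   // longest user word wins over '대한민국' + '날씨'
  assertAnalyzesTo(userDictAnalyzer, "21세기대한민국",
      new String[]{"21세기대한민국"}); // subset words '세기'/'대한민국' are not added
}
```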
Thank you :)