[
https://issues.apache.org/jira/browse/LUCENE-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yeongsu Kim updated LUCENE-8706:
--------------------------------
Description:
I found a bug in the Nori tokenizer.
Let me describe the problem using a concrete example.
Let’s assume we have the dictionaries below.
< userdict_ko.txt >
[ “lg”, “lgtv lg tv”, “tv”, “엘지티비”, “엘지”, “텔레비전”, “티비”, “하이” ]
(“lgtv” is a compound word that decomposes into “lg” and “tv”)
< synonyms.txt >
[ “lgtv,엘지티비”, “lg,엘지”, “tv,텔레비전,티비” ]
Let’s look at the analysis results for the queries below.
* Query1 : lgtv
* Query2 : lg하이tv
* Query3 : lg tv
We will also test each decompound mode: “NONE”, “DISCARD”, and “MIXED”.
Here are the test cases.
* Test 1 (Query 1 + “MIXED”) - the analysis result is [“엘지티비”, “lgtv”, “lg”,
“tv”]
* Test 2 (Query 1 + “NONE”) - the analysis result is [“엘지티비”, “lgtv”]
* Test 3 (Query 1 + “DISCARD”) - the analysis result is [“엘지티비”, “lg”, “tv”]
* Test 4 (Query 2 + “MIXED”) - the analysis result is [“엘지”, “lg”, “하이”,
“텔레비전”, “티비”, “tv”]
* Test 5 (Query 2 + “NONE”) - the analysis result is [“엘지”, “lg”, “하이”,
“텔레비전”, “티비”, “tv”]
* Test 6 (Query 2 + “DISCARD”) - the analysis result is [“엘지”, “lg”, “하이”,
“텔레비전”, “티비”, “tv”]
* Test 7 (Query 3 + “MIXED”) - the analysis result is [“엘지”, “lg”, “텔레비전”,
“티비”, “tv”]
* Test 8 (Query 3 + “NONE”) - the analysis result is [“엘지”, “lg”, “텔레비전”,
“티비”, “tv”]
* Test 9 (Query 3 + “DISCARD”) - the analysis result is [“엘지티비”, “lg”, “tv”]
=> (Here is the problem!!!)
I don’t understand why Test 9 produces that result. It should be
[“엘지”, “lg”, “텔레비전”, “티비”, “tv”], because Query 3 has a space between
“lg” and “tv”.
The only difference between “DISCARD” and the other modes is that “DISCARD”
does not store the compound token (e.g. “lgtv”) in the pending list. Since
“DISCARD” does not keep the compound token, the consecutive tokens “lg”, “tv”
may be interpreted as the compound token “lgtv”. However, many inputs produce
the tokens “lg”, “tv”: for example, “lg tv”, “lg * tv”, “lg /// tv”, etc.
(spaces and punctuation are removed during tokenization). The analysis should
differentiate “lg tv” from “lgtv”.
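To make the suspected ambiguity concrete, here is a minimal, self-contained
sketch. It does not use Lucene at all: `DiscardAmbiguitySketch`,
`tokenizeDiscard`, and `COMPOUNDS` are hypothetical names, and the splitting
rule is only a rough stand-in for what Nori does in “DISCARD” mode. It assumes
the synonym rule “lgtv,엘지티비” is analyzed with the same tokenizer, so its left
side becomes the term sequence “lg”, “tv”.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class DiscardAmbiguitySketch {

    // Hypothetical stand-in for the user dictionary entry "lgtv lg tv".
    static final Map<String, List<String>> COMPOUNDS =
            Collections.singletonMap("lgtv", Arrays.asList("lg", "tv"));

    /**
     * Toy model of DISCARD mode: split on anything that is not a letter or
     * digit, then replace each known compound by its parts and drop the
     * compound itself. Only the terms are kept, not the separators.
     */
    public static List<String> tokenizeDiscard(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.split("[^\\p{L}\\p{N}]+")) {
            if (word.isEmpty()) {
                continue; // leading separators produce an empty first element
            }
            terms.addAll(COMPOUNDS.getOrDefault(word, Collections.singletonList(word)));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Assumption: the synonym rule's left side is analyzed the same way.
        List<String> ruleTerms = tokenizeDiscard("lgtv");
        for (String query : Arrays.asList("lgtv", "lg tv", "lg * tv", "lg /// tv")) {
            List<String> terms = tokenizeDiscard(query);
            System.out.println(query + " -> " + terms
                    + " matches rule: " + terms.equals(ruleTerms));
        }
    }
}
```

In this toy model all four inputs produce the same term sequence, so a filter
that matches on terms alone cannot tell “lg tv” from “lgtv”. Whether the real
synonym filter behaves this way for the same reason is my assumption, not
something I have confirmed in the Lucene code.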
I guess the fix needs to address the communication between the Nori tokenizer
and the general synonym filter.
Thanks.
P.S.
The existing Nori tokenizer throws an error when synonyms are used together
with “MIXED” mode. For this test, I temporarily deleted
`compoundToken.setPositionIncrement(0);` in `KoreanTokenizer.java`, because
`SynonymMap.java` throws IllegalArgumentException when a position increment is
not 1.
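The kind of check that rejects such a token stream can be sketched as follows.
This is a simplified model, not the Lucene source:
`PositionIncrementCheckSketch`, `Token`, and `requireLinearTokens` are
hypothetical names, and which token carries increment 0 in the example stream
is an assumption made only for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class PositionIncrementCheckSketch {

    /** Minimal token model: a term plus its position increment. */
    public static final class Token {
        final String term;
        final int positionIncrement;

        public Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Simplified model of the check: every token of an analyzed synonym
     * entry must advance the position by exactly 1, so stacked tokens
     * (increment 0) are rejected.
     */
    public static void requireLinearTokens(List<Token> tokens) {
        for (Token token : tokens) {
            if (token.positionIncrement != 1) {
                throw new IllegalArgumentException("term " + token.term
                        + " has position increment != 1 (got: "
                        + token.positionIncrement + ")");
            }
        }
    }

    public static void main(String[] args) {
        // Assumed MIXED-like stream: the compound and its first part share a position.
        List<Token> mixedLike = Arrays.asList(
                new Token("lgtv", 1), new Token("lg", 0), new Token("tv", 1));
        try {
            requireLinearTokens(mixedLike);
            System.out.println("accepted");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under this model, any mode that stacks a compound and its parts at the same
position fails the check, which is why I removed the increment of 0 for the
test.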
> Nori with DISCARD mode misunderstands compound words, when synonym expansion
> -----------------------------------------------------------------------------
>
> Key: LUCENE-8706
> URL: https://issues.apache.org/jira/browse/LUCENE-8706
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Reporter: Yeongsu Kim
> Priority: Minor
>