[ https://issues.apache.org/jira/browse/LUCENE-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kiju Kim updated LUCENE-8532:
-----------------------------
    Description: 
We can reproduce it from Elasticsearch.

When we run the following command:

GET _analyze
{
  "analyzer": "nori",
  "text": "공단시"
}

It returns the following as expected:

{
  "tokens": [
    { "token": "공단", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 },
    { "token": "시", "start_offset": 2, "end_offset": 3, "type": "word", "position": 1 }
  ]
}

But if we run it with "공단시 " (with a trailing space):

GET _analyze
{
  "analyzer": "nori",
  "text": "공단시 "
}

It returns:

{
  "tokens": [
    { "token": "공단", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 },
    { "token": "씨", "start_offset": 2, "end_offset": 3, "type": "word", "position": 1 }
  ]
}

The second token should be "시" instead of "씨", since a trailing space should not change how the preceding text is tokenized.
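
For anyone who wants to re-run the comparison, here is a minimal Python sketch of the reproduction described above. It is not part of the original report. The base URL http://localhost:9200, the use of POST for _analyze, and the helper names build_analyze_request, analyze and token_mismatches are assumptions made for illustration. The two token lists are copied from the responses quoted in this description.

```python
import json
import urllib.request

# Assumption: a local Elasticsearch node with the analysis-nori plugin installed.
ES_URL = "http://localhost:9200"


def build_analyze_request(text, analyzer="nori", base_url=ES_URL):
    """Build (but do not send) an _analyze request for the given text."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text):
    """Send the request to the node and return the list of token dicts."""
    with urllib.request.urlopen(build_analyze_request(text)) as resp:
        return json.load(resp)["tokens"]


def token_mismatches(expected, actual):
    """Return (position, expected_token, actual_token) where tokens differ."""
    by_pos = {t["position"]: t["token"] for t in actual}
    return [
        (t["position"], t["token"], by_pos.get(t["position"]))
        for t in expected
        if by_pos.get(t["position"]) != t["token"]
    ]


# Token lists copied from the two responses in this report.
WITHOUT_SPACE = [
    {"token": "공단", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0},
    {"token": "시", "start_offset": 2, "end_offset": 3, "type": "word", "position": 1},
]
WITH_SPACE = [
    {"token": "공단", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0},
    {"token": "씨", "start_offset": 2, "end_offset": 3, "type": "word", "position": 1},
]

if __name__ == "__main__":
    # Offline check against the responses quoted in the report.
    print(token_mismatches(WITHOUT_SPACE, WITH_SPACE))
```

Against a live node, token_mismatches(analyze("공단시"), analyze("공단시 ")) would perform the same comparison. An empty list would mean the trailing space no longer changes the tokens.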



> nori analyzer issue with trailing space
> ---------------------------------------
>
>                 Key: LUCENE-8532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8532
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 7.4
>         Environment: Elasticsearch version: 6.4.2, Build: 
> default/tar/04711c2/2018-09-26T13:34:09.098244Z, JVM: 1.8.0_131
> Plugins installed: [analysis-nori]
> JVM version:
> java version "1.8.0_131"
> Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
> OS version: Darwin Kijuui-MacBook-Pro.local 17.7.0 Darwin Kernel Version 
> 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64 
> x86_64
>            Reporter: Kiju Kim
>            Priority: Major


