[ 
https://issues.apache.org/jira/browse/SOLR-8334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yeo Zheng Lin updated SOLR-8334:
--------------------------------
    Description: 
When I use the JiebaTokenizerFactory to index Chinese characters in Solr, 
the segmentation works fine when I check it with the Analysis function on 
the Solr Admin UI.

However, when I use highlighting in Solr, the highlight tags are not placed 
in the correct position. For example, when I search for 自然环境与企业本身, it highlights 
认<em>为自然环</em><em>境</em><em>与企</em><em>业本</em>身的
Even when I search for an English word such as  responsibility, it highlights  
<em> responsibilit<em>y.

Basically, the highlighting is consistently off by 1 character/space.
This problem only happens in the content field, and not in any other fields.

I've made a minor modification to the code in JiebaSegmenter.java, and 
the highlighting seems to be fine now.

Basically, I created another int called offset2 in the process() method:
int offset2 = 0; 
After that, I changed offset to offset2 in part of the code in the 
process() method. 
The changes are in the attachment below.
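
To illustrate the symptom: a highlighter wraps text using each token's start 
and end offsets, so offsets that are shifted by one move the <em> tags by one 
character. The following is only a minimal sketch in plain Java. It is not the 
Solr highlighter and not the code from JiebaSegmenter.java; the class name 
OffsetHighlightSketch, the method highlight() and the sample sentence are 
hypothetical.

```java
public class OffsetHighlightSketch {

    /**
     * Wraps text[start, end) in <em> tags, the way a highlighter would
     * use a token's start and end offsets. Hypothetical helper.
     */
    static String highlight(String text, int start, int end) {
        StringBuilder out = new StringBuilder();
        out.append(text, 0, start);               // text before the token
        out.append("<em>");
        out.append(text, start, end);             // the token itself
        out.append("</em>");
        out.append(text, end, text.length());     // text after the token
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "social responsibility matters";

        // Correct offsets for "responsibility": start 7, end 21.
        System.out.println(highlight(text, 7, 21));
        // prints: social <em>responsibility</em> matters

        // Same token with both offsets shifted left by one.
        System.out.println(highlight(text, 6, 20));
        // prints: social<em> responsibilit</em>y matters
    }
}
```

The second call gives the same pattern as the report above: the highlight 
starts on the preceding space and ends one character before the end of the word.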


  was:
When I use the JiebaTokenizerFactory to index Chinese characters in Solr, 
the segmentation works fine when I check it with the Analysis function on 
the Solr Admin UI.

However, when I use highlighting in Solr, the highlight tags are not placed 
in the correct position. For example, when I search for 自然环境与企业本身, it highlights 
认<em>为自然环</em><em>境</em><em>与企</em><em>业本</em>身的

Even when I search for an English word such as  responsibility, it highlights  
<em> responsibilit<em>y.

Basically, the highlighting is consistently off by 1 character/space.

This problem only happens in the content field, and not in any other fields.

I've made a minor modification to the code in JiebaSegmenter.java, and 
the highlighting seems to be fine now.

Basically, I created another int called offset2 in the process() method:
int offset2 = 0; 

After that, I changed offset to offset2 in part of the code in the 
process() method. 


> Highlighting content field problem when using JiebaTokenizerFactory
> -------------------------------------------------------------------
>
>                 Key: SOLR-8334
>                 URL: https://issues.apache.org/jira/browse/SOLR-8334
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter, search
>    Affects Versions: 5.3
>         Environment: Windows 8.1, Solr 5.3, ZooKeeper 3.4.6
>            Reporter: Yeo Zheng Lin
>              Labels: patch
>         Attachments: JiebaSegmenter.java
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
