ChenYongkang created LUCENENET-599:
--------------------------------------
Summary: Fine-grained segmentation tools with vectorHighlight will
cause bug
Key: LUCENENET-599
URL: https://issues.apache.org/jira/browse/LUCENENET-599
Project: Lucene.Net
Issue Type: Improvement
Components: Lucene.Net Core, Lucene.Net.Highlighter
Affects Versions: Lucene.Net 4.8.0
Environment: System:
Linux version 4.4.0-62-generic (buildd@lcy01-30) (gcc version 5.4.0 20160609
(Ubuntu 5.4.0-6ubuntu1~16.04.4) )
Lucene version: Lucene 4.8.0-beta00005
Word segmentation tool: jieba
Reporter: ChenYongkang
The text to analyze:
"主体内容来自并且自己加了点基本数据结构数组链表,双向链表"
With fine-grained segmentation enabled, it was tokenized as:
"主体/ 内容/ 来自/ 并且/ 自己/ 加/ 了/ 点/ 基本/ 数据/ 结构/ 数据结构/ 数组/ 链表/ ,/ 双向/ 链表"
I then searched with the query "数据,基本数据结构" and got this exception:
System.ArgumentOutOfRangeException: Index and length must refer to a location within the string.
Parameter name: length
at System.String.Substring(Int32 startIndex, Int32 length)
at Lucene.Net.Search.VectorHighlight.BaseFragmentsBuilder.MakeFragment(StringBuilder buffer, Int32[] index, Field[] values, WeightedFragInfo fragInfo, String[] preTags, String[] postTags, IEncoder encoder) in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net.Highlighter\VectorHighlight\BaseFragmentsBuilder.cs:line 195
at Lucene.Net.Search.VectorHighlight.BaseFragmentsBuilder.CreateFragments(IndexReader reader, Int32 docId, String fieldName, FieldFragList fieldFragList, Int32 maxNumFragments, String[] preTags, String[] postTags, IEncoder encoder) in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net.Highlighter\VectorHighlight\BaseFragmentsBuilder.cs:line 146
at Lucene.Net.Search.VectorHighlight.BaseFragmentsBuilder.CreateFragments(IndexReader reader, Int32 docId, String fieldName, FieldFragList fieldFragList, Int32 maxNumFragments) in C:\BuildAgent\work\b1b63ca15b99dddb\src\Lucene.Net.Highlighter\VectorHighlight\BaseFragmentsBuilder.cs:line 99
The cause is this code in the vector highlighter (shown in its Java form):
  protected String makeFragment( StringBuilder buffer, int[] index, Field[] values, WeightedFragInfo fragInfo,
      String[] preTags, String[] postTags, Encoder encoder ){
    StringBuilder fragment = new StringBuilder();
    final int s = fragInfo.getStartOffset();
    int[] modifiedStartOffset = { s };
    String src = getFragmentSourceMSO( buffer, index, values, s, fragInfo.getEndOffset(), modifiedStartOffset );
    int srcIndex = 0;
    for( SubInfo subInfo : fragInfo.getSubInfos() ){
      for( Toffs to : subInfo.getTermsOffsets() ){
        fragment
          // text between the previous highlight and this term;
          // fails if this term starts before srcIndex
          .append( encoder.encodeText( src.substring( srcIndex, to.getStartOffset() - modifiedStartOffset[0] ) ) )
          .append( getPreTag( preTags, subInfo.getSeqnum() ) )
          .append( encoder.encodeText( src.substring( to.getStartOffset() - modifiedStartOffset[0], to.getEndOffset() - modifiedStartOffset[0] ) ) )
          .append( getPostTag( postTags, subInfo.getSeqnum() ) );
        // the next term is assumed to start at or after this position
        srcIndex = to.getEndOffset() - modifiedStartOffset[0];
      }
    }
    fragment.append( encoder.encodeText( src.substring( srcIndex ) ) );
    return fragment.toString();
  }
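Searching with only "基本数据结构" works fine. The cause is as follows.

Fine-grained segmentation splits "基本数据结构" further into sub-tokens. When we search for "数据,基本数据结构", the token "数据" is highlighted first because, given the segmentation above, "数据" comes before "基本数据结构". "数据" sits at positions (15,16) in the text. After it is highlighted, srcIndex becomes the end position of "数据", that is 16, and the text before the next highlighted term is taken starting from 16. The next term, "基本数据结构", sits at positions (13,18), so the fragment before the highlight is src.substring(16,13), which is clearly invalid.

In other words, the fast highlighter relies on the matched terms appearing one after another in the original text without overlapping. Fine-grained segmentation breaks that assumption and triggers the exception. A search engine often needs fine-grained segmentation, and the fast highlighter would speed up highlighting at search time, but the two do not work well together.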
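The failure described above can be reproduced outside Lucene with a minimal Java sketch. Everything named here (OverlapHighlightSketch, Span, naiveHighlight, mergeOverlaps, the "<b>" tags) is hypothetical and only for illustration. The sketch assumes half-open [start, end) character offsets, so "数据" becomes [15,17) and "基本数据结构" becomes [13,19). naiveHighlight imitates the loop in makeFragment; mergeOverlaps is one possible workaround (sort the hits and merge overlapping ones before highlighting), not the fix adopted by Lucene or Lucene.Net.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class OverlapHighlightSketch {

    /** Hypothetical stand-in for one term hit: half-open [start, end) character offsets. */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** Imitates the makeFragment loop: assumes hits are ordered and never overlap. */
    static String naiveHighlight(String src, List<Span> spans) {
        StringBuilder fragment = new StringBuilder();
        int srcIndex = 0;
        for (Span s : spans) {
            // Throws StringIndexOutOfBoundsException when s.start < srcIndex.
            fragment.append(src.substring(srcIndex, s.start))
                    .append("<b>")
                    .append(src.substring(s.start, s.end))
                    .append("</b>");
            srcIndex = s.end;
        }
        return fragment.append(src.substring(srcIndex)).toString();
    }

    /** Illustrative workaround: sort hits by start offset and merge overlapping ones. */
    static List<Span> mergeOverlaps(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt((Span s) -> s.start));
        List<Span> merged = new ArrayList<>();
        for (Span s : sorted) {
            int last = merged.size() - 1;
            if (last >= 0 && s.start < merged.get(last).end) {
                Span prev = merged.get(last);
                merged.set(last, new Span(prev.start, Math.max(prev.end, s.end)));
            } else {
                merged.add(s);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        String text = "主体内容来自并且自己加了点基本数据结构数组链表,双向链表";
        // Hits in the order the report describes: "数据" first, then "基本数据结构".
        List<Span> hits = Arrays.asList(new Span(15, 17), new Span(13, 19));
        try {
            naiveHighlight(text, hits);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("naive loop failed: " + e.getMessage());
        }
        System.out.println(naiveHighlight(text, mergeOverlaps(hits)));
    }
}
```

With the two hits merged into a single [13,19) span, the loop never has to step backwards. The trade-off is that the shorter hit "数据" is no longer tagged separately inside the longer one.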
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)