[jira] [Commented] (LUCENE-10607) NRTSuggesterBuilder overflows when extending the input

ChasenYang (Jira) Thu, 09 Jun 2022 05:49:35 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552189#comment-17552189
 ]


ChasenYang commented on LUCENE-10607:
-------------------------------------

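Thanks for your answer. I will try to create a pull request. I also have some other questions:

1. Still in the index-building path of NRTSuggesterBuilder: for a single fixed `analyzed` term, when entries is very long (3,000,000), fstCompiler.add() becomes very slow, which in turn makes the segment merge slow. Merging the suggest data alone takes about 5000 seconds. The following example reproduces it: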

{code:java}
// Build an NRTSuggesterBuilder, then write 3,000,000 outputs for the same input.
// Writing is slow when the weights are increasing.
long start = System.currentTimeMillis();
NRTSuggesterBuilder builder = new NRTSuggesterBuilder();
String analyWord = "TestStr";
String oriWord = "OriStr";
// BytesRef(CharSequence) encodes as UTF-8, independent of the platform default charset.
BytesRef awRef = new BytesRef(analyWord);
BytesRef oriWordRef = new BytesRef(oriWord);
builder.startTerm(awRef);

// Add 3,000,000 entries with the same surface form, all under the analyzed form awRef.
for (int i = 0; i < 3000000; i++) {
    builder.addEntry(i, oriWordRef, i);
}
long end = System.currentTimeMillis();
System.out.println("add to entry cost " + (end - start) + "ms");

builder.finishTerm();
long finishTermEnd = System.currentTimeMillis();
System.out.println("finish term cost " + (finishTermEnd - end) + "ms");
System.out.println("build 3000000 with oriword:" + oriWord + ",aword:" + analyWord
        + ", cost " + (finishTermEnd - start) + "ms");
{code}
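2. In the finishTerm implementation of NRTSuggesterBuilder, entries is a PriorityQueue ordered by weight, but it is traversed with a for-each loop, which does not follow that order. Is there some other reason for doing it this way?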
{code:java}
public void finishTerm() throws IOException {
  int numArcs = 0;
  int numDedupBytes = 1;
  analyzed.grow(analyzed.length() + 1);
  analyzed.setLength(analyzed.length() + 1);
  // Since a for-each loop is used here, does entries need to be a PriorityQueue at all?
  // Could an ArrayList be used instead?
  for (Entry entry : entries) {
    if (numArcs == maxNumArcsForDedupByte(numDedupBytes)) {
      analyzed.setByteAt(analyzed.length() - 1, (byte) (numArcs));
      analyzed.grow(analyzed.length() + 1);
      analyzed.setLength(analyzed.length() + 1);
      numArcs = 0;
      numDedupBytes++;
    }
    analyzed.setByteAt(analyzed.length() - 1, (byte) numArcs++);
    Util.toIntsRef(analyzed.get(), scratchInts);
    fstCompiler.add(scratchInts.get(), outputs.newPair(entry.weight, entry.payload));
  }
  maxAnalyzedPathsPerOutput = Math.max(maxAnalyzedPathsPerOutput, entries.size());
  entries.clear();
}
{code}
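
To illustrate the second question, here is a minimal standalone sketch, assuming entries is a java.util.PriorityQueue. It does not use any Lucene classes; the class name PriorityQueueOrderDemo and its helper methods are made up for this example. It compares a for-each traversal with repeated poll() calls. The JDK documentation only guarantees sorted order for the latter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class PriorityQueueOrderDemo {

  // Hypothetical sample data: weights inserted in no particular order.
  static PriorityQueue<Long> sample() {
    PriorityQueue<Long> queue = new PriorityQueue<>();
    long[] weights = {5L, 1L, 4L, 2L, 3L};
    for (long weight : weights) {
      queue.add(weight);
    }
    return queue;
  }

  // Visits the queue with a for-each loop, the way finishTerm does.
  // The iterator walks the internal heap array, so the order is not guaranteed to be sorted.
  static List<Long> iterate(PriorityQueue<Long> queue) {
    List<Long> seen = new ArrayList<>();
    for (Long weight : queue) {
      seen.add(weight);
    }
    return seen;
  }

  // Drains a copy with poll(), which always removes the smallest remaining element.
  static List<Long> drain(PriorityQueue<Long> queue) {
    PriorityQueue<Long> copy = new PriorityQueue<>(queue);
    List<Long> ordered = new ArrayList<>();
    while (!copy.isEmpty()) {
      ordered.add(copy.poll());
    }
    return ordered;
  }

  public static void main(String[] args) {
    PriorityQueue<Long> queue = sample();
    System.out.println("for-each order: " + iterate(queue));
    System.out.println("poll() order:   " + drain(queue));
  }
}
```

If the loop in finishTerm does not depend on weight order, a plain list would give the same traversal at lower insertion cost. If it does depend on the order, the entries would have to be polled or sorted first. Which of the two is intended is exactly what I am asking about.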

> NRTSuggesterBuilder overflows when extending the input
> ------------------------------------------------------
>
>                 Key: LUCENE-10607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10607
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 9.2
>            Reporter: ChasenYang
>            Priority: Major
>
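> When the suggest module builds an index, it calls finishTerm of NRTSuggesterBuilder to write the suggest index.
> finishTerm calls maxNumArcsForDedupByte to extend analyzed, with the per-byte limit growing as 3, 5, 7, ..., 255.
> When entries is very long (9,000,000), the multiplication inside maxNumArcsForDedupByte overflows during this extension:
>  
> private static int maxNumArcsForDedupByte(int currentNumDedupBytes) {
>   int maxArcs = 1 + (2 * currentNumDedupBytes);
>   if (currentNumDedupBytes > 5) {
>     // When currentNumDedupBytes is 32768 or more, the int multiplication exceeds Integer.MAX_VALUE.
>     maxArcs *= currentNumDedupBytes;
>   }
>   return Math.min(maxArcs, 255);
> }
>  
> Also, when extending, would it be possible to use a fixed 4 bytes and extend in order, instead of the 3, 5, 7, ..., 255 scheme?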
>  
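
The overflow described in the issue can be reproduced without Lucene. The sketch below copies the quoted method as maxNumArcsOriginal and adds a hypothetical variant, maxNumArcsSafe, that does the multiplication in long before clamping to 255. This variant is only an illustration of one possible fix, not the change that was or will be committed. The helper dedupBytesUsed replays only the counting logic of the finishTerm loop (using the safe variant) to show how many dedup bytes a given number of entries needs. The class and method names are made up for this example.

```java
public class DedupByteOverflowDemo {

  // Copy of the method quoted in the issue: the int multiplication can overflow.
  static int maxNumArcsOriginal(int currentNumDedupBytes) {
    int maxArcs = 1 + (2 * currentNumDedupBytes);
    if (currentNumDedupBytes > 5) {
      maxArcs *= currentNumDedupBytes;
    }
    return Math.min(maxArcs, 255);
  }

  // Hypothetical variant (an assumption, not the Lucene fix):
  // multiply in long so the product cannot wrap around, then clamp to 255.
  static int maxNumArcsSafe(int currentNumDedupBytes) {
    long maxArcs = 1L + (2L * currentNumDedupBytes);
    if (currentNumDedupBytes > 5) {
      maxArcs *= currentNumDedupBytes;
    }
    return (int) Math.min(maxArcs, 255L);
  }

  // Replays only the counters of the finishTerm loop: returns how many
  // dedup bytes are in use after entryCount entries for one analyzed term.
  static int dedupBytesUsed(int entryCount) {
    int numArcs = 0;
    int numDedupBytes = 1;
    for (int i = 0; i < entryCount; i++) {
      if (numArcs == maxNumArcsSafe(numDedupBytes)) {
        numArcs = 0;
        numDedupBytes++;
      }
      numArcs++;
    }
    return numDedupBytes;
  }

  public static void main(String[] args) {
    // Just below the threshold the original still clamps to 255.
    System.out.println("original(32767) = " + maxNumArcsOriginal(32767));
    // At 32768 the product no longer fits in an int and the result is negative.
    System.out.println("original(32768) = " + maxNumArcsOriginal(32768));
    System.out.println("safe(32768)     = " + maxNumArcsSafe(32768));
    System.out.println("dedup bytes for 3000000 entries: " + dedupBytesUsed(3_000_000));
    System.out.println("dedup bytes for 9000000 entries: " + dedupBytesUsed(9_000_000));
  }
}
```

With a negative limit, the check numArcs == maxNumArcsForDedupByte(numDedupBytes) in finishTerm can never be true again, so no further dedup byte is appended. The counting helper also shows why 3,000,000 entries stay below the 32768 threshold while 9,000,000 entries go past it.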



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
