[ https://issues.apache.org/jira/browse/LUCENE-10607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552189#comment-17552189 ]
ChasenYang commented on LUCENE-10607:
-------------------------------------

Thank you for the answer. I will try to create a pull request. I also have two other questions.

1. Still in the index-building path of NRTSuggesterBuilder: for a single fixed `analyzed` form, when entries is very long (3,000,000), fstCompiler.add() becomes very slow, which in turn makes segment merges slow. Merging the suggest data alone takes about 5000 seconds. The following example reproduces it:

{code:java}
// Build an NRTSuggesterBuilder and, for one input, add 3,000,000 outputs
// with increasing weights. Writing them is slow.
long start = System.currentTimeMillis();
NRTSuggesterBuilder builder = new NRTSuggesterBuilder();
String analyWord = "TestStr";
String oriWord = "OriStr";
BytesRef awRef = new BytesRef(analyWord.getBytes());
BytesRef oriWordRef = new BytesRef(oriWord.getBytes());
builder.startTerm(awRef);
// Add 3,000,000 surface forms that all share the same analyzed form (awRef).
for (int i = 0; i < 3000000; i++) {
  builder.addEntry(i, oriWordRef, i);
}
long end = System.currentTimeMillis();
System.out.println("add to entry cost " + (end - start) + "ms");
builder.finishTerm();
long finishTermEnd = System.currentTimeMillis();
System.out.println("finish term cost " + (finishTermEnd - end) + "ms");
System.out.println("build " + " 3000000 with oriword:" + oriWord + ",aword:" + analyWord + ", cost " + (finishTermEnd - start) + "ms");
{code}

2. In the implementation of NRTSuggesterBuilder.finishTerm, entries is a PriorityQueue ordered by weight, but it is traversed with a for-each loop, which does not follow that order. Is there another reason for doing it this way?

{code:java}
public void finishTerm() throws IOException {
  int numArcs = 0;
  int numDedupBytes = 1;
  analyzed.grow(analyzed.length() + 1);
  analyzed.setLength(analyzed.length() + 1);
  // If entries is traversed with a for-each loop here, does it need to be
  // a PriorityQueue at all? Could an ArrayList be used instead?
  for (Entry entry : entries) {
    if (numArcs == maxNumArcsForDedupByte(numDedupBytes)) {
      analyzed.setByteAt(analyzed.length() - 1, (byte) (numArcs));
      analyzed.grow(analyzed.length() + 1);
      analyzed.setLength(analyzed.length() + 1);
      numArcs = 0;
      numDedupBytes++;
    }
    analyzed.setByteAt(analyzed.length() - 1, (byte) numArcs++);
    Util.toIntsRef(analyzed.get(), scratchInts);
    fstCompiler.add(scratchInts.get(), outputs.newPair(entry.weight, entry.payload));
  }
  maxAnalyzedPathsPerOutput = Math.max(maxAnalyzedPathsPerOutput, entries.size());
  entries.clear();
}
{code}

> Overflow in NRTSuggesterBuilder when extending the input
> ---------------------------------------------------------
>
>                 Key: LUCENE-10607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10607
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 9.2
>            Reporter: ChasenYang
>            Priority: Major
>
> When building the index, the suggest module calls NRTSuggesterBuilder.finishTerm to write the suggest index.
> finishTerm calls maxNumArcsForDedupByte to extend analyzed, appending dedup bytes whose arc limits grow as 3, 5, 7, ..., 255.
> When entries is very long (9,000,000), the multiplication inside maxNumArcsForDedupByte overflows during this extension:
>
> private static int maxNumArcsForDedupByte(int currentNumDedupBytes) {
>   int maxArcs = 1 + (2 * currentNumDedupBytes);
>   if (currentNumDedupBytes > 5) {
>     maxArcs *= currentNumDedupBytes;
>     // once currentNumDedupBytes >= 32768, the int product exceeds Integer.MAX_VALUE
>   }
>   return Math.min(maxArcs, 255);
> }
>
> Also, when extending, would it be possible to use a fixed 4 bytes assigned in order, instead of the 3, 5, 7, ..., 255 scheme?
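The overflow described in the issue can be checked in isolation, without building an index. The following is a minimal sketch, not Lucene source: the class name `DedupByteOverflowDemo` and the method `maxNumArcsSafe` are hypothetical names introduced here, and widening to `long` is only one possible way to avoid the wraparound, not necessarily the fix that would be committed. `maxNumArcsForDedupByte` is copied from the issue description.

```java
/** Standalone check of the arc limit quoted in the issue. Not Lucene source. */
final class DedupByteOverflowDemo {

  // Copied from the issue description: all arithmetic is done in int.
  static int maxNumArcsForDedupByte(int currentNumDedupBytes) {
    int maxArcs = 1 + (2 * currentNumDedupBytes);
    if (currentNumDedupBytes > 5) {
      // Wraps around silently once the product exceeds Integer.MAX_VALUE.
      maxArcs *= currentNumDedupBytes;
    }
    return Math.min(maxArcs, 255);
  }

  // Hypothetical variant (an assumption, not the committed fix):
  // widen to long before multiplying so the product cannot wrap.
  static int maxNumArcsSafe(int currentNumDedupBytes) {
    long maxArcs = 1L + (2L * currentNumDedupBytes);
    if (currentNumDedupBytes > 5) {
      maxArcs *= currentNumDedupBytes;
    }
    return (int) Math.min(maxArcs, 255L);
  }

  public static void main(String[] args) {
    // Values on both sides of the cap (12) and of the overflow threshold (32768).
    for (int n : new int[] {5, 6, 12, 32767, 32768}) {
      System.out.println(
          n + " -> int: " + maxNumArcsForDedupByte(n) + ", long: " + maxNumArcsSafe(n));
    }
  }
}
```

For 32767 both versions return 255. For 32768 the product (2 * 32768 + 1) * 32768 is larger than Integer.MAX_VALUE, so the int version returns a negative number, and the comparison `numArcs == maxNumArcsForDedupByte(numDedupBytes)` in finishTerm no longer matches as numArcs counts up from 0.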