[
https://issues.apache.org/jira/browse/LUCENE-7397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394825#comment-15394825
]
donghyun Kim commented on LUCENE-7397:
--------------------------------------
During the highlight phase, each search hit calls the highlighter's (in this
case, FVH) highlight(highlighterContext).
FastVectorHighlighter.java loops over each requested 'HighlightedField' and
calls getBestFragments.
The getBestFragments method receives hitContext.docId() as parameter[2], which
is used to fetch the doc's term vector.
The call finally reaches
org.apache.lucene.search.vectorhighlight.FieldTermStack.java, which reads the
doc's term vector.
The problem is that the getBestFragments call for every highlighted field runs
final Fields vectors = reader.getTermVectors(docId);
This seems inefficiently slow when the highlighted results include a big
document and many highlighted fields, because the whole doc's term vector is
read again for every highlighted field.
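The access pattern described above can be sketched without Lucene. This is only an illustration under stated assumptions: TermVectorReuseSketch, loadTermVectors and both highlight methods are hypothetical names, the map stands in for the Fields object returned by reader.getTermVectors(docId), and the terms are made up. It is not the actual FVH code and not a proposed patch; it only contrasts one load per field with one load per document.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Lucene or Elasticsearch code.
final class TermVectorReuseSketch {
    // Counts how often the (expensive) whole-document load runs.
    static int loadCount = 0;

    // Stand-in for reader.getTermVectors(docId): returns the term vectors
    // of ALL fields of the document, not only the field being highlighted.
    static Map<String, List<String>> loadTermVectors(int docId) {
        loadCount++;
        Map<String, List<String>> vectors = new HashMap<>();
        vectors.put("fileName_es", Arrays.asList("term1", "term2")); // made-up terms
        vectors.put("contents_es", Arrays.asList("term3"));
        return vectors;
    }

    // Pattern described in the comment: every field triggers a full load.
    static int highlightLoadingPerField(int docId, List<String> fields) {
        int fieldsWithVectors = 0;
        for (String field : fields) {
            Map<String, List<String>> vectors = loadTermVectors(docId);
            if (vectors.containsKey(field)) fieldsWithVectors++;
        }
        return fieldsWithVectors;
    }

    // Alternative: load once per document and reuse it for every field.
    static int highlightLoadingOnce(int docId, List<String> fields) {
        Map<String, List<String>> vectors = loadTermVectors(docId);
        int fieldsWithVectors = 0;
        for (String field : fields) {
            if (vectors.containsKey(field)) fieldsWithVectors++;
        }
        return fieldsWithVectors;
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList(
                "fileName_es", "fileName_zh", "contents_es", "contents_zh");
        loadCount = 0;
        highlightLoadingPerField(12538, fields);
        System.out.println("loads, per field: " + loadCount);
        loadCount = 0;
        highlightLoadingOnce(12538, fields);
        System.out.println("loads, once per doc: " + loadCount);
    }
}
```

With a sparse mapping most fields have no term vector for a given doc, so in the per-field pattern most of the loads are paid for fields that end up with nothing to highlight.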
My test machine:
quad-core 1.87 GHz,
8 GB memory,
spinning disk.
ES-1.5.2 (the relevant code had not changed when I looked)
Example,
my query :
`
{
"from" : 0,
"size" : 20,
"query" : {
"query about highlighted field and more"},
"explain" : false,
"fields" : [ "highlight", "fileRevision", "ownerNameUnigram", "ownerName",
"ownerId", "timeLastModified", "size" ],
"sort" : [ {
"_score" : { }
} ],
"highlight" : {
"pre_tags" : [ "" ],
"post_tags" : [ "" ],
"order" : "score",
"fragment_size" : 128,
"number_of_fragments" : 10,
"require_field_match" : true,
"type" : "fvh",
"fields" : [ {
"ownerName" : { }
}, {
"fileName_ko" : { }
}, {
"fileName_en" : { }
}, {
"fileName_id" : { }
}, {
"fileName_es" : { }
}, {
"fileName_zh" : { }
}, {
"fileName_ja" : { }
}, {
"fileName_it" : { }
}, {
"fileName_ru" : { }
}, {
"fileName_pt" : { }
}, {
"fileName_hi" : { }
}, {
"fileName_etc" : { }
}, {
"contents_ko" : { }
}, {
"contents_ko.ngram" : { }
}, {
"contents_en" : { }
}, {
"contents_en.ngram" : { }
}, {
"contents_id" : { }
}, {
"contents_id.ngram" : { }
}, {
"contents_es" : { }
}, {
"contents_es.ngram" : { }
}, {
"contents_zh" : { }
}, {
"contents_zh.ngram" : { }
}, {
"contents_ja" : { }
}, {
"contents_ja.ngram" : { }
}, {
"contents_it" : { }
}, {
"contents_it.ngram" : { }
}, {
"contents_ru" : { }
}, {
"contents_ru.ngram" : { }
}, {
"contents_pt" : { }
}, {
"contents_pt.ngram" : { }
}, {
"contents_hi" : { }
}, {
"contents_hi.ngram" : { }
}, {
"contents_etc" : { }
}, {
"contents_etc.ngram" : { }
} ]
}
}
`
Test
[elapsed time in millis, getBestFragments]
For my doc 12538, getBestFragments took about 20 ms for every field,
and the total highlight phase took 705 ms.
My mapping is sparse. That means that for 'doc 12538', among the fileName_*
fields
[fileName_id , fileName_es, fileName_zh, fileName_ja, fileName_it, fileName_ru,
fileName_pt, fileName_hi, fileName_etc]
only one field is filled with data.
The same is true of the contents_* fields.
Problematic doc 12538 -
CONVOCADOS_TALLER_SALUDMENTAL_JULIO2014.xlsx.txt
[2016-07-26 16:57:04,043][INFO ][root ] [4][FastVectorHighlighter.highlight]
[22], filedName : fileName_id, docId : 12538
[23], filedName : fileName_id, docId : 12538
[20], filedName : fileName_es, docId : 12538
[21], filedName : fileName_zh, docId : 12538
[21], filedName : fileName_ja, docId : 12538
[28], filedName : fileName_it, docId : 12538
[26], filedName : fileName_ru, docId : 12538
[24], filedName : fileName_pt, docId : 12538
[22], filedName : fileName_hi, docId : 12538
[22], filedName : fileName_etc, docId : 12538
[22], filedName : contents_ko, docId : 12538
[20], filedName : contents_ko.ngram, docId : 12538
[19], filedName : contents_en, docId : 12538
[19], filedName : contents_en.ngram, docId : 12538
[20], filedName : contents_id, docId : 12538
[20], filedName : contents_id.ngram, docId : 12538
[19], filedName : contents_es, docId : 12538
[18], filedName : contents_es.ngram, docId : 12538
[19], filedName : contents_zh, docId : 12538
[19], filedName : contents_zh.ngram, docId : 12538
[19], filedName : contents_ja, docId : 12538
[19], filedName : contents_ja.ngram, docId : 12538
[18], filedName : contents_it, docId : 12538
[18], filedName : contents_it.ngram, docId : 12538
[18], filedName : contents_ru, docId : 12538
[18], filedName : contents_ru.ngram, docId : 12538
[20], filedName : contents_pt.ngram, docId : 12538
[18], filedName : contents_hi, docId : 12538
[18], filedName : contents_hi.ngram, docId : 12538
[18], filedName : contents_etc, docId : 12538
[19], filedName : contents_etc.ngram, docId : 12538
[2016-07-26 16:57:04,654][INFO ][root ] highlight tooks : 705, docId : 12538
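As a rough check of how much of the 705 ms the per-field calls account for, the logged values can simply be added up. The sketch below copies the 31 per-field timings from the log above in order; the class and method names are hypothetical and exist only for this illustration.

```java
// Hypothetical helper, only for summing the timings logged above.
final class HighlightTimingSummary {
    // Per-field getBestFragments timings in ms, copied in log order.
    static final int[] FIELD_MILLIS = {
        22, 23, 20, 21, 21, 28, 26, 24, 22, 22,
        22, 20, 19, 19, 20, 20, 19, 18, 19, 19,
        19, 19, 18, 18, 18, 18, 20, 18, 18, 18, 19
    };
    // Total reported by the "highlight tooks" log line.
    static final int TOTAL_HIGHLIGHT_MILLIS = 705;

    static int sum(int[] values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    public static void main(String[] args) {
        int perFieldTotal = sum(FIELD_MILLIS);
        System.out.println("fields logged : " + FIELD_MILLIS.length);
        System.out.println("per-field sum : " + perFieldTotal + " ms");
        System.out.println("share of total: "
                + (100.0 * perFieldTotal / TOTAL_HIGHLIGHT_MILLIS) + " %");
    }
}
```

The 31 logged calls add up to 627 ms, so the bulk of the reported highlight time is spent in the repeated per-field calls, even though only one fileName_* and one contents_* field holds data for this doc.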
And here is how long reader.getTermVectors(docId) took.
I did not log these in sync with the getBestFragments timings, but the rough
sequence and the durations are visible.
I think a heavily analyzed doc (one with big term vectors) is what impacts my
query (the run of values around 20 ms).
long tTime = System.currentTimeMillis();
final Fields vectors = reader.getTermVectors(docId);
termVectorTimeLogging("tVectorTime : "+(System.currentTimeMillis() - tTime));
tVectorTime : 1
tVectorTime : 1
tVectorTime : 24
tVectorTime : 24
tVectorTime : 23
tVectorTime : 21
tVectorTime : 20
tVectorTime : 19
tVectorTime : 19
tVectorTime : 48
tVectorTime : 19
tVectorTime : 18
tVectorTime : 18
tVectorTime : 18
tVectorTime : 18
tVectorTime : 18
tVectorTime : 18
tVectorTime : 18
tVectorTime : 19
tVectorTime : 20
tVectorTime : 20
tVectorTime : 20
tVectorTime : 20
tVectorTime : 20
tVectorTime : 20
tVectorTime : 20
tVectorTime : 20
tVectorTime : 20
tVectorTime : 20
tVectorTime : 1
tVectorTime : 1
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 1
tVectorTime : 1
tVectorTime : 1
tVectorTime : 1
tVectorTime : 0
tVectorTime : 0
tVectorTime : 1
tVectorTime : 1
tVectorTime : 1
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 1
tVectorTime : 1
tVectorTime : 1
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 1
tVectorTime : 1
tVectorTime : 1
tVectorTime : 1
tVectorTime : 0
tVectorTime : 1
tVectorTime : 0
tVectorTime : 1
tVectorTime : 1
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 1
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 1
> Inefficient FVhighlighting when set many HighlightedField.
> ----------------------------------------------------------
>
> Key: LUCENE-7397
> URL: https://issues.apache.org/jira/browse/LUCENE-7397
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Environment: CentOS release 6.4 (Final)
> quad core 1.87
> 8gb memory
> Elasticsearch - 1.5 with lucene 4.10.4
> Reporter: donghyun Kim
> Priority: Minor
>
> when highlighting result
> org.apache.lucene.search.vectorhighlight.FastVectorHighlighter.java
> getBestFragment method ~ FieldTermStack.java read termvector every
> highlighted field.