[jira] [Commented] (LUCENE-7397) Inefficient FVhighlighting when set many HighlightedField.

donghyun Kim (JIRA) Tue, 26 Jul 2016 17:16:03 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-7397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394825#comment-15394825 ]


donghyun Kim commented on LUCENE-7397:
--------------------------------------



During the highlight phase, the highlighter (in this case, FVH) is called for 
each hit document via highlight(highlighterContext).
FastVectorHighlighter.java loops over each requested 'HighlightedField' and 
calls getBestFragments.
getBestFragments receives hitContext.docId() as parameter[2], which is used to 
get the term vectors of the doc.
The call finally reaches 
org.apache.lucene.search.vectorhighlight.FieldTermStack.java, which reads the 
term vectors of the doc.

The problem is that the getBestFragments call for every highlighted field 
executes
 final Fields vectors = reader.getTermVectors(docId); 

This seems inefficiently slow when the highlighted results include a big 
document and many highlighted fields are requested: the whole doc's term 
vectors are read again for every highlighted field.
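One possible way to avoid the repeated read, shown only as a minimal sketch: remember the result for the most recent docId and reuse it for the following fields of the same doc. Everything named here (TermVectorCacheSketch, PerDocCache, the loader) is hypothetical and is not Lucene or Elasticsearch code; the loader merely stands in for reader.getTermVectors(docId), and the payload is a placeholder string.

```java
import java.util.function.IntFunction;

/** Hypothetical sketch; none of these names exist in Lucene or Elasticsearch. */
final class TermVectorCacheSketch {

    /** Remembers the value loaded for the most recent docId only. Not thread-safe. */
    static final class PerDocCache<V> {
        private final IntFunction<V> loader; // stand-in for reader.getTermVectors(docId)
        private boolean loaded = false;
        private int cachedDocId;
        private V cachedValue;

        PerDocCache(IntFunction<V> loader) {
            this.loader = loader;
        }

        V get(int docId) {
            if (!loaded || cachedDocId != docId) {
                cachedValue = loader.apply(docId); // the expensive read happens here
                cachedDocId = docId;
                loaded = true;
            }
            return cachedValue;
        }
    }

    public static void main(String[] args) {
        int[] reads = {0}; // counts how often the simulated read runs
        PerDocCache<String> cache = new PerDocCache<>(docId -> {
            reads[0]++;
            return "term vectors of doc " + docId; // placeholder payload
        });

        // A few of the highlighted fields from the query, all for the same doc.
        String[] fields = {"fileName_ko", "fileName_en", "contents_ko", "contents_ko.ngram"};
        for (String field : fields) {
            String vectors = cache.get(12538);
            System.out.println(field + " -> " + vectors);
        }
        System.out.println("simulated reads: " + reads[0]);
    }
}
```

The sketch keeps a single entry because the highlighter handles one doc at a time, so nothing larger needs to stay in memory. In a real change, each field would presumably pick its own terms out of the shared Fields object; where such a cache would live in the highlighter is left open here.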

My testing machine:
quad core 1.87 GHz,
8 GB memory,
spinning disk.
ES-1.5.2 (the relevant code had not changed when I checked)

Example,
my query :
`
{
"from" : 0,
"size" : 20,

"query" : {
"query about highlighted field and more"},
"explain" : false,
"fields" : [ "highlight", "fileRevision", "ownerNameUnigram", "ownerName", 
"ownerId", "timeLastModified", "size" ],
"sort" : [ {
"_score" : { }
} ],
"highlight" : {
"pre_tags" : [ "" ],
"post_tags" : [ "" ],
"order" : "score",
"fragment_size" : 128,
"number_of_fragments" : 10,
"require_field_match" : true,
"type" : "fvh",
"fields" : [ {
"ownerName" : { }
}, {
"fileName_ko" : { }
}, {
"fileName_en" : { }
}, {
"fileName_id" : { }
}, {
"fileName_es" : { }
}, {
"fileName_zh" : { }
}, {
"fileName_ja" : { }
}, {
"fileName_it" : { }
}, {
"fileName_ru" : { }
}, {
"fileName_pt" : { }
}, {
"fileName_hi" : { }
}, {
"fileName_etc" : { }
}, {
"contents_ko" : { }
}, {
"contents_ko.ngram" : { }
}, {
"contents_en" : { }
}, {
"contents_en.ngram" : { }
}, {
"contents_id" : { }
}, {
"contents_id.ngram" : { }
}, {
"contents_es" : { }
}, {
"contents_es.ngram" : { }
}, {
"contents_zh" : { }
}, {
"contents_zh.ngram" : { }
}, {
"contents_ja" : { }
}, {
"contents_ja.ngram" : { }
}, {
"contents_it" : { }
}, {
"contents_it.ngram" : { }
}, {
"contents_ru" : { }
}, {
"contents_ru.ngram" : { }
}, {
"contents_pt" : { }
}, {
"contents_pt.ngram" : { }
}, {
"contents_hi" : { }
}, {
"contents_hi.ngram" : { }
}, {
"contents_etc" : { }
}, {
"contents_etc.ngram" : { }
} ]
}
}
`

Test
[time taken in millis, per getBestFragments call]
For my doc 12538, getBestFragments took about 20 ms for every field, 
and the total highlight phase took 705 ms.

My mapping is sparse. That means that for 'doc 12538', among the fileName_* 
fields
[fileName_id, fileName_es, fileName_zh, fileName_ja, fileName_it, fileName_ru, 
fileName_pt, fileName_hi, fileName_etc]
only one field is filled with data.
The same is true for the contents_* fields.

Problematic doc 12538: 
CONVOCADOS_TALLER_SALUDMENTAL_JULIO2014.xlsx.txt

[2016-07-26 16:57:04,043][INFO ][root ] [4][FastVectorHighlighter.highlight] 
[22], filedName : fileName_id, docId : 12538
[23], filedName : fileName_id, docId : 12538
[20], filedName : fileName_es, docId : 12538
[21], filedName : fileName_zh, docId : 12538
[21], filedName : fileName_ja, docId : 12538
[28], filedName : fileName_it, docId : 12538
[26], filedName : fileName_ru, docId : 12538
[24], filedName : fileName_pt, docId : 12538
[22], filedName : fileName_hi, docId : 12538
[22], filedName : fileName_etc, docId : 12538
[22], filedName : contents_ko, docId : 12538
[20], filedName : contents_ko.ngram, docId : 12538
[19], filedName : contents_en, docId : 12538
[19], filedName : contents_en.ngram, docId : 12538
[20], filedName : contents_id, docId : 12538
[20], filedName : contents_id.ngram, docId : 12538
[19], filedName : contents_es, docId : 12538
[18], filedName : contents_es.ngram, docId : 12538
[19], filedName : contents_zh, docId : 12538
[19], filedName : contents_zh.ngram, docId : 12538
[19], filedName : contents_ja, docId : 12538
[19], filedName : contents_ja.ngram, docId : 12538
[18], filedName : contents_it, docId : 12538
[18], filedName : contents_it.ngram, docId : 12538
[18], filedName : contents_ru, docId : 12538
[18], filedName : contents_ru.ngram, docId : 12538
[20], filedName : contents_pt.ngram, docId : 12538
[18], filedName : contents_hi, docId : 12538
[18], filedName : contents_hi.ngram, docId : 12538
[18], filedName : contents_etc, docId : 12538
[19], filedName : contents_etc.ngram, docId : 12538
[2016-07-26 16:57:04,654][INFO ][root ] highlight tooks : 705, docId : 12538
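
As a rough cross-check of the log above, the per-field getBestFragments times can be summed and compared with the 705 ms highlight total. This is only a sketch: the class name HighlightTimingSummary is made up for illustration, and the numbers are copied in order from the 31 per-field log lines.

```java
import java.util.Arrays;

/** Hypothetical helper that sums the per-field timings copied from the log above. */
final class HighlightTimingSummary {

    // getBestFragments times in ms, in log order (10 fileName_* lines, then 21 contents_* lines).
    static final int[] FIELD_MILLIS = {
        22, 23, 20, 21, 21, 28, 26, 24, 22, 22,
        22, 20, 19, 19, 20, 20, 19, 18, 19, 19, 19,
        19, 18, 18, 18, 18, 20, 18, 18, 18, 19
    };

    // Total highlight time for doc 12538, taken from the last log line.
    static final int TOTAL_HIGHLIGHT_MILLIS = 705;

    static int sumMillis(int[] values) {
        return Arrays.stream(values).sum();
    }

    public static void main(String[] args) {
        int sum = sumMillis(FIELD_MILLIS);
        double share = 100.0 * sum / TOTAL_HIGHLIGHT_MILLIS;
        System.out.printf("logged fields: %d%n", FIELD_MILLIS.length);
        System.out.printf("sum of getBestFragments: %d ms%n", sum);
        System.out.printf("share of highlight total: %.1f%%%n", share);
    }
}
```

The logged calls add up to 627 ms, so most of the 705 ms highlight time for this doc is spent in the per-field getBestFragments calls.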

I also timed reader.getTermVectors(docId) itself.
I did not log these times in sync with the getBestFragments times above, but 
the rough sequence and the time each call took are visible.
I think the heavily analyzed doc (which has big term vectors) is what impacts 
my query (the run of values around 20 ms).

long tTime = System.currentTimeMillis();
final Fields vectors = reader.getTermVectors(docId);
termVectorTimeLogging("tVectorTime : "+(System.currentTimeMillis() - tTime));

tVectorTime : 1
tVectorTime : 1
tVectorTime : 24
tVectorTime : 24
tVectorTime : 23
tVectorTime : 21
tVectorTime : 20
tVectorTime : 19
tVectorTime : 19
tVectorTime : 48
tVectorTime : 19
tVectorTime : 18
tVectorTime : 18
tVectorTime : 18
tVectorTime : 18
tVectorTime : 18
tVectorTime : 18
tVectorTime : 18
tVectorTime : 19
tVectorTime : 20
tVectorTime : 20
tVectorTime : 20
tVectorTime : 20
tVectorTime : 20
tVectorTime : 20
tVectorTime : 20
tVectorTime : 20
tVectorTime : 20
tVectorTime : 20
tVectorTime : 1
tVectorTime : 1
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 1
tVectorTime : 1
tVectorTime : 1
tVectorTime : 1
tVectorTime : 0
tVectorTime : 0
tVectorTime : 1
tVectorTime : 1
tVectorTime : 1
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 1
tVectorTime : 1
tVectorTime : 1
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 1
tVectorTime : 1
tVectorTime : 1
tVectorTime : 1
tVectorTime : 0
tVectorTime : 1
tVectorTime : 0
tVectorTime : 1
tVectorTime : 1
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 1
tVectorTime : 0
tVectorTime : 0
tVectorTime : 0
tVectorTime : 1

> Inefficient FVhighlighting when set many HighlightedField.
> ----------------------------------------------------------
>
>                 Key: LUCENE-7397
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7397
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>         Environment: CentOS release 6.4 (Final)
> quad core 1.87
> 8gb memory
> Elasticsearch - 1.5 with lucene 4.10.4
>            Reporter: donghyun Kim
>            Priority: Minor
>
> When highlighting results, the getBestFragment method in 
> org.apache.lucene.search.vectorhighlight.FastVectorHighlighter.java 
> (through FieldTermStack.java) reads the term vectors for every 
> highlighted field.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
