[jira] [Commented] (SOLR-5855) re-use document term-vector Fields instance across fields in the DefaultSolrHighlighter

Ere Maijala (JIRA) Mon, 08 Jun 2015 23:52:35 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-5855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578437#comment-14578437
 ]


Ere Maijala commented on SOLR-5855:
-----------------------------------

I was hoping, perhaps naively, that this would help with the highlighter 
performance problems we're having with Solr 5. Unfortunately, this doesn't seem 
to have made a difference. Using hl.usePhraseHighlighter=false has a 
significant effect, but it obviously has downsides and is still much slower 
than 4.10.2.

For what it's worth, here is some additional information:

Timing from Solr 4.10.2 (42.5 million records):

            "process": {
                "time": 1711,
                "query": {
                    "time": 0
                },
                "facet": {
                    "time": 66
                },
                "mlt": {
                    "time": 0
                },
                "highlight": {
                    "time": 708
                },
                "stats": {
                    "time": 0
                },
                "expand": {
                    "time": 0
                },
                "spellcheck": {
                    "time": 433
                },
                "debug": {
                    "time": 503
                }
            }

Timing from Solr 5.2.0 (38.8 million records):

            "process": {
                "time": 10172,
                "query": {
                    "time": 0
                },
                "facet": {
                    "time": 45
                },
                "facet_module": {
                    "time": 0
                },
                "mlt": {
                    "time": 0
                },
                "highlight": {
                    "time": 9310
                },
                "stats": {
                    "time": 0
                },
                "expand": {
                    "time": 0
                },
                "spellcheck": {
                    "time": 345
                },
                "debug": {
                    "time": 472
                }
            }
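
To put the two timing blocks side by side, the sketch below computes the share 
of "process" time spent in "highlight" and the ratio between the two highlight 
times. It is plain arithmetic on the figures quoted above; the class and method 
names are made up for illustration and are not part of Solr. The highlight share 
comes out at roughly 41% on 4.10.2 versus roughly 92% on 5.2.0, with the 
highlight step itself about 13 times slower.

```java
// Hypothetical helper for comparing the debug timings quoted above; not Solr code.
class HighlightShare {
    /** Percentage of the total "process" time spent in one component. */
    static double sharePercent(long componentMs, long processMs) {
        return 100.0 * componentMs / processMs;
    }

    /** How many times larger the newer timing is than the older one. */
    static double ratio(long newerMs, long olderMs) {
        return (double) newerMs / olderMs;
    }

    public static void main(String[] args) {
        long process4 = 1711, highlight4 = 708;    // Solr 4.10.2 figures from above
        long process5 = 10172, highlight5 = 9310;  // Solr 5.2.0 figures from above

        System.out.printf("4.10.2 highlight share: %.1f%%%n",
                sharePercent(highlight4, process4));
        System.out.printf("5.2.0 highlight share: %.1f%%%n",
                sharePercent(highlight5, process5));
        System.out.printf("highlight slowdown: %.1fx%n",
                ratio(highlight5, highlight4));
    }
}
```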

A couple of jstack outputs during the query execution are here: 
http://pastebin.com/8FJiq5R3. The schema and solrconfig are at 
https://github.com/NatLibFi/NDL-VuFind-Solr/tree/master/vufind/biblio/conf. 

> re-use document term-vector Fields instance across fields in the 
> DefaultSolrHighlighter
> ---------------------------------------------------------------------------------------
>
>                 Key: SOLR-5855
>                 URL: https://issues.apache.org/jira/browse/SOLR-5855
>             Project: Solr
>          Issue Type: Improvement
>          Components: highlighter
>    Affects Versions: Trunk
>            Reporter: Daniel Debray
>            Assignee: David Smiley
>             Fix For: 5.2
>
>         Attachments: SOLR-5855-without-cache.patch, 
> SOLR-5855_with_FVH_support.patch, SOLR-5855_with_FVH_support.patch, 
> highlight.patch
>
>
> Hi folks,
> while investigating possible performance bottlenecks in the highlight 
> component, I discovered two places where we can save some CPU cycles.
> Both are in the class org.apache.solr.highlight.DefaultSolrHighlighter.
> The first is in the method doHighlighting (lines 411-417):
> In the loop we try to highlight every field that has been resolved from the 
> params on each document. OK, but why not skip those fields that are not 
> present on the current document? 
> So I changed the code from:
> for (String fieldName : fieldNames) {
>   fieldName = fieldName.trim();
>   if( useFastVectorHighlighter( params, schema, fieldName ) )
>     doHighlightingByFastVectorHighlighter( fvh, fieldQuery, req, docSummaries, docId, doc, fieldName );
>   else
>     doHighlightingByHighlighter( query, req, docSummaries, docId, doc, fieldName );
> }
> to:
> for (String fieldName : fieldNames) {
>   fieldName = fieldName.trim();
>   if (doc.get(fieldName) != null) {
>     if( useFastVectorHighlighter( params, schema, fieldName ) )
>       doHighlightingByFastVectorHighlighter( fvh, fieldQuery, req, docSummaries, docId, doc, fieldName );
>     else
>       doHighlightingByHighlighter( query, req, docSummaries, docId, doc, fieldName );
>   }
> }
> The second place is where we try to retrieve the TokenStream from the 
> document for a specific field.
> Line 472:
> TokenStream tvStream = TokenSources.getTokenStreamWithOffsets(searcher.getIndexReader(), docId, fieldName);
> where
> public static TokenStream getTokenStreamWithOffsets(IndexReader reader, int docId, String field) throws IOException {
>   Fields vectors = reader.getTermVectors(docId);
>   if (vectors == null) {
>     return null;
>   }
>   Terms vector = vectors.terms(field);
>   if (vector == null) {
>     return null;
>   }
>   if (!vector.hasPositions() || !vector.hasOffsets()) {
>     return null;
>   }
>   return getTokenStream(vector);
> }
> Keep in mind that we currently hit the IndexReader n times, where n = 
> requested rows (documents) * requested number of highlight fields.
> In my use case reader.getTermVectors(docId) takes around 150,000 to 250,000 ns 
> on a warm Solr and 1,100,000 ns on a cold Solr.
> If we store the returned Fields vectors in a cache, these lookups take only 
> 25,000 ns.
> I would suggest something like the following code in the 
> doHighlightingByHighlighter method in the DefaultSolrHighlighter class (line 
> 472):
> Fields vectors = null;
> SolrCache termVectorCache = searcher.getCache("termVectorCache");
> if (termVectorCache != null) {
>   vectors = (Fields) termVectorCache.get(Integer.valueOf(docId));
>   if (vectors == null) {
>     vectors = searcher.getIndexReader().getTermVectors(docId);
>     if (vectors != null) termVectorCache.put(Integer.valueOf(docId), vectors);
>   } 
> } else {
>   vectors = searcher.getIndexReader().getTermVectors(docId);
> }
> TokenStream tvStream = TokenSources.getTokenStreamWithOffsets(vectors, fieldName);
> and in the TokenSources class:
> public static TokenStream getTokenStreamWithOffsets(Fields vectors, String field) throws IOException {
>   if (vectors == null) {
>     return null;
>   }
>   Terms vector = vectors.terms(field);
>   if (vector == null) {
>     return null;
>   }
>   if (!vector.hasPositions() || !vector.hasOffsets()) {
>     return null;
>   }
>   return getTokenStream(vector);
> }
> 4000 ms on 1000 docs without cache
> 639 ms on 1000 docs with cache
> 102 ms on 30 docs without cache
> 22 ms on 30 docs with cache
> on an index with 190,000 docs, a numFound of 32,000, and 80 different 
> highlight fields.
> I think queries with only one field to highlight per document do not 
> benefit that much from a cache like this; that's why I think an optional 
> cache would be the best solution there. 
> As far as I can see, the FastVectorHighlighter uses more or less the same 
> approach and could also benefit from this cache.
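
The reuse idea in the quoted description (one term-vector lookup per document, 
shared across all highlight fields) can be sketched without any Solr or Lucene 
types. This is only an illustration under assumptions: LastDocMemo and its 
loader are hypothetical names, the loader stands in for a call such as 
reader.getTermVectors(docId), and it is not the code from the attached patches.

```java
import java.util.function.IntFunction;

/**
 * Remembers the value loaded for the most recently requested document id, so
 * that several fields of the same document share a single lookup.
 * Hypothetical helper for illustration; not part of Solr or Lucene.
 */
final class LastDocMemo<T> {
    private final IntFunction<T> loader; // stand-in for the expensive per-document lookup
    private boolean loaded = false;
    private int lastDocId;
    private T lastValue;

    LastDocMemo(IntFunction<T> loader) {
        this.loader = loader;
    }

    /** Returns the value for docId, calling the loader only when the document changes. */
    T get(int docId) {
        if (!loaded || docId != lastDocId) {
            lastValue = loader.apply(docId); // a null result is remembered as well
            lastDocId = docId;
            loaded = true;
        }
        return lastValue;
    }
}

class LastDocMemoDemo {
    /** Counts loader calls when every document is visited once per highlight field. */
    static int countLoads(int docs, int fieldsPerDoc) {
        int[] loads = {0};
        LastDocMemo<String> memo = new LastDocMemo<>(docId -> {
            loads[0]++;
            return "vectors-for-" + docId;
        });
        for (int doc = 0; doc < docs; doc++) {
            for (int field = 0; field < fieldsPerDoc; field++) {
                memo.get(doc);
            }
        }
        return loads[0];
    }

    public static void main(String[] args) {
        // 30 documents with 80 highlight fields each: one load per document.
        System.out.println(countLoads(30, 80));
    }
}
```

Because the highlighter walks all fields of one document before moving to the 
next, remembering only the last document is enough here and avoids having to 
size or evict a separate cache.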



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
