[ 
https://issues.apache.org/jira/browse/SOLR-5855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Debray updated SOLR-5855:
--------------------------------

    Fix Version/s: 5.0

> Increasing solr highlight performance with caching
> --------------------------------------------------
>
>                 Key: SOLR-5855
>                 URL: https://issues.apache.org/jira/browse/SOLR-5855
>             Project: Solr
>          Issue Type: Improvement
>          Components: highlighter
>    Affects Versions: 5.0
>            Reporter: Daniel Debray
>             Fix For: 5.0
>
>         Attachments: highlight.patch
>
>
> Hi folks,
> while investigating possible performance bottlenecks in the highlight 
> component, I discovered two places where we can save some CPU cycles.
> Both are in the class org.apache.solr.highlight.DefaultSolrHighlighter.
> First, in the method doHighlighting (lines 411-417):
> In the loop we try to highlight every field that has been resolved from the 
> params on each document. OK, but why not skip those fields that are not 
> present in the current document? 
> So I changed the code from:
> for (String fieldName : fieldNames) {
>   fieldName = fieldName.trim();
>   if( useFastVectorHighlighter( params, schema, fieldName ) )
>     doHighlightingByFastVectorHighlighter( fvh, fieldQuery, req, docSummaries, docId, doc, fieldName );
>   else
>     doHighlightingByHighlighter( query, req, docSummaries, docId, doc, fieldName );
> }
> to:
> for (String fieldName : fieldNames) {
>   fieldName = fieldName.trim();
>   if (doc.get(fieldName) != null) {
>     if( useFastVectorHighlighter( params, schema, fieldName ) )
>       doHighlightingByFastVectorHighlighter( fvh, fieldQuery, req, docSummaries, docId, doc, fieldName );
>     else
>       doHighlightingByHighlighter( query, req, docSummaries, docId, doc, fieldName );
>   }
> }
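
The guard in the modified loop can be tried in isolation. The following is a minimal sketch, not Solr code: it assumes a document modeled as a plain Map (a hypothetical stand-in for the Lucene document), and it returns the selected field names instead of calling either highlighter. The class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the doHighlighting loop; a Map models the stored document.
class FieldSkipSketch {

  // Returns the fields that would actually be handed to a highlighter.
  static List<String> fieldsToHighlight(Map<String, String> doc, List<String> fieldNames) {
    List<String> selected = new ArrayList<>();
    for (String fieldName : fieldNames) {
      fieldName = fieldName.trim();
      // Same guard as in the patch: skip fields the document does not contain.
      if (doc.get(fieldName) != null) {
        selected.add(fieldName);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    Map<String, String> doc = Map.of("title", "Solr highlighting", "body", "term vectors");
    List<String> requested = List.of("title ", "body", "summary", "author");
    // Only the fields present on the document survive the guard.
    System.out.println(fieldsToHighlight(doc, requested));
  }
}
```

With 80 requested fields and sparsely populated documents, a guard of this kind removes most of the per-field work before any highlighter is invoked.
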
> The second place is where we try to retrieve the TokenStream from the 
> document for a specific field.
> Line 472:
> TokenStream tvStream = TokenSources.getTokenStreamWithOffsets(searcher.getIndexReader(), docId, fieldName);
> where:
> public static TokenStream getTokenStreamWithOffsets(IndexReader reader, int docId, String field) throws IOException {
>   Fields vectors = reader.getTermVectors(docId);
>   if (vectors == null) {
>     return null;
>   }
>   Terms vector = vectors.terms(field);
>   if (vector == null) {
>     return null;
>   }
>   if (!vector.hasPositions() || !vector.hasOffsets()) {
>     return null;
>   }
>   return getTokenStream(vector);
> }
> Keep in mind that we currently hit the IndexReader n times, where n = 
> requested rows (documents) * requested number of highlight fields.
> In my use case, reader.getTermVectors(docId) takes around 150,000 to 
> 250,000 ns on a warm Solr and 1,100,000 ns on a cold Solr.
> If we store the returned Fields vectors in a cache, these lookups take only 
> 25,000 ns.
> I would suggest something like the following code in the 
> doHighlightingByHighlighter method of the DefaultSolrHighlighter class (line 
> 472):
> Fields vectors = null;
> SolrCache termVectorCache = searcher.getCache("termVectorCache");
> if (termVectorCache != null) {
>   vectors = (Fields) termVectorCache.get(Integer.valueOf(docId));
>   if (vectors == null) {
>     vectors = searcher.getIndexReader().getTermVectors(docId);
>     if (vectors != null) termVectorCache.put(Integer.valueOf(docId), vectors);
>   } 
> } else {
>   vectors = searcher.getIndexReader().getTermVectors(docId);
> }
> TokenStream tvStream = TokenSources.getTokenStreamWithOffsets(vectors, fieldName);
> and in the TokenSources class:
> public static TokenStream getTokenStreamWithOffsets(Fields vectors, String field) throws IOException {
>   if (vectors == null) {
>     return null;
>   }
>   Terms vector = vectors.terms(field);
>   if (vector == null) {
>     return null;
>   }
>   if (!vector.hasPositions() || !vector.hasOffsets()) {
>     return null;
>   }
>   return getTokenStream(vector);
> }
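
The lookup-then-populate pattern in the suggestion above can be illustrated without Solr. The following is a minimal sketch under stated assumptions: a bounded, access-ordered LinkedHashMap stands in for the SolrCache, a Function<Integer, V> stands in for reader.getTermVectors(docId), and the hit counter exists only to show how often the underlying reader is consulted. All class and method names are hypothetical, and unlike a real SolrCache this sketch is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical per-document cache; V plays the role of the Fields object.
class TermVectorCacheSketch<V> {
  private final Function<Integer, V> reader; // stand-in for reader.getTermVectors(docId)
  private final Map<Integer, V> cache;
  private int readerHits = 0;

  TermVectorCacheSketch(Function<Integer, V> reader, int maxSize) {
    this.reader = reader;
    // Access-ordered LinkedHashMap that evicts the least recently used entry.
    this.cache = new LinkedHashMap<Integer, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Integer, V> eldest) {
        return size() > maxSize;
      }
    };
  }

  // Returns the cached value, or loads it once and remembers it (nulls are not cached).
  V get(int docId) {
    V vectors = cache.get(docId);
    if (vectors == null) {
      readerHits++;
      vectors = reader.apply(docId);
      if (vectors != null) {
        cache.put(docId, vectors);
      }
    }
    return vectors;
  }

  int readerHits() {
    return readerHits;
  }

  public static void main(String[] args) {
    TermVectorCacheSketch<String> cached =
        new TermVectorCacheSketch<>(id -> "vectors-" + id, 100);
    int rows = 30;
    int fields = 80;
    for (int docId = 0; docId < rows; docId++) {
      for (int f = 0; f < fields; f++) {
        cached.get(docId); // one lookup per (document, field) pair
      }
    }
    // Without a cache this loop would consult the reader rows * fields times.
    System.out.println("reader hits with cache: " + cached.readerHits());
  }
}
```

Keying the cache by docId rather than by (docId, field) is what lets one reader call serve every highlight field of a document.
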
> Measured times:
> 4000 ms for 1000 docs without cache
> 639 ms for 1000 docs with cache
> 102 ms for 30 docs without cache
> 22 ms for 30 docs with cache
> on an index with 190,000 docs, with a numFound of 32,000 and 80 different 
> highlight fields.
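
For reference, the speedups implied by the reported timings are roughly 6.3x for 1000 docs and 4.6x for 30 docs. The short sketch below only redoes that division on the numbers quoted above; it adds no new measurements, and the class name is hypothetical.

```java
// Ratios derived from the timings reported in the issue; no new measurements.
class HighlightSpeedupSketch {

  // Speedup factor: time without cache divided by time with cache.
  static double speedup(double withoutCacheMs, double withCacheMs) {
    return withoutCacheMs / withCacheMs;
  }

  public static void main(String[] args) {
    System.out.printf("1000 docs: %.1fx%n", speedup(4000, 639));
    System.out.printf("30 docs: %.1fx%n", speedup(102, 22));
  }
}
```
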
> I think queries with only one field to highlight per document do not 
> benefit that much from a cache like this; that's why I think an optional 
> cache would be the best solution here. 
> As far as I can see, the FastVectorHighlighter uses more or less the same 
> approach and could also benefit from this cache.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
