On Tue, Feb 4, 2014 at 12:20 AM, Trejkaz wrote:

> I'm trying to find a precise and reasonably efficient way to highlight
> all occurrences of terms in the query, only highlighting fields which
> match the corresponding fields used in the query. This seems like it
> would be a fairly common requirement in applications. We have an
> existing implementation, but it works by re-reading the entire text
> back through the analyser. This is slow for large text, and sometimes
> we analyse the same text twice - and both variants could well be in
> the query. So I'm looking for a shortcut.
  [snip]

I am in a similar situation with a web-based application.  In addition,
the content (XML) is dynamically transformed for rendering, which makes
the highlighter features of Lucene problematic from an integration
perspective.

Our current solution is to do highlighting on the client side.  When a
search happens, the search results from the server include the parsed
query terms, so the client knows which terms to highlight without
having to reimplement a complete query string parser.

A problem is that Lucene (we are still on v3.0.3) does not provide a
robust mechanism for extracting the terms of a query.  The following is
the utility method that the server uses to get the terms needed to
support client-side highlighting:

  /**
   * Extract out terms from query.
   * <p><b>IMPLEMENTATION NOTE:</b> Lucene does not provide a robust,
   * single method for extracting the low-level terms of a query.
   * Experimentation has shown that the
   * {@link Query#extractTerms(Set)} methods of some Query types do
   * not work, or do not work as desired.  Therefore, this method
   * checks for specific Query types to extract terms.
   * </p>
   * @param   q       Query to extract terms of.
   * @param   r       {@link IndexReader} that executed the query.
   * @param   terms   {@link Term} {@link Set set} to fill; if
   *                  <tt>null</tt>, a newly allocated set will be
   *                  returned.
   * @return  Set of terms.
   */
  public static Set<Term> extractTermsFromQuery(
      Query q,
      IndexReader r,
      Set<Term> terms
  ) {
    if (terms == null) terms = new HashSet<Term>();
    if (q instanceof TermQuery) {
      terms.add(((TermQuery)q).getTerm());

    } else if (q instanceof WildcardQuery) {
      terms.add(((WildcardQuery)q).getTerm());

    } else if (q instanceof PhraseQuery) {
      PhraseQuery pq = (PhraseQuery)q;
      String s = pq.toString(null);
      int i = s.indexOf('"');
      if (i == 0) {
        terms.add(new Term(FIELD_CONTENT,s));
      } else {
        terms.add(new Term(s.substring(0,i-1),s.substring(i)));
      }

    } else if (q instanceof MultiPhraseQuery) {
      ((MultiPhraseQuery)q).extractTerms(terms);

    } else if (q instanceof PrefixQuery) {
      Term t = ((PrefixQuery)q).getPrefix();
      terms.add(new Term(t.field(), t.text()+"*"));

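    } else if (q instanceof FuzzyQuery) {
      FuzzyQuery fq = (FuzzyQuery)q;
      try {
        // Recurse only on a successful rewrite; otherwise q would still
        // be the FuzzyQuery and the recursion would never terminate.
        extractTermsFromQuery(fq.rewrite(r),r,terms);
      } catch (Exception e) {
        log.warn("Error rewriting fuzzy query ["+fq+"]: "+e);
      }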

    } else if (q instanceof BooleanQuery) {
      for (BooleanClause clause : ((BooleanQuery)q).getClauses()) {
        if (clause.getOccur() != BooleanClause.Occur.MUST_NOT) {
          extractTermsFromQuery(clause.getQuery(),r,terms);
        }
      }

    } else {
      try {
        q.extractTerms(terms);
      } catch (Exception e) {
        log.warn("Caught exception trying to extract terms from query ["+
            q+"]: ", e);
      }
    }
    return terms;
  }

Client code then translates the extracted terms into regular
expressions for matching while walking the DOM.  The terms provided
above can contain '*' and '?' characters, so the client code
transforms them into an equivalent regex pattern.  Our XML->HTML
transform includes contextual information for some nodes, so
highlighting can be constrained when the query was restricted to
specific fields.
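
To illustrate the wildcard-to-regex step, here is a minimal sketch.  It
is not our actual client code (which runs in the browser); it is a Java
rendering of the same idea.  The class and method names are made up for
this example, and mapping '*' to "\w*" and '?' to "\w" (so a match stays
inside one word) is an assumption; a different client might prefer ".*"
and ".".  Phrase terms (the quoted strings produced for PhraseQuery)
would need separate handling and are not covered here.

```java
import java.util.regex.Pattern;

// Hypothetical helper, named for this example only.
public class WildcardToRegex {

    /**
     * Converts term text that may contain Lucene-style wildcards
     * ('*' = any run of characters, '?' = exactly one character)
     * into a case-insensitive regex pattern.
     */
    public static Pattern toPattern(String termText) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : termText.toCharArray()) {
            if (c == '*' || c == '?') {
                // Flush pending literal text, quoted so that regex
                // metacharacters in the term are matched literally.
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                // Assumption: wildcards only span word characters.
                regex.append(c == '*' ? "\\w*" : "\\w");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern p = toPattern("highlight*");
        // find() locates the term anywhere in a text node's content.
        System.out.println(p.matcher("client-side highlighting").find());
    }
}
```

Quoting the literal parts matters because analysed terms can still
contain characters such as '.' or '+' that would otherwise be
interpreted as regex operators.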

Not sure if any of this helps you,

--ewh
