Looking through the Zend Search Lucene source code, I think there's a simple
change that can make it possible to use a custom highlighting system with
ZSL and at least take a step towards solving the highlighting extensibility
problems.


The primary issue with using a custom highlighter with ZSL is that it's
currently difficult to get the array of words to be highlighted from a query.
This has to be done outside of ZSL, which adds unnecessary complexity.
Each query object's ->highlightMatchesDOM() method already builds the array
of words we are looking for, but it passes that array straight to the
highlighter, so calling code never gets access to it.


The quick and simple change is this: split the ->highlightMatchesDOM()
method into ->getMatchedWords() and ->highlightMatchesDOM().  So, for the
Wildcard query, we have:



    public function getMatchedWords($string)
    {
        $words = array();

        $matchExpression = '/^'
            . str_replace(array('\\?', '\\*'), array('.', '.*'),
                          preg_quote($this->_pattern->text, '/'))
            . '$/';
        if (@preg_match('/\pL/u', 'a') == 1) {
            // PCRE unicode support is turned on
            // add Unicode modifier to the match expression
            $matchExpression .= 'u';
        }

        $tokens = Zend_Search_Lucene_Analysis_Analyzer::getDefault()
            ->tokenize($string, 'UTF-8');
        foreach ($tokens as $token) {
            if (preg_match($matchExpression, $token->getTermText()) === 1) {
                $words[] = $token->getTermText();
            }
        }
        
        return $words;
    }

    public function highlightMatchesDOM(Zend_Search_Lucene_Document_Html $doc,
                                        &$colorIndex)
    {
        $doc->highlight(
            $this->getMatchedWords($doc->getFieldUtf8Value('body')),
            $this->_getHighlightColor($colorIndex)
        );
    }
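
To make the pattern handling easier to try outside the framework, here is a rough stand-alone sketch of the same wildcard-to-regex conversion. The function names wildcardToRegex() and matchedWords() are made up for illustration, and the tokenizer is only an assumption: it lowercases the text and splits on anything that is not a letter or digit, which approximates, but is not, the default ZSL analyzer.

```php
<?php
// Hypothetical helper, not part of Zend_Search_Lucene: turns a wildcard
// pattern such as "foo*" into an anchored regular expression.
function wildcardToRegex($pattern)
{
    // Quote the whole pattern, then turn the quoted "?" and "*" back
    // into their regex equivalents.
    $regex = '/^'
        . str_replace(array('\\?', '\\*'), array('.', '.*'),
                      preg_quote($pattern, '/'))
        . '$/';

    if (@preg_match('/\pL/u', 'a') == 1) {
        // PCRE Unicode support is available, so add the Unicode modifier.
        $regex .= 'u';
    }

    return $regex;
}

// Hypothetical helper: returns the tokens of $text that match $pattern.
// Assumption: a naive tokenizer (ASCII lowercase, split on non-alphanumerics)
// stands in for the real analyzer.
function matchedWords($pattern, $text)
{
    $regex  = wildcardToRegex($pattern);
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1,
                         PREG_SPLIT_NO_EMPTY);

    $words = array();
    foreach ($tokens as $token) {
        if (preg_match($regex, $token) === 1) {
            $words[] = $token;
        }
    }

    return $words;
}

print_r(matchedWords('foo*', 'Hello, my name is Foobar and I am not a query'));
```

With the naive tokenizer above, the only token matching foo* in that sentence is "foobar". The real analyzer may tokenize differently, so treat this as a way to check the regex logic only.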


The only new code that needs to be written is in the boolean query, which
will need to iterate over its subqueries and array_merge() the words that
each subquery returns.
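
As a rough idea of what that boolean merge could look like, here is a self-contained sketch. Everything in it is hypothetical: the MatchedWordsProvider interface and the SketchTermQuery and SketchBooleanQuery classes are stand-ins I made up so the example runs without the framework. Skipping prohibited subqueries (sign === false) is my assumption about the sensible behaviour, since words from a "must not match" clause probably should not be highlighted.

```php
<?php
// Hypothetical contract standing in for the proposed getMatchedWords() method.
interface MatchedWordsProvider
{
    public function getMatchedWords($string);
}

// Hypothetical stand-in for a single-term query, using a naive tokenizer.
class SketchTermQuery implements MatchedWordsProvider
{
    private $term;

    public function __construct($term)
    {
        $this->term = strtolower($term);
    }

    public function getMatchedWords($string)
    {
        $tokens = preg_split('/[^\p{L}\p{N}]+/u', strtolower($string), -1,
                             PREG_SPLIT_NO_EMPTY);
        $words = array();
        foreach ($tokens as $token) {
            if ($token === $this->term) {
                $words[] = $token;
            }
        }
        return $words;
    }
}

// Hypothetical stand-in for the boolean query.
class SketchBooleanQuery implements MatchedWordsProvider
{
    private $subqueries;
    private $signs; // true = required, null = optional, false = prohibited

    public function __construct(array $subqueries, array $signs)
    {
        $this->subqueries = $subqueries;
        $this->signs      = $signs;
    }

    public function getMatchedWords($string)
    {
        $words = array();
        foreach ($this->subqueries as $id => $subquery) {
            // Assumption: prohibited subqueries contribute no words.
            if ($this->signs[$id] === false) {
                continue;
            }
            $words = array_merge($words, $subquery->getMatchedWords($string));
        }
        return $words;
    }
}

$query = new SketchBooleanQuery(
    array(new SketchTermQuery('foobar'),
          new SketchTermQuery('query'),
          new SketchTermQuery('hello')),
    array(true, null, false)
);
print_r($query->getMatchedWords('Hello, my name is Foobar and I am not a query'));
```

The merged list can contain duplicates when a word occurs more than once or is matched by several subqueries. Whether to run it through array_unique() is a design choice I would leave to the caller.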


This makes it possible to get the matched words with one simple line:

    Zend_Search_Lucene_Search_QueryParser::parse('foo* query string')
        ->getMatchedWords('Hello, my name is Foobar and I am not a query');

and we're off to the races.
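
For example, once the word list is available, a custom highlighter can be quite small. The sketch below is only an illustration under my own assumptions: highlightWords() is a hypothetical function name, the input is plain text rather than HTML, and the `<em class="hit">` markup is an arbitrary choice.

```php
<?php
// Hypothetical custom highlighter: wraps each matched word in <em> tags
// and HTML-escapes everything else. Assumes plain-text input.
function highlightWords($text, array $words)
{
    if (count($words) === 0) {
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    }

    // Build one alternation out of the (quoted) words.
    $alternatives = array();
    foreach (array_unique($words) as $word) {
        $alternatives[] = preg_quote($word, '/');
    }
    $pattern = '/\b(' . implode('|', $alternatives) . ')\b/iu';

    // Split on the words but keep them: with a single capture group,
    // odd-numbered parts are the matched words, even-numbered parts are
    // the text in between.
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        $escaped = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        $out .= ($i % 2 === 1)
            ? '<em class="hit">' . $escaped . '</em>'
            : $escaped;
    }

    return $out;
}

echo highlightWords('Hello, my name is Foobar & I am not a query',
                    array('foobar', 'query')), "\n";
```

The word list here is hard-coded. With the proposed change it would come straight from ->getMatchedWords() instead.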


What do you say?  I can offer a patch + unit tests if the community thinks
this is worthwhile (though, IMO, this is a quick change).

-- 
View this message in context: 
http://www.nabble.com/Simple-solution-for-Zend-Search-Lucene-highlighting--tp14545203s16154p14545203.html
Sent from the Zend Framework mailing list archive at Nabble.com.
