Replying to myself...

After thinking about it some more, I think getMatchedWords() could become even
more useful if it returned the actual token objects instead of just the term
text. For example (renamed getMatchedTokens() here):


    public function getMatchedTokens($string)
    {
        $words = array();

        $matchExpression = '/^'
            . str_replace(array('\\?', '\\*'), array('.', '.*'), preg_quote($this->_pattern->text, '/'))
            . '$/';
        if (@preg_match('/\pL/u', 'a') == 1) {
            // PCRE unicode support is turned on
            // add Unicode modifier to the match expression
            $matchExpression .= 'u';
        }

        $tokens = Zend_Search_Lucene_Analysis_Analyzer::getDefault()
            ->tokenize($string, 'UTF-8');
        foreach ($tokens as $token) {
            if (preg_match($matchExpression, $token->getTermText()) === 1) {
                $words[] = $token; // WAS $token->getTermText()
            }
        }
        
        return $words;
    }
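
To illustrate why token objects would be more useful than bare strings: as far as I know, Zend_Search_Lucene_Analysis_Token exposes getStartOffset() and getEndOffset() alongside getTermText(), so a custom highlighter could wrap each match at its exact position in the original text. A rough sketch follows. SketchToken and highlightByOffsets() are made-up names, the tokens are built by hand so the example runs without ZSL, and the offsets are assumed to be byte offsets into an ASCII string:

```php
<?php
// Hypothetical stand-in for Zend_Search_Lucene_Analysis_Token so the sketch
// runs without ZSL. Only the three accessors used below are mirrored.
class SketchToken
{
    private $_text;
    private $_start;
    private $_end;

    public function __construct($text, $start, $end)
    {
        $this->_text  = $text;
        $this->_start = $start;
        $this->_end   = $end;
    }

    public function getTermText()    { return $this->_text; }
    public function getStartOffset() { return $this->_start; }
    public function getEndOffset()   { return $this->_end; }
}

// Invented helper: wraps each token's span of $string in <b> tags.
// Assumes tokens are sorted by offset, do not overlap, and that offsets
// are byte offsets (true for ASCII input only).
function highlightByOffsets($string, array $tokens)
{
    $out = '';
    $pos = 0;
    foreach ($tokens as $token) {
        $start = $token->getStartOffset();
        $end   = $token->getEndOffset();
        $out  .= substr($string, $pos, $start - $pos);                   // text before the match
        $out  .= '<b>' . substr($string, $start, $end - $start) . '</b>'; // the match itself
        $pos   = $end;
    }
    return $out . substr($string, $pos);                                 // trailing text
}

$text = 'Hello, my name is Foobar and I am not a query';

// Tokens roughly as getMatchedTokens() might return them (built by hand here).
$matched = array(
    new SketchToken('foobar', 18, 24),
    new SketchToken('query', 40, 45),
);

$highlighted = highlightByOffsets($text, $matched);
echo $highlighted, "\n";
```

Because the highlighter works from offsets rather than from the term text, the original spelling ("Foobar") survives even if the analyzer lowercases terms.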



Carl.Vondrick wrote:
> Looking through the Zend Search Lucene source code, I think there's a simple
> change that can make it possible to use a custom highlighting system with
> ZSL and at least take a step towards solving the highlighting extensibility
> problems.
>
> The primary issue with using a custom highlighter with ZSL is that it's
> currently difficult to get an array of words to be highlighted from a query.
> This has to be done outside of ZSL and adds unnecessary complexity.
> The ->highlightMatchesDOM() methods of the various query objects already
> generate the array of words we are looking for, but it is impossible to
> access because those methods go straight on to do the actual highlighting.
>
> The quick and simple change is this: separate the ->highlightMatchesDOM()
> method into ->getMatchedWords() and ->highlightMatchesDOM().  So, for the
> Wildcard query, we have:
>
>     public function getMatchedWords($string)
>     {
>         $words = array();
> 
>         $matchExpression = '/^'
>             . str_replace(array('\\?', '\\*'), array('.', '.*'), preg_quote($this->_pattern->text, '/'))
>             . '$/';
>         if (@preg_match('/\pL/u', 'a') == 1) {
>             // PCRE unicode support is turned on
>             // add Unicode modifier to the match expression
>             $matchExpression .= 'u';
>         }
> 
>         $tokens = Zend_Search_Lucene_Analysis_Analyzer::getDefault()
>             ->tokenize($string, 'UTF-8');
>         foreach ($tokens as $token) {
>             if (preg_match($matchExpression, $token->getTermText()) === 1) {
>                 $words[] = $token->getTermText();
>             }
>         }
>         
>         return $words;
>     }
> 
>     public function highlightMatchesDOM(Zend_Search_Lucene_Document_Html $doc, &$colorIndex)
>     {
>         $doc->highlight(
>             $this->getMatchedWords($doc->getFieldUtf8Value('body')),
>             $this->_getHighlightColor($colorIndex)
>         );
>     }
> 
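
For anyone who wants to try the matching logic without a ZSL checkout, here is a standalone sketch of the wildcard-to-regex conversion used above. All the sketch* function names are made up for illustration, and sketchTokenize() is only a crude stand-in for the real analyzer (it lowercases and splits on non-alphanumeric characters), so treat this as an approximation rather than ZSL's actual behaviour:

```php
<?php
// Same conversion as in getMatchedWords(): quote the pattern for use in a
// regex, then turn the escaped wildcards back into regex equivalents
// ("\?" becomes "." and "\*" becomes ".*").
function sketchWildcardToRegex($pattern)
{
    return '/^'
        . str_replace(array('\\?', '\\*'), array('.', '.*'), preg_quote($pattern, '/'))
        . '$/';
}

// Hypothetical stand-in for the analyzer: lowercase the input and split it
// on anything that is not an ASCII letter or digit.
function sketchTokenize($string)
{
    return preg_split('/[^a-z0-9]+/', strtolower($string), -1, PREG_SPLIT_NO_EMPTY);
}

// Collects every word of $string that matches the wildcard $pattern.
function sketchMatchedWords($pattern, $string)
{
    $regex = sketchWildcardToRegex($pattern);
    $words = array();
    foreach (sketchTokenize($string) as $word) {
        if (preg_match($regex, $word) === 1) {
            $words[] = $word;
        }
    }
    return $words;
}

$regex = sketchWildcardToRegex('foo*');
$words = sketchMatchedWords('foo*', 'Hello, my name is Foobar and I am not a query');

echo $regex, "\n";
print_r($words);
```

Note that preg_quote() also escapes other regex metacharacters in the pattern (a literal "." stays a literal dot), so only "?" and "*" act as wildcards.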

>
> The only new code that needs to be written is in the boolean queries, which
> will need to iterate over their subqueries and array_merge() the words each
> subquery returns.
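
The boolean case could look roughly like the sketch below. SketchQuery, SketchTermQuery and SketchBooleanQuery are hypothetical stand-ins rather than the real ZSL classes, and the term query uses a crude lowercase-and-split tokenizer instead of the analyzer. A real implementation would presumably also want to skip prohibited subqueries, which this sketch ignores:

```php
<?php
// Hypothetical minimal interface carrying the proposed method.
interface SketchQuery
{
    public function getMatchedWords($string);
}

// Stand-in for a single-term query: returns each word equal to the term.
class SketchTermQuery implements SketchQuery
{
    private $_term;

    public function __construct($term)
    {
        $this->_term = $term;
    }

    public function getMatchedWords($string)
    {
        $words = array();
        // Crude tokenizer (assumption): lowercase, split on non-alphanumerics.
        $tokens = preg_split('/[^a-z0-9]+/', strtolower($string), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($tokens as $word) {
            if ($word === $this->_term) {
                $words[] = $word;
            }
        }
        return $words;
    }
}

// Stand-in for a boolean query: merges whatever each subquery returns.
class SketchBooleanQuery implements SketchQuery
{
    private $_subqueries;

    public function __construct(array $subqueries)
    {
        $this->_subqueries = $subqueries;
    }

    public function getMatchedWords($string)
    {
        $words = array();
        foreach ($this->_subqueries as $subquery) {
            $words = array_merge($words, $subquery->getMatchedWords($string));
        }
        return $words;
    }
}

$query = new SketchBooleanQuery(array(
    new SketchTermQuery('hello'),
    new SketchTermQuery('query'),
));
$words = $query->getMatchedWords('Hello, my name is Foobar and I am not a query');
print_r($words);
```

Since SketchBooleanQuery implements the same interface as its children, nested boolean queries would be handled by the same loop without extra code.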

>
> This makes it possible to get the matched words with one simple line:
>
>     Zend_Search_Lucene_Search_QueryParser::parse('foo* query string')
>         ->getMatchedWords('Hello, my name is Foobar and I am not a query');
>
> and we're off to the races.

>
> What do you say?  I can offer a patch + unit tests if the community thinks
> this is worthwhile (though, IMO, this is a quick change).

-- 
View this message in context: 
http://www.nabble.com/Simple-solution-for-Zend-Search-Lucene-highlighting--tp14545203s16154p14561466.html
Sent from the Zend Framework mailing list archive at Nabble.com.
