Re: [fw-general] Simple solution for Zend Search Lucene highlighting?

Markus Fischer Sat, 05 Jan 2008 01:35:58 -0800

Hello Carl,

I just wanted to let you know that at work we've created a similar solution. However, I won't be able to provide code snippets of our solution until next Tuesday (everyone's still on holiday here), so we can discuss this further then.


- Markus

Carl.Vondrick wrote:

Replying to myself... After thinking about it some more, I think it could become even more useful if it returned the actual token object instead of just the name. For example:


    public function getMatchedTokens($string)
    {
        $words = array();

        // Translate the wildcard pattern into an anchored regular expression
        $matchExpression = '/^'
            . str_replace(array('\\?', '\\*'), array('.', '.*'),
                          preg_quote($this->_pattern->text, '/'))
            . '$/';
        if (@preg_match('/\pL/u', 'a') == 1) {
            // PCRE unicode support is turned on
            // add Unicode modifier to the match expression
            $matchExpression .= 'u';
        }

        $tokens = Zend_Search_Lucene_Analysis_Analyzer::getDefault()
            ->tokenize($string, 'UTF-8');
        foreach ($tokens as $token) {
            if (preg_match($matchExpression, $token->getTermText()) === 1) {
                $words[] = $token; // WAS $token->getTermText()
            }
        }

        return $words;
    }
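
To illustrate why token objects could be more useful than plain term text, here is a minimal sketch of a custom highlighter that works from token offsets. SketchToken and sketchHighlight() are hypothetical names made up for this example; SketchToken only mimics the getTermText(), getStartOffset() and getEndOffset() accessors of the ZSL token class, and the offsets are assumed to be byte offsets into the original string.

```php
<?php
// Hypothetical stand-in for a ZSL token, exposing only the accessors this
// sketch relies on. Offsets are ASSUMED to be byte offsets into the input.
class SketchToken
{
    private $text;
    private $start;
    private $end;

    public function __construct($text, $start, $end)
    {
        $this->text  = $text;
        $this->start = $start;
        $this->end   = $end;
    }

    public function getTermText()    { return $this->text; }
    public function getStartOffset() { return $this->start; }
    public function getEndOffset()   { return $this->end; }
}

// Hypothetical helper: wrap each matched token's span of the original string
// in <b> tags. Tokens are processed right to left so that inserting markup
// does not shift the offsets of tokens that still have to be processed.
function sketchHighlight($string, array $tokens)
{
    usort($tokens, function ($a, $b) {
        return $b->getStartOffset() - $a->getStartOffset();
    });

    foreach ($tokens as $token) {
        $start  = $token->getStartOffset();
        $length = $token->getEndOffset() - $start;
        $string = substr($string, 0, $start)
                . '<b>' . substr($string, $start, $length) . '</b>'
                . substr($string, $start + $length);
    }

    return $string;
}

// The analyzer lowercases term text, but the offsets still point at the
// original, mixed-case word in the source string.
$text    = 'Hello, my name is Foobar';
$matched = array(new SketchToken('foobar', 18, 24));
$result  = sketchHighlight($text, $matched);
```

Because the highlighter works from offsets rather than from the lowercased term text, the original casing of "Foobar" is preserved. The sketch does not escape HTML and assumes the tokens do not overlap.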

    Carl.Vondrick wrote:

    Looking through the Zend Search Lucene source code, I think there's
    a simple change that can make it possible to use a custom
    highlighting system with ZSL and at least take a step towards
    solving the highlighting extensibility problems.

    The primary issue with using a custom highlighter with ZSL is that
    it's currently difficult to get an array of words to be highlighted
    from a query. This has to be done outside of ZSL and adds
    unnecessary complexity. Throughout the various query objects, in the
    ->highlightMatchesDOM() methods, the array of words we are looking
    for is generated, but it is then passed straight to the actual
    highlighting call, so there is no way to access it from outside.

    The quick and simple change is this: separate the
    ->highlightMatchesDOM() method into ->getMatchedWords() and
    ->highlightMatchesDOM(). So, for the Wildcard query, we have:

        public function getMatchedWords($string)
        {
            $words = array();

            // Translate the wildcard pattern into an anchored regular expression
            $matchExpression = '/^'
                . str_replace(array('\\?', '\\*'), array('.', '.*'),
                              preg_quote($this->_pattern->text, '/'))
                . '$/';
            if (@preg_match('/\pL/u', 'a') == 1) {
                // PCRE unicode support is turned on
                // add Unicode modifier to the match expression
                $matchExpression .= 'u';
            }

            $tokens = Zend_Search_Lucene_Analysis_Analyzer::getDefault()
                ->tokenize($string, 'UTF-8');
            foreach ($tokens as $token) {
                if (preg_match($matchExpression, $token->getTermText()) === 1) {
                    $words[] = $token->getTermText();
                }
            }

            return $words;
        }

        public function highlightMatchesDOM(Zend_Search_Lucene_Document_Html $doc,
                                            &$colorIndex)
        {
            $doc->highlight($this->getMatchedWords($doc->getFieldUtf8Value('body')),
                            $this->_getHighlightColor($colorIndex));
        }

    The only new code that needs to be written is in the boolean
    queries, which will need to iterate over their subqueries and
    array_merge() the words each subquery returns.

    This makes it possible to get the matched words with one simple line:

    Zend_Search_Lucene_Search_QueryParser::parse('foo* query string')
        ->getMatchedWords('Hello, my name is Foobar and I am not a query');

    and we're off to the races.

    What do you say? I can offer a patch + unit tests if the community
    thinks this is worthwhile (though, IMO, this is a quick change).
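
The boolean-query part of the proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions: the Sketch* classes and sketchTokenize() are hypothetical stand-ins written for this example and are not Zend Framework classes, and the tokenizer is a crude approximation of an analyzer (lowercase, split on anything that is not a-z). The wildcard-to-regex conversion is the one from the proposed getMatchedWords().

```php
<?php
// Hypothetical minimal query interface; NOT part of Zend Framework.
interface SketchQuery
{
    public function getMatchedWords($string);
}

// Crude stand-in for an analyzer: lowercase, then split on non-letters.
function sketchTokenize($string)
{
    return preg_split('/[^a-z]+/', strtolower($string), -1, PREG_SPLIT_NO_EMPTY);
}

class SketchTermQuery implements SketchQuery
{
    private $term;

    public function __construct($term) { $this->term = $term; }

    public function getMatchedWords($string)
    {
        $words = array();
        foreach (sketchTokenize($string) as $word) {
            if ($word === $this->term) {
                $words[] = $word;
            }
        }
        return $words;
    }
}

class SketchWildcardQuery implements SketchQuery
{
    private $pattern;

    public function __construct($pattern) { $this->pattern = $pattern; }

    public function getMatchedWords($string)
    {
        // Same conversion as in the proposal: ? becomes ., * becomes .*
        $matchExpression = '/^'
            . str_replace(array('\\?', '\\*'), array('.', '.*'),
                          preg_quote($this->pattern, '/'))
            . '$/';

        $words = array();
        foreach (sketchTokenize($string) as $word) {
            if (preg_match($matchExpression, $word) === 1) {
                $words[] = $word;
            }
        }
        return $words;
    }
}

class SketchBooleanQuery implements SketchQuery
{
    private $subqueries;

    public function __construct(array $subqueries) { $this->subqueries = $subqueries; }

    public function getMatchedWords($string)
    {
        // Iterate over the subqueries and merge the words each one returns.
        $words = array();
        foreach ($this->subqueries as $subquery) {
            $words = array_merge($words, $subquery->getMatchedWords($string));
        }
        return $words;
    }
}

$query = new SketchBooleanQuery(array(
    new SketchWildcardQuery('foo*'),
    new SketchTermQuery('query'),
    new SketchTermQuery('string'),
));
$words = $query->getMatchedWords('Hello, my name is Foobar and I am not a query');
// $words is array('foobar', 'query')
```

Note that array_merge() keeps duplicates, so a word matched by two subqueries would appear twice; wrapping the result in array_unique() would be one way to avoid that if the highlighter expects distinct words.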

