Replying to myself...
After thinking about it some more, I think the method could become even more
useful if it returned the actual token objects instead of just their term
text. For example:
public function getMatchedTokens($string)
{
    $words = array();

    $matchExpression = '/^' . str_replace(array('\\?', '\\*'),
        array('.', '.*'), preg_quote($this->_pattern->text, '/')) . '$/';
    if (@preg_match('/\pL/u', 'a') == 1) {
        // PCRE unicode support is turned on
        // add Unicode modifier to the match expression
        $matchExpression .= 'u';
    }

    $tokens = Zend_Search_Lucene_Analysis_Analyzer::getDefault()
        ->tokenize($string, 'UTF-8');
    foreach ($tokens as $token) {
        if (preg_match($matchExpression, $token->getTermText()) === 1) {
            $words[] = $token; // WAS $token->getTermText()
        }
    }

    return $words;
}
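
To show why token objects are handier than bare strings, here is a minimal sketch of a custom highlighter built on top of them. StubToken is a hypothetical stand-in for Zend_Search_Lucene_Analysis_Token so the snippet runs without the framework, and highlightTokens() is a made-up helper, not part of ZSL. It assumes the offsets are byte positions into the original string, that the end offset is exclusive, and that tokens do not overlap.

```php
<?php
// Hypothetical stand-in for Zend_Search_Lucene_Analysis_Token; it mirrors
// only the accessors used below so the sketch runs without Zend Framework.
class StubToken
{
    private $_text;
    private $_start;
    private $_end;

    public function __construct($text, $start, $end)
    {
        $this->_text  = $text;
        $this->_start = $start;
        $this->_end   = $end;
    }

    public function getTermText()    { return $this->_text; }
    public function getStartOffset() { return $this->_start; }
    public function getEndOffset()   { return $this->_end; }
}

// Hypothetical helper: wrap each matched token in <b>...</b>.
// Assumes byte offsets, an exclusive end offset and non-overlapping tokens.
function highlightTokens($string, array $tokens)
{
    // Work from the last token to the first so earlier offsets stay valid
    usort($tokens, function ($a, $b) {
        return $b->getStartOffset() - $a->getStartOffset();
    });

    foreach ($tokens as $token) {
        $start  = $token->getStartOffset();
        $length = $token->getEndOffset() - $start;
        $string = substr($string, 0, $start)
                . '<b>' . substr($string, $start, $length) . '</b>'
                . substr($string, $start + $length);
    }

    return $string;
}

$text   = 'Hello, my name is Foobar';
$tokens = array(new StubToken('foobar', 18, 24));
echo highlightTokens($text, $tokens), "\n";
```

Because the highlighter slices the original string by offset, the original casing ("Foobar") survives even when the analyzer has normalized the term text to "foobar".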
Carl.Vondrick wrote:
>
>
Looking through the Zend Search Lucene source code, I think there's a simple
change that can make it possible to use a custom highlighting system with
ZSL and at least take a step towards solving the highlighting extensibility
problems.
>
>
The primary issue with using a custom highlighter with ZSL is that it's
currently difficult to get an array of words to be highlighted from a query.
This has to be done outside of ZSL and adds unnecessary complexity.
The ->highlightMatchesDOM() methods of the various query objects already
build the array of words to look for, but they pass it straight to the
highlighter, so calling code never gets access to it.
>
>
The quick and simple change is this: split the ->highlightMatchesDOM()
method into ->getMatchedWords() and a slimmer ->highlightMatchesDOM(). So,
for the Wildcard query, we have:
>
>
> public function getMatchedWords($string)
> {
>     $words = array();
>
>     $matchExpression = '/^' . str_replace(array('\\?', '\\*'),
>         array('.', '.*'), preg_quote($this->_pattern->text, '/')) . '$/';
>     if (@preg_match('/\pL/u', 'a') == 1) {
>         // PCRE unicode support is turned on
>         // add Unicode modifier to the match expression
>         $matchExpression .= 'u';
>     }
>
>     $tokens = Zend_Search_Lucene_Analysis_Analyzer::getDefault()
>         ->tokenize($string, 'UTF-8');
>     foreach ($tokens as $token) {
>         if (preg_match($matchExpression, $token->getTermText()) === 1) {
>             $words[] = $token->getTermText();
>         }
>     }
>
>     return $words;
> }
>
> public function highlightMatchesDOM(Zend_Search_Lucene_Document_Html $doc,
>     &$colorIndex)
> {
>     $doc->highlight(
>         $this->getMatchedWords($doc->getFieldUtf8Value('body')),
>         $this->_getHighlightColor($colorIndex)
>     );
> }
>
>
>
The only new code that needs to be written is in the boolean queries, which
will need to iterate over their subqueries and array_merge() the words each
subquery returns.
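
A rough sketch of that merging step, using stand-in classes so it runs without Zend Framework. MatchedWordsQuery, StubTermQuery and StubBooleanQuery are all hypothetical names, not ZSL classes. The array_unique() call is an extra assumption on my part (it drops words reported by more than one subquery), and the sketch ignores prohibited subqueries, which a real implementation would probably want to skip.

```php
<?php
// Hypothetical interface for the proposed method; not part of ZSL.
interface MatchedWordsQuery
{
    public function getMatchedWords($string);
}

// Stand-in leaf query that matches a single term, case-insensitively.
class StubTermQuery implements MatchedWordsQuery
{
    private $_term;

    public function __construct($term)
    {
        $this->_term = strtolower($term);
    }

    public function getMatchedWords($string)
    {
        $words = array();
        // Crude tokenizer for illustration only; a real query would use the analyzer
        $tokens = preg_split('/[^a-z0-9]+/', strtolower($string), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($tokens as $word) {
            if ($word === $this->_term) {
                $words[] = $word;
            }
        }
        return $words;
    }
}

// Stand-in boolean query: merges whatever its subqueries report.
class StubBooleanQuery implements MatchedWordsQuery
{
    private $_subqueries;

    public function __construct(array $subqueries)
    {
        $this->_subqueries = $subqueries;
    }

    public function getMatchedWords($string)
    {
        $words = array();
        foreach ($this->_subqueries as $subquery) {
            $words = array_merge($words, $subquery->getMatchedWords($string));
        }
        // Assumption: duplicates are not useful to a highlighter, so drop them
        return array_values(array_unique($words));
    }
}

$query = new StubBooleanQuery(array(
    new StubTermQuery('query'),
    new StubTermQuery('hello'),
));
print_r($query->getMatchedWords('Hello, my name is Foobar and I am not a query'));
```

Since a boolean query can itself be a subquery, the same loop handles nested boolean queries without any extra code.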
>
>
This makes it possible to get the matched words with one simple line:
>
Zend_Search_Lucene_Search_QueryParser::parse('foo* query string')
    ->getMatchedWords('Hello, my name is Foobar and I am not a query');
>
and we're off to the races.
>
>
What do you say? I can offer a patch + unit tests if the community thinks
this is worthwhile (though, IMO, this is a quick change).
>
--
View this message in context:
http://www.nabble.com/Simple-solution-for-Zend-Search-Lucene-highlighting--tp14545203s16154p14561466.html
Sent from the Zend Framework mailing list archive at Nabble.com.