Hello Carl,
I just wanted to let you know that we've built a similar solution at
work. However, I won't be able to provide code snippets from it until
next Tuesday, when we can discuss this further (everyone's still on
holiday here).
- Markus
Carl.Vondrick wrote:
Replying to myself... After thinking about it some more, I think it
could become even more useful if it returned the actual token object
instead of just the name. For example:
public function getMatchedTokens($string)
{
    $words = array();

    $matchExpression = '/^' . str_replace(array('\\?', '\\*'), array('.', '.*'),
        preg_quote($this->_pattern->text, '/')) . '$/';
    if (@preg_match('/\pL/u', 'a') == 1) {
        // PCRE unicode support is turned on
        // add Unicode modifier to the match expression
        $matchExpression .= 'u';
    }

    $tokens = Zend_Search_Lucene_Analysis_Analyzer::getDefault()->tokenize($string, 'UTF-8');
    foreach ($tokens as $token) {
        if (preg_match($matchExpression, $token->getTermText()) === 1) {
            $words[] = $token; // WAS $token->getTermText()
        }
    }

    return $words;
}
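To illustrate why token objects could be more useful than plain strings, here is a minimal sketch that highlights matches by their offsets in the original text, which keeps the original casing ("Foobar") even though the analyzed term is lowercased ("foobar"). SketchToken is a hypothetical stand-in for the real token class, reduced to getTermText(), getStartOffset() and getEndOffset(); highlightByOffsets() is likewise a made-up helper. The sketch assumes offsets are byte positions, end-exclusive and non-overlapping, which may not hold for every analyzer.

```php
<?php
// Hypothetical stand-in for the analyzer's token class (illustration only).
class SketchToken
{
    private $text;
    private $start;
    private $end;

    public function __construct($text, $start, $end)
    {
        $this->text  = $text;
        $this->start = $start;
        $this->end   = $end;
    }

    public function getTermText()    { return $this->text; }
    public function getStartOffset() { return $this->start; }
    public function getEndOffset()   { return $this->end; }
}

// Hypothetical helper: wrap each matched token in <b>...</b> using its offsets.
// Assumption: offsets are byte positions, end-exclusive, and do not overlap.
function highlightByOffsets($source, array $tokens)
{
    // Work from the rightmost token backwards so earlier offsets stay valid.
    usort($tokens, function ($a, $b) {
        return $b->getStartOffset() - $a->getStartOffset();
    });

    foreach ($tokens as $token) {
        $start  = $token->getStartOffset();
        $length = $token->getEndOffset() - $start;
        $match  = substr($source, $start, $length);
        $source = substr_replace($source, '<b>' . $match . '</b>', $start, $length);
    }

    return $source;
}

$source = 'Hello, my name is Foobar';
$tokens = array(new SketchToken('foobar', 18, 24));
echo highlightByOffsets($source, $tokens), "\n";
```

With only the term text, a highlighter would have to search the document again for each word; with offsets it can go straight to the matched span.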
Carl.Vondrick wrote:
Looking through the Zend Search Lucene source code, I think there's
a simple change that can make it possible to use a custom
highlighting system with ZSL and at least take a step towards
solving the highlighting extensibility problems.
The primary issue with using a custom highlighter with ZSL is that
it's currently difficult to get an array of words to be highlighted
from a query. This has to be done outside of ZSL and adds
unnecessary complexity. The ->highlightMatchesDOM() methods of the
various query objects already build the array of words we are looking
for, but they pass it straight to the highlighter, so callers can
never access it.
The quick and simple change is this: split the
->highlightMatchesDOM() method into ->getMatchedWords() and
->highlightMatchesDOM(). So, for the Wildcard query, we have:
public function getMatchedWords($string)
{
    $words = array();

    $matchExpression = '/^' . str_replace(array('\\?', '\\*'), array('.', '.*'),
        preg_quote($this->_pattern->text, '/')) . '$/';
    if (@preg_match('/\pL/u', 'a') == 1) {
        // PCRE unicode support is turned on
        // add Unicode modifier to the match expression
        $matchExpression .= 'u';
    }

    $tokens = Zend_Search_Lucene_Analysis_Analyzer::getDefault()->tokenize($string, 'UTF-8');
    foreach ($tokens as $token) {
        if (preg_match($matchExpression, $token->getTermText()) === 1) {
            $words[] = $token->getTermText();
        }
    }

    return $words;
}

public function highlightMatchesDOM(Zend_Search_Lucene_Document_Html $doc, &$colorIndex)
{
    $doc->highlight($this->getMatchedWords($doc->getFieldUtf8Value('body')),
        $this->_getHighlightColor($colorIndex));
}
The only new code that needs to be written is in the boolean
queries, which will need to iterate over their subqueries and
array_merge() the words each subquery returns.
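The boolean case might look roughly like the sketch below. Everything named Sketch* is hypothetical and only stands in for the real query classes: SketchTermQuery uses a naive lowercase word split instead of the analyzer, and the array_unique() call is my own addition to drop duplicates. A real implementation would presumably also want to skip prohibited subqueries, which this sketch ignores.

```php
<?php
// Hypothetical interface: any query that can report the words it matched.
interface SketchQuery
{
    public function getMatchedWords($string);
}

// Hypothetical stand-in for a term query: matches one exact lowercased word.
class SketchTermQuery implements SketchQuery
{
    private $term;

    public function __construct($term)
    {
        $this->term = $term;
    }

    public function getMatchedWords($string)
    {
        $words = array();
        // Naive tokenization for illustration; the real code would use the analyzer.
        foreach (preg_split('/\W+/', strtolower($string), -1, PREG_SPLIT_NO_EMPTY) as $word) {
            if ($word === $this->term) {
                $words[] = $word;
            }
        }
        return $words;
    }
}

// Hypothetical boolean query: merges whatever its subqueries return.
class SketchBooleanQuery implements SketchQuery
{
    private $subqueries;

    public function __construct(array $subqueries)
    {
        $this->subqueries = $subqueries;
    }

    public function getMatchedWords($string)
    {
        $words = array();
        foreach ($this->subqueries as $subquery) {
            $words = array_merge($words, $subquery->getMatchedWords($string));
        }
        // Drop duplicates and reindex so callers get a plain list.
        return array_values(array_unique($words));
    }
}

$query = new SketchBooleanQuery(array(
    new SketchTermQuery('name'),
    new SketchTermQuery('query'),
));
print_r($query->getMatchedWords('Hello, my name is Foobar and I am not a query'));
```

Because every query type exposes the same method, nested boolean queries would merge recursively without any extra code.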
This makes it possible to get the matched words with one simple line:
Zend_Search_Lucene_Search_QueryParser::parse('foo* query string')
    ->getMatchedWords('Hello, my name is Foobar and I am not a query');
and we're off to the races.
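As a rough idea of what a custom highlighter could then do with that array, here is a minimal sketch. highlightWords() is a hypothetical helper, not part of ZSL; it works on plain text only (it is not HTML-aware), matches case-insensitively, and relies on \b word boundaries, which I would only trust for ASCII words.

```php
<?php
// Hypothetical custom highlighter: wraps each listed word in the given markers.
// Assumption: $text is plain text, not HTML, and the words are ASCII.
function highlightWords($text, array $words, $open = '<mark>', $close = '</mark>')
{
    $words = array_unique($words);
    if (count($words) === 0) {
        return $text;
    }

    // Escape each word so it is matched literally inside the pattern.
    $quoted = array();
    foreach ($words as $word) {
        $quoted[] = preg_quote($word, '/');
    }
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

    // A callback avoids surprises if the markers contain "$" or "\".
    return preg_replace_callback($pattern, function ($m) use ($open, $close) {
        return $open . $m[0] . $close;
    }, $text);
}

echo highlightWords(
    'Hello, my name is Foobar and I am not a query',
    array('foobar', 'query')
), "\n";
```

The point is only that the highlighter no longer needs to know anything about query types; it just receives a list of words.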
What do you say? I can offer a patch + unit tests if the community
thinks this is worthwhile (though, IMO, this is a quick change).