On Wed, Jun 3, 2009 at 7:34 PM, Mark Miller <[email protected]> wrote:
> Max Lynch wrote:
>
>>>> Well what happens is if I use a SpanScorer instead, and allocate it like
>>>> such:
>>>>
>>>> analyzer = StandardAnalyzer([])
>>>> tokenStream = analyzer.tokenStream("contents",
>>>>     lucene.StringReader(text))
>>>> ctokenStream = lucene.CachingTokenFilter(tokenStream)
>>>> highlighter = lucene.Highlighter(formatter,
>>>>     lucene.HighlighterSpanScorer(self.query, "contents", ctokenStream))
>>>> ctokenStream.reset()
>>>>
>>>> result = highlighter.getBestFragments(ctokenStream, text, 2, "...")
>>>>
>>>> My highlighter is still breaking up words inside of a span. For example,
>>>> if I search for "John Smith", instead of the highlighter being called
>>>> for the whole "John Smith", it gets called for "John" and then "Smith".
>>>
>>> I think you need to use SimpleSpanFragmenter (vs. SimpleFragmenter,
>>> which is the default used by Highlighter) to ensure that each fragment
>>> contains a full match for the query. E.g. something like this (copied
>>> from LIA 2nd edition):
>>>
>>> TermQuery query = new TermQuery(new Term("field", "fox"));
>>>
>>> TokenStream tokenStream =
>>>     new SimpleAnalyzer().tokenStream("field", new StringReader(text));
>>>
>>> SpanScorer scorer = new SpanScorer(query, "field",
>>>     new CachingTokenFilter(tokenStream));
>>> Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
>>> Highlighter highlighter = new Highlighter(scorer);
>>> highlighter.setTextFragmenter(fragmenter);
>>
>> Okay, I hacked something up in Java that illustrates my issue.
>>
>>
>> import org.apache.lucene.search.*;
>> import org.apache.lucene.analysis.*;
>> import org.apache.lucene.document.*;
>> import org.apache.lucene.index.IndexWriter;
>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>> import org.apache.lucene.index.Term;
>> import org.apache.lucene.queryParser.QueryParser;
>> import org.apache.lucene.store.Directory;
>> import org.apache.lucene.store.RAMDirectory;
>> import org.apache.lucene.search.highlight.*;
>> import org.apache.lucene.search.spans.SpanTermQuery;
>> import java.io.Reader;
>> import java.io.StringReader;
>>
>> public class PhraseTest {
>>     private IndexSearcher searcher;
>>     private RAMDirectory directory;
>>
>>     public PhraseTest() throws Exception {
>>         directory = new RAMDirectory();
>>
>>         Analyzer analyzer = new Analyzer() {
>>             public TokenStream tokenStream(String fieldName, Reader reader) {
>>                 return new WhitespaceTokenizer(reader);
>>             }
>>
>>             public int getPositionIncrementGap(String fieldName) {
>>                 return 100;
>>             }
>>         };
>>
>>         IndexWriter writer = new IndexWriter(directory, analyzer, true,
>>             IndexWriter.MaxFieldLength.LIMITED);
>>
>>         Document doc = new Document();
>>         String text = "Jimbo John is his name";
>>         doc.add(new Field("contents", text, Field.Store.YES,
>>             Field.Index.ANALYZED));
>>         writer.addDocument(doc);
>>
>>         writer.optimize();
>>         writer.close();
>>
>>         searcher = new IndexSearcher(directory);
>>
>>         // Try a phrase query
>>         PhraseQuery phraseQuery = new PhraseQuery();
>>         phraseQuery.add(new Term("contents", "Jimbo"));
>>         phraseQuery.add(new Term("contents", "John"));
>>
>>         // Try a SpanTermQuery
>>         SpanTermQuery spanTermQuery = new SpanTermQuery(
>>             new Term("contents", "Jimbo John"));
>>
>>         // Try a parsed query
>>         Query parsedQuery = new QueryParser("contents",
>>             analyzer).parse("\"Jimbo John\"");
>>
>>         Hits hits = searcher.search(parsedQuery);
>>         System.out.println("We found " + hits.length() + " hits.");
>>
>>         // Highlight the results
>>         CachingTokenFilter tokenStream = new CachingTokenFilter(
>>             analyzer.tokenStream("contents", new StringReader(text)));
>>
>>         SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();
>>
>>         SpanScorer sc = new SpanScorer(parsedQuery, "contents",
>>             tokenStream, "contents");
>>
>>         Highlighter highlighter = new Highlighter(formatter, sc);
>>         highlighter.setTextFragmenter(new SimpleSpanFragmenter(sc));
>>         tokenStream.reset();
>>
>>         String rv = highlighter.getBestFragments(tokenStream, text, 1,
>>             "...");
>>         System.out.println(rv);
>>     }
>>
>>     public static void main(String[] args) {
>>         System.out.println("Starting...");
>>         try {
>>             PhraseTest pt = new PhraseTest();
>>         } catch (Exception ex) {
>>             ex.printStackTrace();
>>         }
>>     }
>> }
>>
>>
>>
>> The output I'm getting is that instead of highlighting <B>Jimbo John</B>
>> it does <B>Jimbo</B> then <B>John</B>. Can I get around this somehow? I
>> tried several different query types (they are declared in the code, but
>> only the parsed version is being used).
>>
>> Thanks
>> -max
>>
>>
>>
> Sorry, not much you can do at the moment. The change is non-trivial for
> sure (it's probably easier to write some regex that merges them). This
> limitation was accepted because with most markup it will display the same
> anyway. An option to merge would be great, and while I don't remember the
> details, the last time I looked it just wasn't easy to do given the
> implementation. The highlighter highlights by running through and scoring
> tokens, not phrases, and the Span highlighter asks whether a given token is
> in a given span to decide if it should get a score over 0. Tokens are
> handed off one by one to the SpanScorer to be scored. I looked into adding
> the option at one point (back when I was putting the SpanScorer together)
> and didn't find it worth the effort after getting blocked a couple of times.
>
>
Thanks anyway, Mark. Yeah, what I gathered from the results is that I will
only get hits and highlights for phrases if the whole phrase was found, but
the highlights will be separated. I just combine them now, but was hoping for
a more elegant solution. At least I know that what I'm highlighting isn't a
random part of the text but the actual phrase, so all is not lost.
-max
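
The combining step mentioned above can be sketched as a small regex pass over the highlighter output. This assumes SimpleHTMLFormatter's default <B></B> tags; the class and method names here (MergeHighlights, merge) are made up for illustration:

```java
import java.util.regex.Pattern;

public class MergeHighlights {
    // SimpleHTMLFormatter wraps each scored token separately, so a phrase
    // match comes back as "<B>Jimbo</B> <B>John</B>". Collapsing a closing
    // tag followed only by whitespace and an opening tag merges the two
    // adjacent highlights into one.
    private static final Pattern ADJACENT = Pattern.compile("</B>(\\s+)<B>");

    public static String merge(String fragment) {
        // "$1" preserves the original whitespace between the merged tokens
        return ADJACENT.matcher(fragment).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(merge("<B>Jimbo</B> <B>John</B> is his name"));
        // prints: <B>Jimbo John</B> is his name
    }
}
```

This only changes the markup, not which tokens get highlighted, so it is safe exactly in the situation described here: the SpanScorer has already guaranteed the adjacent highlighted tokens belong to the same phrase match.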