Re: Multiword Highlighting

2007-02-16 Thread Erick Erickson
It must be time to eat lunch, since the more I stare at this code, the less sense it makes to me. Which is a sure sign that I need a break <G>. But a couple of things. 1) My test cases throw some exceptions with the code as-is. The spans.get(0) is a problem in that it's not guaranteed that the

Re: Multiword Highlighting

2007-02-16 Thread Mark Miller
1) My test cases throw some exceptions with the code as-is. The spans.get(0) is a problem in that it's not guaranteed that the spans returned will have anything in them. Also, I don't think that the test for reqSpans.get(0).next in queryClauses[i].isRequired is correct (even if it doesn't

Re: Multiword Highlighting

2007-02-16 Thread Erick Erickson
That may be the difference then. I'm actually working with both a complete index and a memory index, depending on what phase I'm in. It turns out that I probably can't put the document in a memory index on the fly because...well...because <G>... That said, though, I can pretty easily use this as a

Re: Multiword Highlighting

2007-02-15 Thread Erick Erickson
I hope you're all following this old thread, because I've just run into something I don't quite know what to do about with the SpansExtractor code that I shamelessly stole. Let's say my text is a b c d e f g h and my query is a AND z. The implementation I stole for SpansExtractor (mentioned

Re: Multiword Highlighting

2007-02-15 Thread Mark Miller
Good catch, Erick! I'll have to tackle this as well. Mark H is the originator of that code so maybe he will chime in, but what I am thinking is this: in getSpansFromBooleanquery, keep track of which clauses are required. Then, based on whether any Spans are actually returned from getSpansFromTerm
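A minimal sketch of the required-clause idea, assuming the intent is that a query like a AND z should highlight nothing when z never matches. It is plain Java rather than Lucene code: ClauseSpans and spansToHighlight are hypothetical names, not the SpansExtractor from this thread, and spans are modelled as simple {start, end} pairs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class RequiredClauseSketch {

    /** Hypothetical stand-in for one boolean clause and the spans it produced. */
    static final class ClauseSpans {
        final boolean required;
        final List<int[]> spans; // each entry is {start, end} in token positions

        ClauseSpans(boolean required, List<int[]> spans) {
            this.required = required;
            this.spans = spans;
        }
    }

    /**
     * Collects the spans of all clauses, but returns nothing at all when a
     * required clause produced no spans (the document cannot match then).
     * Emptiness is checked with isEmpty() instead of calling get(0), so an
     * empty span list never throws.
     */
    static List<int[]> spansToHighlight(List<ClauseSpans> clauses) {
        List<int[]> combined = new ArrayList<>();
        for (ClauseSpans clause : clauses) {
            if (clause.required && clause.spans.isEmpty()) {
                return Collections.emptyList();
            }
            combined.addAll(clause.spans);
        }
        return combined;
    }
}
```

With the text a b c d e f g h and the query a AND z, the clause for a has one span and the clause for z has none, so this sketch returns an empty list. Prohibited (NOT) clauses are deliberately left out.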

Re: Multiword Highlighting

2007-02-15 Thread Erick Erickson
Mark: Thanks, that reassures me that I'm not hallucinating. If it gets on my priority list I can certainly share the code, since I stole it in the first place <G>. I have a semi-solution for now that gets me out from under the immediate problem, but it really wants a more robust solution than the

Re: Multiword Highlighting

2007-02-15 Thread Mark Miller
Here is my initial attempt...I believe it might be sufficient: import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.Term; import org.apache.lucene.search.BooleanClause; import org.apache.lucene.search.BooleanQuery; import org.apache.lucene.search.PhraseQuery; import

Re: Multiword Highlighting

2007-02-15 Thread Erick Erickson
Excellent! I'll give it a whirl in the morning. This may keep me from having to rebuild my index as well, oh joy! Thanks, Erick

Re: Multiword Highlighting

2007-02-02 Thread Mark Miller
I have been away from this for a week, but my interest has started building again. The whole spans implementation seems to work great for finding the actual hits but there is a somewhat annoying limitation: because I am using Spans it seems I can only either highlight the entire found span or

Re: Multiword Highlighting

2007-02-02 Thread mark harwood
a new SpansBasedHighlighter. Cheers, Mark

Re: Multiword Highlighting

2007-02-02 Thread Mark Miller
mark harwood wrote: Hi Mark, Have you looked at the returned spans from any other potential problem scenarios (other than the 3 word one you suggest) e.g. complex nested SpanOr or SpanNot logic? Nothing super intense, but I have looked at some semi-complex nesting and it all looks great if

Re: Multiword Highlighting

2007-01-28 Thread markharw00d
For what it's worth Mark (Miller), there *is* a need for "just highlight the query terms without trying to get excerpts" functionality - something a la Google cache (different colours...mmm, nice). FWIW, the existing highlighter doesn't *have* to fragment - just pass a NullFragmenter to the

Re: Multiword Highlighting

2007-01-28 Thread Mark Miller
I do use the NullFragmenter now. I have no interest in the fragments at the moment, just in showing hits on the source document. It would be great if I could just show the real hits though. The span approach seems to work fine for me. I have even tested the highlighting using my sentence and

Re: Multiword Highlighting

2007-01-27 Thread Mark Miller
Isn't it semi trivial if you are not interested in the fragments (I swear it seems that most people are not)? Isn't it you that suggested turning the query into a SpanQuery, extracting the spans and then doing the highlighting after a rewrite? This seems somewhat trivial so what am I missing?
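The "extract the spans, then highlight" step can be sketched as follows. This is plain Java, not Lucene code: SpanMarkupSketch and markUp are hypothetical names, the spans are assumed to have been extracted already as {start, end} token positions with an exclusive end, and the tokens are assumed to be the document's words in order.

```java
import java.util.List;

final class SpanMarkupSketch {

    /**
     * Wraps every run of tokens covered by at least one span in <b>...</b>.
     * Overlapping or adjacent spans are merged into a single highlighted run.
     */
    static String markUp(String[] tokens, List<int[]> spans) {
        boolean[] hit = new boolean[tokens.length];
        for (int[] span : spans) {
            int from = Math.max(0, span[0]);
            int to = Math.min(tokens.length, span[1]); // end is exclusive
            for (int pos = from; pos < to; pos++) {
                hit[pos] = true;
            }
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            boolean opens = hit[i] && (i == 0 || !hit[i - 1]);
            boolean closes = hit[i] && (i == tokens.length - 1 || !hit[i + 1]);
            if (opens) {
                out.append("<b>");
            }
            out.append(tokens[i]);
            if (closes) {
                out.append("</b>");
            }
        }
        return out.toString();
    }
}
```

Since there is no fragmenting, the whole document comes back with only the matched positions marked, which is the "show hits on the source document" case discussed here.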

Re: Multiword Highlighting

2007-01-27 Thread markharw00d
Isn't it semi trivial if you are not interested in the fragments (I swear it seems that most people are not)? I haven't conducted a survey but it's the typical web search engine scenario - select only a small subset of the matching document content for display in SERPS. I would expect that

Re: Multiword Highlighting

2007-01-27 Thread Mark Miller
markharw00d wrote: Isn't it semi trivial if you are not interested in the fragments (I swear it seems that most people are not)? I haven't conducted a survey but it's the typical web search engine scenario - select only a small subset of the matching document content for display in

Re: Multiword Highlighting

2007-01-27 Thread Mark Miller
Maybe a new highlighter with no attempt at summarising could more easily address phrase support for small pieces of content. It will always be hard to faithfully represent all possible query match logic - especially if there are NOTs, ANDs and ORs mixed in with all the term proximity

Re: Multiword Highlighting

2007-01-27 Thread Otis Gospodnetic
contrib to contrib/ if you end up working on this. Otis -- Simpy -- http://www.simpy.com/ -- Tag. Search. Share.

Multiword Highlighting

2007-01-26 Thread Anne Conger
Hi, I'm wondering what the best way is to do highlighting of multiword phrases. For example, if a search is for "president kennedy", how can I make sure that "president" is only highlighted if it is next to "kennedy", and that "president" in "president clinton" is not? I haven't figured out where in the process
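To make the desired behaviour concrete, here is a minimal sketch in plain Java that highlights a phrase only where its words are adjacent. It is not the Lucene highlighter: PhraseHighlightSketch and highlight are hypothetical names, tokens are split on whitespace, matching is case-insensitive, and punctuation handling is ignored.

```java
import java.util.Locale;

final class PhraseHighlightSketch {

    /** Wraps each occurrence of the whole phrase in <b>...</b>; lone words are left alone. */
    static String highlight(String text, String phrase) {
        String[] words = text.trim().split("\\s+");
        String[] target = phrase.trim().toLowerCase(Locale.ROOT).split("\\s+");
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < words.length) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (matchesAt(words, i, target)) {
                out.append("<b>");
                for (int j = 0; j < target.length; j++) {
                    if (j > 0) {
                        out.append(' ');
                    }
                    out.append(words[i + j]);
                }
                out.append("</b>");
                i += target.length; // skip past the whole phrase
            } else {
                out.append(words[i]);
                i++;
            }
        }
        return out.toString();
    }

    /** True when the phrase words appear consecutively starting at the given index. */
    private static boolean matchesAt(String[] words, int start, String[] target) {
        if (start + target.length > words.length) {
            return false;
        }
        for (int j = 0; j < target.length; j++) {
            if (!words[start + j].toLowerCase(Locale.ROOT).equals(target[j])) {
                return false;
            }
        }
        return true;
    }
}
```

For the text "president clinton met president kennedy" and the phrase "president kennedy", only the last two words are wrapped; the "president" before "clinton" is left untouched.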

Re: Multiword Highlighting

2007-01-26 Thread markharw00d
This is a deficiency in the highlighter functionality that has been discussed several times before. The summary is - not a trivial fix. See here for background: http://marc2.theaimsgroup.com/?l=lucene-user&m=114631181214303&w=1