I didn't make too much progress, and kind of ended up dropping it.
One thing that I played with was creating multiple phrase indexes, one each for 2, 3, 4, and 5 words. I wrote a tokenizer that would batch up the words, so, for the input string:

The quick brown fox jumps over the slow lazy dog.

The tokenizer for 3 words would return:

The quick brown
quick brown fox
brown fox jumps
fox jumps over
...

This seemed like a reasonable start... the problem is resolving the overlap for display, and figuring out which words are the most important. E.g., if the above sentence itself was pretty rare, and you're looking at the phrase-index-3, each one of its sub-phrases would end up being significant. Which one do you show? Or do you combine them into a longer phrase? If so, where do you stop?

It seemed like an easy first approach to try out, but I'm not sure it's even in the right direction...

________________________________
From: Nader Akhnoukh [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 21, 2006 8:14 PM
To: Larry Ogrodnek
Subject: Lucene and SIPs

Hi Lawrence,

I saw a posting to the Lucene group you made in February concerning using Lucene to find SIPs. Did you make any progress with this? I'm able to find significant single terms, but am stumped by phrases.

Thanks,
Nader
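
For reference, the word-batching tokenizer described in the reply above can be sketched roughly as follows. This is a minimal Python illustration, not the original tokenizer: the function name word_shingles, the regex-based word splitting, and the lowercasing are all assumptions made for the example.

```python
import re


def word_shingles(text, n):
    """Return every run of n consecutive words in text.

    Hypothetical helper: words are lowercased and punctuation is dropped,
    which is an assumption and not something the original post specifies.
    """
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    # A window of n words can start at positions 0 .. len(words) - n.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "The quick brown fox jumps over the slow lazy dog."

# One list of phrases per size, standing in for the separate
# 2-, 3-, 4- and 5-word phrase indexes.
shingles_by_size = {n: word_shingles(sentence, n) for n in range(2, 6)}

for phrase in shingles_by_size[3][:4]:
    print(phrase)
```

Each 3-word phrase shares two words with its neighbour, which is the overlap problem raised above: if the whole sentence is rare, every one of these windows will look significant on its own.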