Re: Fixing query-time multi-word synonym issue

Michael McCandless Sat, 26 Jan 2013 14:55:51 -0800

On Sat, Jan 26, 2013 at 11:16 AM, Jack Krupansky
<j...@basetechnology.com> wrote:
> Yeah, I suspected that it was going to be expert/cryptic. I think the real
> point, from my September proposal, was that we need a common piece of code
> that has all that expert smarts to reconstruct the "graph" and then can
> generate the Lucene Query structure that will match that reconstructed
> graph.


That would be nice to have.  Maybe a possible starting point is
TokenStreamToAutomaton?  The only catch is that this creates an
Automaton with bytes as the labels, but we (somehow) need tokens/terms
as the labels.  Then we just need AutomatonQuery ;)
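
A minimal sketch of what "terms as the labels" could look like, not Lucene code: Tok, Arc, toArcs and paths are hypothetical names, and the two int fields stand in for the values PositionIncrementAttribute and PositionLengthAttribute would report. Each token becomes one arc from its start position to start + posLen, and walking the arcs recovers every path through the graph.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins only; none of these are Lucene classes.
class TokenGraphSketch {

    static final class Tok {
        final String term;
        final int posInc;  // distance from the previous token's start position
        final int posLen;  // number of positions this token spans
        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    static final class Arc {
        final int from;
        final int to;
        final String label;
        Arc(int from, int to, String label) {
            this.from = from;
            this.to = to;
            this.label = label;
        }
    }

    // Each token becomes one arc from its start position to start + posLen.
    static List<Arc> toArcs(List<Tok> tokens) {
        List<Arc> arcs = new ArrayList<>();
        int pos = -1;  // so that a first token with posInc=1 starts at position 0
        for (Tok t : tokens) {
            pos += t.posInc;
            arcs.add(new Arc(pos, pos + t.posLen, t.term));
        }
        return arcs;
    }

    // Enumerate every path from state 0 to the last state as a space-joined phrase.
    static List<String> paths(List<Arc> arcs) {
        int end = 0;
        for (Arc a : arcs) {
            end = Math.max(end, a.to);
        }
        List<String> out = new ArrayList<>();
        walk(arcs, 0, end, "", out);
        return out;
    }

    private static void walk(List<Arc> arcs, int state, int end, String prefix, List<String> out) {
        if (state == end) {
            out.add(prefix);
            return;
        }
        for (Arc a : arcs) {
            if (a.from == state) {
                String next = prefix.isEmpty() ? a.label : prefix + " " + a.label;
                walk(arcs, a.to, end, next, out);
            }
        }
    }
}
```

Fed the tokens domain (posInc=1, posLen=1), dns (posInc=0, posLen=3), name (1, 1) and service (1, 1), the sketch yields two paths, "domain name service" and "dns"; a query builder would then OR those together.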

> It would also be great to have clear, detailed doc of the linear
> graph encoding, but it is the code to reconstitute the graph that is the
> real goal.

+1

> And, of course, we need Synonym filter to do the full graph encoding. Last
> time I looked (September), the PosLenAtt didn't seem to have a value that I
> could make sense of in terms of the graph (I think it always had the same
> value), but maybe I just wasn't interpreting it according to the
> undocumented rules.

At some point it was changed to set PosLenAtt, but only if the output
is a single token (eg, domain name service -> dns; in this case the
dns token will have PosLenAtt=3).

Creating new positions (dns -> domain name service) is a harder
change; normally it's only Tokenizers that create positions.  But eg
WordDelimiterFilter also needs to create positions.

It would be nice to have some supporting infrastructure to make it
easier for token filters to create new positions; the trickiness is it
requires the token filter to be non-deterministic in general because
the pos len of a given token can be a function of whether future
tokens are split (creating new positions).
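
One way to picture the lookahead problem, as a sketch under stated assumptions rather than a proposed API: adjustedPosLen and partsPerPosition are hypothetical names. Suppose a token originally spans two positions (say a synonym over "wi-fi network"), and a later split turns "wi-fi" into two positions. The spanning token then needs a pos len of 3, which can only be known after seeing how every position it covers is split.

```java
// Hypothetical helper, not a Lucene API.
class PosLenLookaheadSketch {

    // partsPerPosition[p] is how many positions original position p occupies
    // once splitting is done (1 means "not split").
    // Returns the position length that a token starting at `start` and
    // originally spanning `posLen` positions needs after those splits.
    static int adjustedPosLen(int start, int posLen, int[] partsPerPosition) {
        int adjusted = 0;
        for (int p = start; p < start + posLen; p++) {
            adjusted += partsPerPosition[p];
        }
        return adjusted;
    }
}
```

The loop reads entries for positions after `start`, which is the point: the filter cannot emit the spanning token with the right pos len until it has buffered and inspected the tokens that follow it.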

> I think a fair number of people want to be able to do query-time synonyms,
> so I wouldn't classify that as an "expert" use case, but I agree that
> dealing with graph encoding/decoding is more of an expert chore.

Right, it would be nice to support both, and since query-time could in
theory work "correctly" (unlike index time since we can't index the
graph), it's perhaps the more compelling / easier route for apps that
really need exact/expanding multi-word synonyms.

But we have to fix the QueryParsers first: indexing already cannot
properly encode the graph, and query-time cannot get past the
QueryParser ...
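
A toy illustration of the query-time half of that problem, assuming the parser splits the query text on whitespace and analyzes each chunk separately. Everything here is hypothetical (analyze, parseChunkByChunk, and the single hard-coded rule); no Lucene classes are involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; no Lucene classes are used.
class PreSplitSketch {

    // Toy synonym rule: the three-word phrase maps to a single token.
    static final List<String> RULE_INPUT = Arrays.asList("domain", "name", "service");
    static final String RULE_OUTPUT = "dns";

    // Stand-in for an analyzer with a synonym filter: returns the input words,
    // plus the synonym if the whole rule input occurs as a contiguous run.
    static List<String> analyze(String text) {
        List<String> words = Arrays.asList(text.trim().split("\\s+"));
        List<String> out = new ArrayList<>(words);
        if (Collections.indexOfSubList(words, RULE_INPUT) >= 0) {
            out.add(RULE_OUTPUT);
        }
        return out;
    }

    // Stand-in for a parser that splits on whitespace first and then
    // analyzes each chunk on its own.
    static List<String> parseChunkByChunk(String query) {
        List<String> out = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            out.addAll(analyze(chunk));
        }
        return out;
    }
}
```

Analyzing "domain name service" as one string produces the dns synonym, while the chunk-by-chunk route never does, because the rule never sees more than one word at a time.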

> Although,
> in the end, it is really just a small number of people who work on query
> parsers who actually have a "need to know" about synonym graphs.

True.

Mike McCandless

http://blog.mikemccandless.com
