On Sep 12, 2009, at 5:12 AM, Michael McCandless wrote:
OK thanks for the responses. This is indeed tricky stuff!
On Sat, Sep 12, 2009 at 12:28 AM, Mark Miller
<markrmil...@gmail.com> wrote:
They start at the left and march right - each Span always starting
after the last started,
That's not quite always true -- eg I got span 1-8, twice, once I added
"b" as a clause to the SNQ.
You might want exhaustive for highlighting as well - but it's
different algorithms ...
Yeah, how we would represent spans for highlighting is tricky... we
had discussed this ("how to represent spans for aggregate queries")
recently, I think under LUCENE-1522.
I think we'd have to return a tree structure, that mirrors the query's
tree structure, to hold the spans, rather than try to enumerate
("denormalize") all possible expansions. Each leaf node would hold
actual data (position, term, payload, etc.), and then the tree nodes
would express how they are and'd/or'd/near'd together. My app could then
walk the tree to compute any combination I wanted.
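To make the tree idea concrete, here is a minimal sketch of what such a structure might look like. All of the names (SpanNode, SpanLeaf, SpanGroup) are made up for illustration and are not existing Lucene classes; the end position is assumed to be exclusive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only; nothing here exists in Lucene.
abstract class SpanNode {
    abstract int start();                              // first position covered
    abstract int end();                                // one past the last position (assumed exclusive)
    abstract void collectLeaves(List<SpanLeaf> out);   // gather the leaves below this node
}

// A leaf holds the actual data: term, position and (optional) payload.
final class SpanLeaf extends SpanNode {
    final String term;
    final int position;
    final byte[] payload; // null when the position carries no payload

    SpanLeaf(String term, int position, byte[] payload) {
        this.term = term;
        this.position = position;
        this.payload = payload;
    }

    @Override int start() { return position; }
    @Override int end() { return position + 1; }
    @Override void collectLeaves(List<SpanLeaf> out) { out.add(this); }
}

// An inner node records how its children were combined, mirroring the query tree.
final class SpanGroup extends SpanNode {
    enum Kind { AND, OR, NEAR }

    final Kind kind;
    final List<SpanNode> children;

    SpanGroup(Kind kind, List<SpanNode> children) {
        if (children.isEmpty()) {
            throw new IllegalArgumentException("a group needs at least one child");
        }
        this.kind = kind;
        this.children = new ArrayList<>(children);
    }

    @Override int start() {
        int min = Integer.MAX_VALUE;
        for (SpanNode child : children) min = Math.min(min, child.start());
        return min;
    }

    @Override int end() {
        int max = Integer.MIN_VALUE;
        for (SpanNode child : children) max = Math.max(max, child.end());
        return max;
    }

    @Override void collectLeaves(List<SpanLeaf> out) {
        for (SpanNode child : children) child.collectLeaves(out);
    }
}
```

An app would walk the tree (for example via collectLeaves) to pull out whichever positions or payloads it cares about, instead of the query enumerating every expansion up front.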
In the end, I accepted my definition of "works" as: when I ask for
the payloads back, will I end up with a bag of all the payloads that
the Spans touched? I think you do.
Yeah I think you do, except each payload is only returned once. So
it's only the first span that hits a payload that will return it.
So it sounds like SNQ just isn't guaranteed to be exhaustive in how it
enumerates the spans, eg I'll never see that 2nd occurrence of "k",
nor its associated payload.
I believe this is my understanding as well. If Doug and Paul chime
in, maybe we will know better.
That being said, I think it is reasonable to want to have an
exhaustive list of matches, even when they overlap. We simply could
create a new SpanNear that does this.
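As a rough sketch of what "exhaustive" could mean here, the toy code below enumerates every in-order combination of term positions within a slop, including overlapping ones. It is standalone and does not use the Lucene API: ExhaustiveNear is a hypothetical name, each clause is assumed to be a single term with sorted positions, and slop is assumed to count the unmatched positions between the first and last clause.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: brute-force ordered "near" matching.
final class ExhaustiveNear {

    // positions[i] holds the sorted positions of clause i within one document.
    // Each returned int[] has one chosen position per clause, in clause order.
    static List<int[]> enumerate(int[][] positions, int slop) {
        List<int[]> out = new ArrayList<>();
        extend(positions, slop, 0, new int[positions.length], out);
        return out;
    }

    private static void extend(int[][] positions, int slop, int clause,
                               int[] chosen, List<int[]> out) {
        if (clause == positions.length) {
            out.add(chosen.clone()); // every clause has a position: record the match
            return;
        }
        for (int pos : positions[clause]) {
            // Ordered match: each clause must sit strictly after the previous one.
            if (clause > 0 && pos <= chosen[clause - 1]) continue;
            chosen[clause] = pos;
            // Assumed slop definition: gap positions between first clause and this one.
            int gaps = pos - chosen[0] - clause;
            if (gaps > slop) break; // positions are sorted, so later ones only get wider
            extend(positions, slop, clause + 1, chosen, out);
        }
    }
}
```

With one clause at position 0 and a second clause at positions 3 and 5, a large slop yields both combinations, so the second occurrence is reported rather than skipped. The cost is that the number of matches can grow combinatorially with repeated terms.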
For now I'll just match this behavior ("can only load payload once")
in all codecs in LUCENE-1458... the test passes again once I do that.
I meant, all those Spans came from one query - so you got your bag
of payloads, right? If each Span was a separate entity, it would
obviously be way wrong - but from a single SpanQuery, at least you
got all the payloads in some form :)
Right, this is all one query... but the payload for the 2nd
occurrence of "k" was never included in any span so I didn't get "all"
payloads.
Maybe if/once we incorporate spans into Lucene's normal queries
(optionally, so there's no performance hit if you don't ask for them)
we can re-visit these issues.
Good luck with that! :-) The SpanQuery classes themselves ask for them
as it is now. The bigger bugaboo to fix, I think, is the use case I
laid out a bit ago, where it is a real pain to combine the results of
running the query with access to the Spans without having to
constantly reset/skipTo.