Hi
I am experimenting with searching over annotated text: a stream of text is
indexed under one field, and its annotations are indexed under another
field. The annotations are indexed with custom positions that I inject,
based on the original positions of the text they cover.
So for example I might have a piece of text "brown fox" at positions 15,16
that is annotated as "nickname". In that case the annotation consists of
two terms: annot:nick and annot:/nick in positions 15 and 16 respectively.
I then implemented a variation of a span query which I call SpanWithinQuery
which allows me to search for "fox WITHIN nickname". The way it's
implemented is a SpanNearQuery on annot:nick and annot:/nick, and whenever
a match is found for 'fox', it makes sure that the position falls within
the range of the nick annotation. Quite simple.
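To make the check concrete, here is a minimal plain-Java sketch of the
containment test, without any Lucene classes. The class and method names
are made up for illustration, and it assumes the closing marker sits at the
last position covered by the annotation, as in the nick example.

```java
/** Illustration only: models the "term WITHIN annotation" position check. */
final class WithinCheckSketch {

    /**
     * Returns true if termPos lies inside the annotation delimited by the
     * opening marker at openPos and the closing marker at closePos.
     * Assumption: both marker positions are inclusive bounds.
     */
    static boolean within(int termPos, int openPos, int closePos) {
        return openPos <= termPos && termPos <= closePos;
    }

    public static void main(String[] args) {
        // "brown fox" at positions 15,16; annot:nick at 15, annot:/nick at 16.
        System.out.println(within(16, 15, 16)); // "fox" against the nick range
        System.out.println(within(17, 15, 16)); // the token after "fox"
    }
}
```

Note that a single-position annotation has openPos == closePos, which this
check handles without any special casing.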
The problem I ran into is when an annotation spans a single position, in
which case the start and end annotation terms are indexed at the same
position. So e.g. in the example above, if "brown" was further annotated as
"color", I'd have annot:color and annot:/color both indexed at position 15.
Now when I execute SNQ(annot:color, annot:/color, slop=MAX_INT,
ordered=true), I get 0 results.
I tracked this down to NearSpansOrdered.docSpansOrdered(), which documents
and implements the check to require that if the spans start in the same
position, their end position must not be the same.
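For reference, the condition can be modelled in plain Java as below. This
is my own sketch of the check as I read it, not the Lucene source: the
names are hypothetical, and it assumes that a single term at position p
yields the span [p, p+1) and that a tie on start is only accepted when the
first span ends strictly before the second.

```java
/** Illustration only: a plain-Java model of the ordered-spans check. */
final class SpanOrderSketch {

    /**
     * Returns true if span 1 is considered to come before span 2.
     * Assumption: on equal starts, span 1 must end strictly before span 2.
     */
    static boolean ordered(int start1, int end1, int start2, int end2) {
        return (start1 == start2) ? (end1 < end2) : (start1 < start2);
    }

    /** Assumption: a single term at position pos covers [pos, pos + 1). */
    static boolean termsOrdered(int pos1, int pos2) {
        return ordered(pos1, pos1 + 1, pos2, pos2 + 1);
    }

    public static void main(String[] args) {
        System.out.println(termsOrdered(15, 16)); // annot:nick, annot:/nick
        System.out.println(termsOrdered(15, 15)); // annot:color, annot:/color
    }
}
```

Under these assumptions, two single-term spans on the same position always
have identical start and end, so they can never pass the check.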
Setting aside all of the preface above, I have a basic question about the
semantics of SNQ. I tried a simple case -- SNQ("fox", "fox") -- and I get 0
results. Is this a bug in SNQ itself, or is it simply not built to find
spans that are identical in their start/end positions, in which case should
I use another variant of SpanQuery?
BTW, I'm aware that the way I play with the annotated text could probably
be done in other ways, e.g. index annot:color with a payload holding the
span range, but for now I'd like to know if SNQ (or NearSpansOrdered) has a
bug.
Shai