Under the hood of SpanQueries

Igor Shalyminov Wed, 03 Apr 2013 14:55:51 -0700

Hi all!

I have a ~20GB index of documents in which every word has several attributes 
associated with it, e.g.:


WORD:  word_1 word_2 ... word_n
POS:   pos1_1:pos1_2:pos1_3 pos2 ... pos_n_1:pos_n_2
LEMMA: lemma1_1:lemma1_2:lemma1_3 lemma2 ... lemma_n_1:lemma_n_2

Field tokens separated by ':' are ambiguous variants, i.e. they occupy the 
same position in the document.
An important detail is that the variants are aligned across fields: pos1_1 
corresponds only to lemma1_1, not to lemma1_2 or lemma1_3. A search for 
pos1_1 & lemma1_3 at the same position must therefore not match word_1.

I handle the shared position of ambiguous tokens with the standard 
positionIncrement = 0, and the correspondence between variants with token 
payloads: lemma1_1 has payload 1, lemma1_2 has payload 2; pos1_1 has payload 
1, pos1_2 has payload 2, and so on. When searching for token attributes at 
the same position, I use a payload filter that checks that all matched tokens 
carry the same payload.
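
To make the scheme concrete, here is a minimal plain-Java sketch of it. It 
does not use the Lucene API; the class AmbiguousTokens and its methods 
tokenize and matchesSameVariant are hypothetical names that only model the 
positions and payloads described above, assuming positions start at 0 and 
payloads count variants from 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the indexing scheme; not Lucene code.
public class AmbiguousTokens {

    /** One token: term text, absolute position, and variant number (the payload). */
    public static final class Token {
        final String term;
        final int position;
        final int payload;

        Token(String term, int position, int payload) {
            this.term = term;
            this.position = position;
            this.payload = payload;
        }
    }

    /**
     * Splits a field value such as "a1:a2 b" into tokens. The first variant of
     * each word advances the position by 1; later variants use increment 0,
     * so they share that position. Payloads number the variants from 1.
     */
    public static List<Token> tokenize(String fieldValue) {
        List<Token> tokens = new ArrayList<>();
        int position = -1;
        for (String word : fieldValue.trim().split("\\s+")) {
            String[] variants = word.split(":");
            for (int i = 0; i < variants.length; i++) {
                int increment = (i == 0) ? 1 : 0;
                position += increment;
                tokens.add(new Token(variants[i], position, i + 1));
            }
        }
        return tokens;
    }

    /** True if both terms occur at the same position with the same payload. */
    public static boolean matchesSameVariant(List<Token> posField, String pos,
                                             List<Token> lemmaField, String lemma) {
        for (Token p : posField) {
            if (!p.term.equals(pos)) {
                continue;
            }
            for (Token l : lemmaField) {
                if (l.term.equals(lemma)
                        && l.position == p.position
                        && l.payload == p.payload) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Token> pos = tokenize("pos1_1:pos1_2:pos1_3 pos2");
        List<Token> lemma = tokenize("lemma1_1:lemma1_2:lemma1_3 lemma2");
        // Same position, same payload: prints true.
        System.out.println(matchesSameVariant(pos, "pos1_1", lemma, "lemma1_1"));
        // Same position, different payloads: prints false.
        System.out.println(matchesSameVariant(pos, "pos1_1", lemma, "lemma1_3"));
    }
}
```

The nested loop is only for illustration; in the real index the position 
match comes from the span query and the payload comparison from the filter.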

The problem: SpanNearQueries run very slowly on that index (tens of seconds), 
and a common query matches the majority of the indexed documents.
I don't know in depth how SpanQueries work. Is there some inefficiency in 
them by design? Or is payload retrieval that expensive?
I'm wondering whether I'm missing something obvious that slows down the 
entire search.

-- 
Best regards,
Igor

