Hi all! I have a ~20 GB index of documents whose words have several attributes associated with them, e.g.:
WORD: word_1 word_2 ... word_n
POS: pos1_1:pos1_2:pos1_3 pos2 ... pos_n_1:pos_n_2
LEMMA: lemma1_1:lemma1_2:lemma1_3 lemma2 ... lemma_n_1:lemma_n_2

Field tokens separated by ':' are ambiguous, i.e. they correspond to the same position in the document. An important detail of ambiguous word attributes is that, e.g., pos1_1 corresponds only to lemma1_1, not to lemma1_2 or lemma1_3, so a search for pos1_1 & lemma1_3 at the same position must not match word_1.

I handle the position of ambiguous tokens with the standard positionIncrement = 0, and the correspondence between attribute numbers with token payloads: say, lemma1_1 has payload 1 and lemma1_2 has payload 2; pos1_1 has payload 1, pos1_2 has payload 2, and so on. When searching for token attributes at the same position, I use a payload filter that checks whether the payloads of all matched tokens are the same.

The problem: SpanNearQueries run very slowly on that index (tens of seconds; a common query matches the majority of the indexed documents).

I don't really know how SpanQueries work in depth. Is there some inefficiency in them by design? Or is payload retrieval that expensive? I'm wondering whether I'm missing something obvious that slows down the entire search.

--
Best regards,
Igor
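
P.S. To make the payload check concrete, here is a minimal sketch in plain Java of what the filter is meant to decide. It deliberately uses no Lucene classes: PayloadMatchSketch, Token, analyze, matches and query are made-up names for illustration only, and the in-memory token list merely stands in for the real index. It assumes that alternatives separated by ':' share one position and are numbered 1, 2, 3, ... as their payload.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the real analyzer or query code.
class PayloadMatchSketch {

    /** A token as it would sit in the index: field, term, position, payload. */
    static final class Token {
        final String field;
        final String term;
        final int position;
        final int payload;

        Token(String field, String term, int position, int payload) {
            this.field = field;
            this.term = term;
            this.position = position;
            this.payload = payload;
        }
    }

    /**
     * Expands "a:b:c d" into tokens. Alternatives separated by ':' keep the
     * same position (positionIncrement = 0) and get payloads 1, 2, 3, ...
     */
    static List<Token> analyze(String field, String text) {
        List<Token> tokens = new ArrayList<>();
        String[] slots = text.trim().split("\\s+");
        for (int pos = 0; pos < slots.length; pos++) {
            String[] readings = slots[pos].split(":");
            for (int i = 0; i < readings.length; i++) {
                tokens.add(new Token(field, readings[i], pos, i + 1));
            }
        }
        return tokens;
    }

    /**
     * True if some position holds every wanted field/term pair
     * and all of those tokens carry the same payload.
     */
    static boolean matches(List<Token> doc, Map<String, String> wanted) {
        // Key "position/payload" -> fields already satisfied for that key.
        Map<String, Set<String>> hits = new HashMap<>();
        for (Token t : doc) {
            if (!t.term.equals(wanted.get(t.field))) {
                continue; // token is not one of the wanted terms
            }
            String key = t.position + "/" + t.payload;
            Set<String> fields = hits.computeIfAbsent(key, k -> new HashSet<>());
            fields.add(t.field);
            if (fields.size() == wanted.size()) {
                return true;
            }
        }
        return false;
    }

    /** Builds a two-attribute query: POS and LEMMA at the same position. */
    static Map<String, String> query(String pos, String lemma) {
        Map<String, String> wanted = new HashMap<>();
        wanted.put("POS", pos);
        wanted.put("LEMMA", lemma);
        return wanted;
    }
}
```

With POS "pos1_1:pos1_2:pos1_3 pos2" and LEMMA "lemma1_1:lemma1_2:lemma1_3 lemma2", matches accepts pos1_1 & lemma1_1 and rejects pos1_1 & lemma1_3, which is the behaviour I need from the real queries.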