On Fri, Oct 18, 2013 at 5:50 PM, Igor Shalyminov <ishalymi...@yandex-team.ru> wrote:

> But why is it so costly?

I think it's because the matching is inherently complex. It also does high-cost things like allocating a new List and Set for every matched doc (e.g. NearSpansOrdered.shrinkToAfterShortestMatch) to hold all the payloads it encountered within each span. Patches welcome!

> In a regular query we walk postings and match document numbers, in a
> SpanQuery we match position numbers (or position segments), what's the
> principal difference?
> I think it's just that #documents << #positions.

Conceptually, that's right: we just need to decode "more ints" (and also the payloads). But we then need to essentially merge-sort the positions of N terms and "coalesce" them into spans, which is at heart rather costly. Lots of hard-for-CPU-to-predict branches...

But I suspect we could get some good speedups on span queries with a better implementation; https://issues.apache.org/jira/browse/LUCENE-2878 is [slowly] exploring making positions "first class" in Scorer, so you can iterate over position + payload for each hit.

> For "A,sg" and "A,pl" I use unordered SpanNearQueries with the slop=-1.

I didn't even realize you could pass negative slop to span queries. What does that do? Or did you mean slop=1?

> I wrap them into an ordered SpanNearQuery with the slop=0.
>
> I see getPayload() in the profiler top. I think I can emulate payload
> checking with cleverly assigned position increments (and then maximum
> position in a document might jump up to ~10^9 - I hope it won't blow the
> whole index up).
>
> If I remove payload matching and keep only position checking, will it speed
> up everything, or the positions and payloads are the same?

I think it would help to avoid payloads, but I'm not sure by how much. E.g., I see that NearSpansOrdered creates a new Set for every hit just to hold payloads, even if payloads are not going to be used. Really the span scorers should check Terms.hasPayloads up front ...
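
To make the "merge the positions of N terms, then coalesce them into spans" cost a bit more concrete, here is a rough sketch in plain Java. It is not how NearSpansOrdered is actually implemented; the class name OrderedNearSketch, the match method and the int[][] input layout are all made up for illustration, and payloads are ignored entirely. It only shows the kind of per-document position walking an ordered near match needs, compared to a plain doc-id intersection.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified illustration; NOT Lucene's NearSpansOrdered.
class OrderedNearSketch {

    /**
     * positions[i] holds the sorted positions of the i-th query term in ONE document.
     * Returns {start, end} pairs (end inclusive) where the terms occur in query order
     * and at most `slop` other positions lie between them in total.
     */
    static List<int[]> match(int[][] positions, int slop) {
        List<int[]> spans = new ArrayList<>();
        int n = positions.length;
        if (n == 0) {
            return spans;
        }
        int[] cursor = new int[n]; // one read cursor per term; cursors only move forward
        for (int start : positions[0]) {
            int prev = start;
            boolean exhausted = false;
            for (int i = 1; i < n && !exhausted; i++) {
                // Advance term i to its first position after the previous term's position.
                while (cursor[i] < positions[i].length && positions[i][cursor[i]] <= prev) {
                    cursor[i]++;
                }
                if (cursor[i] == positions[i].length) {
                    exhausted = true;
                } else {
                    prev = positions[i][cursor[i]];
                }
            }
            if (exhausted) {
                break; // a later start cannot match either
            }
            // Number of positions inside the span not occupied by the n query terms.
            int gaps = (prev - start) - (n - 1);
            if (gaps <= slop) {
                spans.add(new int[] {start, prev});
            }
        }
        return spans;
    }
}
```

Even in this toy version, the inner loops run over every position of every term in every candidate document, with a data-dependent branch at each step, whereas a plain conjunction only has to compare one doc id per term.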

> My main goal is getting the precise results for a query, so proximity
> boosting won't help, unfortunately.

OK. I wonder if you can somehow identify the spans you care about at indexing time, e.g. A,sg followed by N,sg, and add a span into the index at that point; this would make searching much faster (it becomes a TermQuery). For exact matching (slop=0) you can also index shingles.
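
As a rough illustration of the shingle idea for the slop=0 case, here is a toy sketch in plain Java. It does not use Lucene at all (in a real index you would more likely use something like ShingleFilter from the analysis module); the ShingleSketch class and its index/lookup methods are hypothetical names, and it assumes two-token shingles over a single document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy index for illustration only; it does not use Lucene.
class ShingleSketch {
    // shingle (e.g. "A,sg N,sg") -> positions at which that token pair starts
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Index every pair of adjacent tokens as one combined term. */
    void index(List<String> tokens) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String shingle = tokens.get(i) + " " + tokens.get(i + 1);
            postings.computeIfAbsent(shingle, k -> new ArrayList<>()).add(i);
        }
    }

    /** An exact two-token match becomes a single term lookup. */
    List<Integer> lookup(String first, String second) {
        return postings.getOrDefault(first + " " + second, new ArrayList<>());
    }
}
```

The position matching is paid once at index time, so at search time the exact adjacent pair is a single lookup, at the price of a larger term dictionary.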