Re: Lucene in-memory index

Michael McCandless Sat, 19 Oct 2013 03:54:39 -0700

On Fri, Oct 18, 2013 at 5:50 PM, Igor Shalyminov
<ishalymi...@yandex-team.ru> wrote:
> But why is it so costly?


I think it's because the matching is inherently complex?  But also
because it does high-cost things like allocating a new List and Set for
every matched doc (e.g. NearSpansOrdered.shrinkToAfterShortestMatch) to
hold all payloads it encountered within each span. Patches welcome!

> In a regular query we walk postings and match document numbers, in a 
> SpanQuery we match position numbers (or position segments), what's the 
> principal difference?
> I think it's just that #documents << #positions.

Conceptually, that's right: we just need to decode "more ints" (and
also the payloads).  But we then need to essentially merge-sort the
positions of N terms and "coalesce" them into spans, which is at heart
rather costly.  Lots of hard-for-CPU-to-predict branches...
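
To make the merge-and-coalesce step concrete, here is a rough plain-Java
sketch. It is not Lucene's code: the class and method names are
hypothetical, positions are assumed to be already decoded into sorted int
arrays (one per term), every term is assumed to have length 1, and slop is
assumed to mean "window width minus number of terms". It only illustrates
why the inner loop is branch-heavy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not part of Lucene.
final class SpanMergeSketch {

    /**
     * Merges the sorted position lists of N terms within one document and
     * returns every window {start, end} (inclusive) that contains one
     * position from each term and satisfies (width - N) <= slop.
     */
    static List<int[]> unorderedSpans(int[][] positions, int slop) {
        int n = positions.length;
        List<int[]> spans = new ArrayList<>();
        // Heap entries are {position, termIndex, offsetInTermList},
        // ordered by position: this is the N-way merge.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        int max = Integer.MIN_VALUE;
        for (int t = 0; t < n; t++) {
            if (positions[t].length == 0) {
                return spans; // a term with no positions means no match
            }
            heap.add(new int[] {positions[t][0], t, 0});
            max = Math.max(max, positions[t][0]);
        }
        while (true) {
            int[] min = heap.poll();
            // The window [min, max] currently covers one position per term.
            int width = max - min[0] + 1;
            if (width - n <= slop) {
                spans.add(new int[] {min[0], max});
            }
            // Advance the term that owns the smallest position.
            int next = min[2] + 1;
            if (next == positions[min[1]].length) {
                return spans; // that term is exhausted, no more windows
            }
            int p = positions[min[1]][next];
            heap.add(new int[] {p, min[1], next});
            max = Math.max(max, p);
        }
    }
}
```

For example, with term A at positions {1, 10} and term B at {2, 20},
unorderedSpans(new int[][] {{1, 10}, {2, 20}}, 0) yields the single
window [1, 2]. Every heap step involves data-dependent comparisons, and
this runs per document on top of the usual doc-ID matching.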

But I suspect we could get some good speedups on span queries with a
better implementation;
https://issues.apache.org/jira/browse/LUCENE-2878 is [slowly]
exploring making positions "first class" in Scorer, so you can iterate
over position + payload for each hit.

> For "A,sg" and "A,pl" I use unordered SpanNearQueries with the slop=-1.

I didn't even realize you could pass negative slop to span queries.
What does that do?  Or did you mean slop=1?

> I wrap them into an ordered SpanNearQuery with the slop=0.
>
> I see getPayload() in the profiler top. I think I can emulate payload 
> checking with cleverly assigned position increments (and then maximum 
> position in a document might jump up to ~10^9 - I hope it won't blow the 
> whole index up).
>
> If I remove payload matching and keep only position checking, will it speed 
> up everything, or the positions and payloads are the same?

I think it would help to avoid payloads, but I'm not sure by how much.
E.g., I see that NearSpansOrdered creates a new Set for every hit
just to hold payloads, even if payloads are not going to be used.
Really the span scorers should check Terms.hasPayloads up front ...

> My main goal is getting the precise results for a query, so proximity 
> boosting won't help, unfortunately.

OK.

I wonder if you can somehow identify the spans you care about at
indexing time, e.g. A,sg followed by N,sg and e.g. add a span into the
index at that point; this would make searching much faster (it becomes
a TermQuery).  For exact matching (slop=0) you can also index
shingles.
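
As a rough illustration of the shingle idea, the plain-Java sketch below
emits each token plus a combined token for every adjacent pair, so an
exact "A,sg followed by N,sg" lookup becomes a single-term lookup. The
class name, method name, and the "_" separator are assumptions made for
this example; in a real index the analysis chain would do this (Lucene
ships a ShingleFilter for the purpose), and the separator must not occur
inside your tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class ShingleSketch {

    /**
     * Returns the original tokens interleaved with two-token shingles,
     * e.g. [a, b, c] with sep "_" gives [a, a_b, b, b_c, c].
     */
    static List<String> withBigrams(List<String> tokens, String sep) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));
            if (i + 1 < tokens.size()) {
                // Combined token that stands for "token i followed by token i+1".
                out.add(tokens.get(i) + sep + tokens.get(i + 1));
            }
        }
        return out;
    }
}
```

The trade-off is a larger index (roughly one extra term per position for
bigrams) in exchange for not touching positions or payloads at search
time for those pairs.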
