Re: Lucene in-memory index

Igor Shalyminov Fri, 25 Oct 2013 07:00:38 -0700

What is ProxBooleanTermQuery?
I couldn't find it in trunk or in the patch attached to that ticket 
(https://issues.apache.org/jira/browse/LUCENE-2878).
Also, it's still very unclear to me how searching and scoring work. Are there 
any tutorials or talks on how Queries, Scorers, and Collectors interoperate?



-- 
Igor

23.10.2013, 19:06, "Michael McCandless" <luc...@mikemccandless.com>:
> On Tue, Oct 22, 2013 at 9:43 AM, Igor Shalyminov
> <ishalymi...@yandex-team.ru> wrote:
>
>>  Thanks for the link, I'll definitely dig into SpanQuery internals very soon.
>
> You could also just make a custom query.  If you start from the
> ProxBooleanTermQuery on that issue, but change it so that it rejects
> hits that didn't have terms in the right positions, then you'll likely
> have a much faster way to do your query.
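To make the "reject hits that didn't have terms in the right positions" idea concrete, here is a minimal sketch of the per-document position check such a custom query would need. It is plain Java and does not use any Lucene API; the class name PositionFilter, the method hasMatchAtOffset, and the sample positions are hypothetical, and the sketch assumes each term's positions within a document arrive as a sorted int array.

```java
public class PositionFilter {

    // Returns true if some position p in `first` has a partner p + offset in `second`.
    // Both arrays must be sorted ascending (assumed here: positions of one term
    // within one document). offset = 1 means "second term directly follows the first";
    // offset = 0 means "both terms at the same position".
    public static boolean hasMatchAtOffset(int[] first, int[] second, int offset) {
        int i = 0;
        int j = 0;
        while (i < first.length && j < second.length) {
            int wanted = first[i] + offset;
            if (second[j] == wanted) {
                return true;          // terms are in the required relative position
            } else if (second[j] < wanted) {
                j++;                  // second term is too early, advance it
            } else {
                i++;                  // first term is too early, advance it
            }
        }
        return false;                 // no pair found: the document would be rejected
    }

    public static void main(String[] args) {
        int[] adjectivePositions = {3, 10, 42};   // hypothetical positions of "A,sg"
        int[] nounPositions = {7, 11, 50};        // hypothetical positions of "N,sg"
        System.out.println(hasMatchAtOffset(adjectivePositions, nounPositions, 1));
        System.out.println(hasMatchAtOffset(adjectivePositions, nounPositions, 0));
    }
}
```

The two-pointer walk touches each position at most once, so the check is linear in the number of positions and needs no payload access.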
>
>>>>   For "A,sg" and "A,pl" I use unordered SpanNearQueries with the slop=-1.
>>>  I didn't even realize you could pass negative slop to span queries.
>>>  What does that do?  Or did you mean slop=1?
>>  I indeed use an unordered SpanNearQuery with the slop = -1 (I saw it on 
>> some forum, maybe here: 
>> http://www.gossamer-threads.com/lists/lucene/java-user/89377?do=post_view_flat#89377)
>
> Wow, OK.  I have no idea what slop=-1 does...
>
>>  So far it works for me:)
>>>>   I wrap them into an ordered SpanNearQuery with the slop=0.
>>>>
>>>>   I see getPayload() in the profiler top. I think I can emulate payload 
>>>> checking with cleverly assigned position increments (and then maximum 
>>>> position in a document might jump up to ~10^9 - I hope it won't blow the 
>>>> whole index up).
>>>>
>>>>   If I remove payload matching and keep only position checking, will it 
>>>> speed everything up, or do positions and payloads cost about the same?
>>>  I think it would help to avoid payloads, but I'm not sure by how much.
>>>   E.g., I see that NearSpansOrdered creates a new Set for every hit
>>>  just to hold payloads, even if payloads are not going to be used.
>>>  Really the span scorers should check Terms.hasPayloads up front ...
>>>>   My main goal is getting the precise results for a query, so proximity 
>>>> boosting won't help, unfortunately.
>>>  OK.
>>>
>>>  I wonder if you can somehow identify the spans you care about at
>>>  indexing time, e.g. A,sg followed by N,sg and e.g. add a span into the
>>>  index at that point; this would make searching much faster (it becomes
>>>  a TermQuery).  For exact matching (slop=0) you can also index
>>>  shingles.
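As a rough illustration of the shingle suggestion, the sketch below builds two-token shingles from a token sequence so that an exact "X followed by Y" match becomes a lookup of one combined term. It is plain Java, not Lucene's own shingle support (Lucene ships a ShingleFilter in its analysis module, which is not used here); the class name Shingles, the method bigrams, and the "_" separator are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Shingles {

    // Joins every pair of neighbouring tokens into one combined term.
    // A sequence of n tokens yields n - 1 shingles (none if n < 2).
    public static List<String> bigrams(List<String> tokens, String separator) {
        List<String> shingles = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            shingles.add(tokens.get(i) + separator + tokens.get(i + 1));
        }
        return shingles;
    }

    public static void main(String[] args) {
        // Hypothetical token stream of grammar tags.
        List<String> tokens = Arrays.asList("A,sg", "N,sg", "V,pl");
        for (String shingle : bigrams(tokens, "_")) {
            System.out.println(shingle);
        }
    }
}
```

With shingles indexed as ordinary terms, the slop=0 query for "A,sg" followed by "N,sg" reduces to a single term lookup for "A,sg_N,sg", at the cost of roughly one extra term per token.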
>>  Thanks for the clue, I think it can be a good optimization heuristic.
>>  I actually tried a similar approach to optimize search of attributes at the 
>> same position.
>>  Here's how it was supposed to work for a feature set "S,sg,nom,fem":
>>
>>  * the regular approach: split it into grammar atomics: "S", "sg", "nom", 
>> "fem". With payloads and positions assigned the right way, this would allow 
>> us to search for an arbitrary combination of these attributes _but_ with 
>> multiple postings merging.
>>  * the experimental approach: sort the atomics lexicographically and index 
>> all the subsets: "S", "fem", "nom", "sg", "S,fem", "S,nom", ..., 
>> "S,fem,nom,sg". With the preprocessing of the user query the same way (split 
>> - sort - join) it would allow us to process the same queries exactly within 
>> one posting.
>>
>>  This technique is actually used in our current production index based on 
>> Yandex.Server engine.
>>  But Yandex.Server somehow keeps the index size reasonable (within an order 
>> of magnitude of the original text size), while the Lucene index blows up 
>> completely (more than 10 times the original text size) and shows no search 
>> performance improvement.
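The "split - sort - join" scheme described above can be sketched as follows. This is only an illustration in plain Java, not the Yandex.Server or Lucene implementation; the class name FeatureSubsets and the methods normalize and indexTerms are hypothetical. It assumes atomics are comma-separated and sorted by Java's natural String order, which happens to place "S" before the lowercase tags, as in the example list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FeatureSubsets {

    // Query side: split the feature set, sort the atomics, and join them again,
    // so that "sg,S,nom" and "nom,sg,S" map to the same term.
    public static String normalize(String features) {
        String[] atomics = features.split(",");
        Arrays.sort(atomics);
        return String.join(",", atomics);
    }

    // Index side: emit every non-empty subset of the sorted atomics as one term.
    // For n atomics this produces 2^n - 1 terms per token.
    public static List<String> indexTerms(String features) {
        String[] atomics = features.split(",");
        Arrays.sort(atomics);
        List<String> terms = new ArrayList<>();
        for (int mask = 1; mask < (1 << atomics.length); mask++) {
            StringBuilder term = new StringBuilder();
            for (int i = 0; i < atomics.length; i++) {
                if ((mask & (1 << i)) != 0) {
                    if (term.length() > 0) {
                        term.append(',');
                    }
                    term.append(atomics[i]);
                }
            }
            terms.add(term.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> terms = indexTerms("S,sg,nom,fem");
        System.out.println(terms.size() + " terms: " + terms);
        System.out.println(normalize("sg,S,nom"));
    }
}
```

For the four-atomic set "S,sg,nom,fem" this yields 15 terms at a single position instead of 4, which gives a feel for why the index grows so quickly with this scheme.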
>
> That's really odd.  I would expect the index to become much larger, but
> search performance ought to be much faster since you run a simple
> TermQuery.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
