What is ProxBooleanTermQuery? I couldn't find it in trunk or in the patch attached to that ticket (https://issues.apache.org/jira/browse/LUCENE-2878). Also, it's still very fuzzy to me how searching and scoring work. Are there any tutorials or talks on how Queries, Scorers, and Collectors interoperate?

--
Igor

23.10.2013, 19:06, "Michael McCandless" <luc...@mikemccandless.com>:
> On Tue, Oct 22, 2013 at 9:43 AM, Igor Shalyminov
> <ishalymi...@yandex-team.ru> wrote:
>
>> Thanks for the link, I'll definitely dig into SpanQuery internals very soon.
>
> You could also just make a custom query. If you start from the
> ProxBooleanTermQuery on that issue, but change it so that it rejects
> hits that didn't have terms in the right positions, then you'll likely
> have a much faster way to do your query.
>
>>>> For "A,sg" and "A,pl" I use unordered SpanNearQueries with the slop=-1.
>>> I didn't even realize you could pass negative slop to span queries.
>>> What does that do? Or did you mean slop=1?
>> I indeed use an unordered SpanNearQuery with the slop = -1 (I saw it on
>> some forum, maybe here:
>> http://www.gossamer-threads.com/lists/lucene/java-user/89377?do=post_view_flat#89377)
>
> Wow, OK. I have no idea what slop=-1 does...
>
>> So far it works for me:)
>>>> I wrap them into an ordered SpanNearQuery with the slop=0.
>>>>
>>>> I see getPayload() in the profiler top. I think I can emulate payload
>>>> checking with cleverly assigned position increments (and then maximum
>>>> position in a document might jump up to ~10^9 - I hope it won't blow the
>>>> whole index up).
>>>>
>>>> If I remove payload matching and keep only position checking, will it
>>>> speed up everything, or the positions and payloads are the same?
>>> I think it would help to avoid payloads, but I'm not sure by how much.
>>> E.g., I see that NearSpansOrdered creates a new Set for every hit
>>> just to hold payloads, even if payloads are not going to be used.
>>> Really the span scorers should check Terms.hasPayloads up front ...
>>>> My main goal is getting the precise results for a query, so proximity
>>>> boosting won't help, unfortunately.
>>> OK.
>>>
>>> I wonder if you can somehow identify the spans you care about at
>>> indexing time, e.g. A,sg followed by N,sg and e.g. add a span into the
>>> index at that point; this would make searching much faster (it becomes
>>> a TermQuery). For exact matching (slop=0) you can also index
>>> shingles.
>> Thanks for the clue, I think it can be a good optimization heuristic.
>> I actually tried a similar approach to optimize search of attributes at the
>> same position.
>> Here's how it was supposed to work for a feature set "S,sg,nom,fem":
>>
>> * the regular approach: split it into grammar atomics: "S", "sg", "nom",
>> "fem". With payloads and positions assigned the right way, this would allow
>> us to search for an arbitrary combination of these attributes _but_ with
>> multiple postings merging.
>> * the experimental approach: sort the atomics lexicographically and index
>> all the subsets: "S", "fem", "nom", "sg", "S,fem", "S,nom", ...,
>> "S,fem,nom,sg". With the preprocessing of the user query the same way (split
>> - sort - join) it would allow us to process the same queries exactly within
>> one posting.
>>
>> This technique is actually used in our current production index based on
>> Yandex.Server engine.
>> But Yandex.Server somehow makes the index size reasonable (within the order
>> of magnitude of original text size), while Lucene index blows up totally (
>> >10 times original text size) and no search performance improvements appear.
>
> That's really odd. I would expect index to become much larger, but
> search performance ought to be much faster since you run simple
> TermQuery.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
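
For reference, the "split - sort - join" scheme described in the thread above can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names (SubsetTerms, normalize, subsetTerms) are hypothetical, no Lucene API is involved, and it is not the code behind the Yandex.Server index. It just shows which terms would be emitted for one feature set and how a user query would be normalized to match them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: generates the terms that the
// "index all subsets" approach would emit for one token's feature set.
class SubsetTerms {

    // Split on commas, sort the atomics, join them back (split - sort - join).
    // The same normalization is applied to the user query at search time.
    static String normalize(String featureSet) {
        String[] atoms = featureSet.split(",");
        Arrays.sort(atoms); // natural String order: "S" sorts before "fem"
        return String.join(",", atoms);
    }

    // Every non-empty subset of the sorted atomics, each joined with commas.
    static List<String> subsetTerms(String featureSet) {
        String[] atoms = normalize(featureSet).split(",");
        List<String> terms = new ArrayList<>();
        // Each bit of the mask selects one atomic; mask 0 (empty set) is skipped.
        for (int mask = 1; mask < (1 << atoms.length); mask++) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < atoms.length; i++) {
                if ((mask & (1 << i)) != 0) {
                    if (sb.length() > 0) {
                        sb.append(',');
                    }
                    sb.append(atoms[i]);
                }
            }
            terms.add(sb.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> terms = subsetTerms("S,sg,nom,fem");
        System.out.println(terms.size() + " terms: " + terms);
        System.out.println("query sg,S -> " + normalize("sg,S"));
    }
}
```

With n atomics per token this emits 2^n - 1 terms at the same position (15 for the four atomics in the example), which gives a rough sense of how quickly the number of postings grows under this approach.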