Hi,

2014-10-31 13:37 GMT+01:00 Uwe Schindler <[email protected]>:

> Hi,
>
> be aware that FieldValueFilter uses FieldCache (and may possibly use
> DocValues, if indexed with that – I am not sure for this case),

yes, I'm trying with (binary/sorted) docValues.

> so it might be slower on the first run. In any case, as this is a BitSet
> filter, it's best if executed with another query that drives the iteration.
> Otherwise it is plain stupid incrementing document numbers until a match is
> found.

my use case is a plain "field xyz exists" query, so I am just interested in retrieving those documents having the field xyz with whatever value (empty string included).

> In theory, TermRangeQuery should return the same results, but maybe you
> have some issues with deleted documents?

no, that's just a test case where I don't have deletions.

> Another thing might be that the wildcard does not match all your fields,
> e.g. maybe because it’s the empty string? In theory it should match, it
> would just be something to look into. Maybe there is a real bug.

the strange thing is that both WildcardQuery and TermRangeQuery return the same (wrong) hit count.

> Which version of Lucene?

I'm using trunk.

> Is the number returned by FieldValueFilter identical to
> TermRange/Wildcard? Or is it correct with respect to your other approach?

the FieldValueFilter and the TermQuery (meaning I index each doc's field names into another field and search for fields:xyz) return the right number (100k), while TermRangeQuery and WildcardQuery both return fewer hits. I figured out it's because of empty Strings; as you said, this should be working though.

Regards,
Tommaso

> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>
> *From:* Tommaso Teofili [mailto:[email protected]]
> *Sent:* Friday, October 31, 2014 1:21 PM
> *To:* [email protected]
> *Subject:* Re: "field exists" queries and benchmarks
>
> thanks Uwe!
>
> Performance does not seem much different (WildcardQuery seems to dominate),
> are there any specific docValue settings to make that work the best?
>
> One more question, does anyone know why TermRangeQuery
> (somefield:[* TO *]) and WildcardQuery (somefield:*) do not return the
> exact number of docs having that field? See my test output for a field all
> 100k documents have (with a random value):
>
> [junit4] 1> changing:[* TO *]
> [junit4] 1> 99526 hits
> [junit4] 1> changing:*
> [junit4] 1> 99526 hits
> [junit4] 1> fields:changing
> [junit4] 1> 100000 hits
>
> Regards,
> Tommaso
>
> 2014-10-30 17:46 GMT+01:00 Uwe Schindler <[email protected]>:
>
> Hi,
>
> there is already a Filter available (that optimizes this special case):
>
> http://lucene.apache.org/core/4_10_1/core/org/apache/lucene/search/FieldValueFilter.html
>
> To make a query out of it, use ConstantScoreQuery. But this filter is
> better used as a real filter, because it has a bitset behind it.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>
> *From:* Tommaso Teofili [mailto:[email protected]]
> *Sent:* Thursday, October 30, 2014 5:34 PM
> *To:* [email protected]
> *Subject:* "field exists" queries and benchmarks
>
> Hi all,
>
> I'm doing some (rough) tests / benchmarks in order to understand what's
> the best way of doing a "field exists" query.
>
> As far as I could find we can use TermRangeQuery (somefield:[* TO *]),
> WildcardQuery (somefield:*) or a plain TermQuery on another field where the
> doc's field names have been indexed (fields:somefield).
>
> Besides some other suggestion on how to accomplish that (very much
> welcome), I'd like to understand what is the expected performance of each
> of the above approaches because in my case the TermRangeQuery seems to be
> the least performant while the other 2 are on average on the same level.
>
> One strange thing is that with TermRangeQuery and WildcardQuery the
> hit count is not fully correct, meaning that with 100k docs I get the
> correct hit count only with the TermQuery approach.
>
> Code and sample outputs can be found at [1].
>
> Any hint would be appreciated.
>
> Regards,
> Tommaso
>
> [1] : https://gist.github.com/tteofili/52856d938fcd465eab58
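
For anyone following the thread, here is a rough toy model of why the three "field exists" approaches could disagree when some values are empty strings. It is plain Java and deliberately uses no Lucene API: the class FieldExistsSketch, its methods, and the whitespace tokenizer are all hypothetical names made up for illustration. The key assumption (not confirmed anywhere in this thread) is that an empty value produces no indexed terms, so approaches that enumerate terms cannot see that document, while a per-document value lookup or a separate "fields" field still can.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: this is NOT Lucene code and does not call any Lucene API.
class FieldExistsSketch {
    // field name -> (term -> ids of docs containing that term)
    private final Map<String, TreeMap<String, Set<Integer>>> postings = new HashMap<>();
    // field name -> ids of docs holding a per-document value (stand-in for doc values)
    private final Map<String, Set<Integer>> docsWithValue = new HashMap<>();
    private int nextDocId = 0;

    // Assumption: the analyzer splits on whitespace, so "" yields no terms at all.
    static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    private void index(String field, String term, int docId) {
        postings.computeIfAbsent(field, k -> new TreeMap<>())
                .computeIfAbsent(term, k -> new TreeSet<>())
                .add(docId);
    }

    int addDocument(Map<String, String> fields) {
        int docId = nextDocId++;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String field = e.getKey();
            for (String term : tokenize(e.getValue())) {
                index(field, term, docId);
            }
            // The per-document value is recorded even when it is empty.
            docsWithValue.computeIfAbsent(field, k -> new TreeSet<>()).add(docId);
            // The field name itself is indexed as a term of the extra "fields" field.
            index("fields", field, docId);
        }
        return docId;
    }

    // Models somefield:[* TO *] and somefield:* : union of the postings of every term.
    int countByTermEnumeration(String field) {
        Set<Integer> hits = new TreeSet<>();
        for (Set<Integer> docs : postings.getOrDefault(field, new TreeMap<>()).values()) {
            hits.addAll(docs);
        }
        return hits.size();
    }

    // Models a value-presence filter: any doc that has a per-document value.
    int countByValuePresence(String field) {
        return docsWithValue.getOrDefault(field, Collections.emptySet()).size();
    }

    // Models fields:somefield : a single term lookup on the field-name field.
    int countByFieldNameTerm(String field) {
        return postings.getOrDefault("fields", new TreeMap<>())
                .getOrDefault(field, Collections.emptySet())
                .size();
    }

    // Five docs all carry the field "changing"; two of them with an empty value.
    static int[] demo() {
        FieldExistsSketch idx = new FieldExistsSketch();
        for (String value : new String[] {"a", "", "b c", "", "d"}) {
            idx.addDocument(Collections.singletonMap("changing", value));
        }
        return new int[] {
            idx.countByTermEnumeration("changing"),
            idx.countByValuePresence("changing"),
            idx.countByFieldNameTerm("changing")
        };
    }

    public static void main(String[] args) {
        int[] c = demo();
        System.out.println("term enumeration: " + c[0] + " hits");
        System.out.println("value presence:   " + c[1] + " hits");
        System.out.println("field-name term:  " + c[2] + " hits");
    }
}
```

In this model the term-enumeration count comes out lower than the other two by exactly the number of empty values, which has the same shape as the 99526 vs 100000 numbers quoted above. Whether this is what really happens on trunk depends on how the field is indexed (analyzed or not), so treat it as a way to reason about the discrepancy rather than an explanation of it.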
