The problem here is mainly with the Sorted*DocValues APIs, which return a TermsEnum but don’t have a Terms instance to call intersect() on. So maybe the thing to do is to add a termsEnum(CompiledAutomaton) method to SortedDocValues and SortedSetDocValues? That should avoid the trap of bypassing Terms.intersect() when it’s available. I’ll try to work up a patch.
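For illustration only, here is a minimal sketch of the dispatch such a method might perform. It does not use Lucene: Compiled, SortedValues and DocValuesTermsEnumSketch are hypothetical stand-ins for CompiledAutomaton and the Sorted*DocValues classes, and plain strings stand in for terms. The point it tries to show is that the matcher is consulted only in the NORMAL case, so a fixed-string (SINGLE) automaton never reaches it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Simplified stand-ins only: none of these types are Lucene classes.
final class DocValuesTermsEnumSketch {

    enum Type { NONE, ALL, SINGLE, NORMAL }

    /** Stand-in for CompiledAutomaton: SINGLE carries a term, NORMAL a matcher. */
    static final class Compiled {
        final Type type;
        final String term;               // set only for SINGLE
        final Predicate<String> matcher; // set only for NORMAL

        private Compiled(Type type, String term, Predicate<String> matcher) {
            this.type = type;
            this.term = term;
            this.matcher = matcher;
        }

        static Compiled none() { return new Compiled(Type.NONE, null, null); }
        static Compiled all() { return new Compiled(Type.ALL, null, null); }
        static Compiled single(String term) { return new Compiled(Type.SINGLE, term, null); }
        static Compiled normal(Predicate<String> m) { return new Compiled(Type.NORMAL, null, m); }
    }

    /** Stand-in for a Sorted*DocValues-like source of sorted terms. */
    interface SortedValues {
        List<String> sortedTerms();

        /** Unfiltered enumeration, analogous to the existing termsEnum(). */
        default Iterator<String> termsEnum() {
            return sortedTerms().iterator();
        }

        /**
         * Filtered enumeration. The matcher is touched only in the NORMAL case.
         * An implementation with a faster intersect could override this method.
         */
        default Iterator<String> termsEnum(Compiled automaton) {
            switch (automaton.type) {
                case NONE:
                    return Collections.emptyIterator();
                case ALL:
                    return termsEnum();
                case SINGLE: {
                    // Terms are sorted, so an exact lookup is a binary search.
                    int idx = Collections.binarySearch(sortedTerms(), automaton.term);
                    return idx >= 0
                        ? Collections.singletonList(automaton.term).iterator()
                        : Collections.emptyIterator();
                }
                case NORMAL:
                    return sortedTerms().stream().filter(automaton.matcher).iterator();
                default:
                    throw new IllegalStateException("unhandled case: " + automaton.type);
            }
        }
    }

    /** Drains an iterator into a list, for printing and checking. */
    static List<String> collect(Iterator<String> it) {
        List<String> out = new ArrayList<>();
        it.forEachRemaining(out::add);
        return out;
    }

    public static void main(String[] args) {
        SortedValues values = () -> Arrays.asList("apple", "banana", "cherry");
        System.out.println(collect(values.termsEnum(Compiled.single("banana"))));
        System.out.println(collect(values.termsEnum(Compiled.normal(t -> t.contains("e")))));
    }
}
```

Putting the dispatch on the values class rather than on CompiledAutomaton is the design choice being illustrated: the default covers the plain TermsEnum case, while an implementation that does have something like Terms.intersect() available could override the method instead of silently losing the fast path.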

Alan Woodward
www.flax.co.uk

> On 6 Jan 2017, at 21:44, Michael McCandless <luc...@mikemccandless.com> wrote:
>
> Unfortunately I think that's somewhat dangerous because it creates an
> ambiguous API with a nasty performance trap?
>
> I.e. this new method won't invoke the fast Terms.intersect in the
> default terms dict?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Jan 6, 2017 at 3:20 PM, Alan Woodward <a...@flax.co.uk> wrote:
>> Hm, how about something like this, on CompiledAutomaton:
>>
>> public TermsEnum getTermsEnum(TermsEnum te) throws IOException {
>>   switch (type) {
>>     case NONE:
>>       return TermsEnum.EMPTY;
>>     case ALL:
>>       return te;
>>     case SINGLE:
>>       return new SingleTermsEnum(te, term);
>>     case NORMAL:
>>       return new AutomatonTermsEnum(te, this);
>>     default:
>>       // unreachable
>>       throw new RuntimeException("unhandled case");
>>   }
>> }
>>
>>
>> Alan Woodward
>> www.flax.co.uk
>>
>>
>> On 6 Jan 2017, at 19:16, Michael McCandless <luc...@mikemccandless.com>
>> wrote:
>>
>> These automaton intersection APIs are frustrating with all the special
>> case handling... Ideas welcome!
>>
>> We've had similar challenges with them in the past, when a user
>> invoked Terms.intersect directly instead of via CompiledAutomaton:
>> https://issues.apache.org/jira/browse/LUCENE-7576
>>
>> The problem is CompiledAutomaton specializes certain cases (all
>> strings match, no strings match, single term) and sidesteps
>> Terms.intersect for those cases.
>>
>> We should fix AutomatonTermsEnum public ctor w/ the same checks
>> (insist on a NORMAL case) so you don't hit assert failures, or, worse
>> ... I'll do that.
>>
>> I think a new CompiledAutomaton.intersect taking TermsEnum would be
>> tricky in general because it relies on the (efficient) Terms.intersect
>> to handle the NORMAL case well, but we can't invoke that from a
>> TermsEnum.
>>
>> In the SINGLE case, could you use SingleTermsEnum, passing the
>> TermsEnum from your doc values, and the term from the
>> CompiledAutomaton? Would that suffice as a workaround?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Fri, Jan 6, 2017 at 11:17 AM, Alan Woodward <a...@flax.co.uk> wrote:
>>
>> We’ve hit an issue while developing marple, where we want to have the
>> ability to filter the values from a SortedDocValues terms dictionary.
>> Normally you’d create a CompiledAutomaton from the filter string, and then
>> call #getTermsEnum(Terms) on it; but for docvalues, we don’t have a Terms
>> instance, we instead have a TermsEnum.
>>
>> Using AutomatonTermsEnum to wrap the TermsEnum works in most cases here, but
>> if the CompiledAutomaton in question is a fixed string, then we get
>> assertion failures, because ATE uses the compiled automaton’s internal
>> ByteRunAutomaton for filtering, and fixed-string automata don’t have one.
>>
>> Is there a work-around that I’m missing here? Or should I maybe open a JIRA
>> to add a #getTermsEnum(TermsEnum) method to CompiledAutomaton?
>>
>> Alan Woodward
>> www.flax.co.uk
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org