[ https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786610#action_12786610 ]
Uwe Schindler commented on LUCENE-2110: --------------------------------------- >From our IRC chat between Mike and me: {quote} [10:19] ThetaPhi_: mike: any further comments for 2110? [10:19] mikemccand: am just trying to look at it -- waiting on a sloooooow svn up. [10:19] ThetaPhi_: this out of order is not trivial and makes code for automaton and NRQ more complicated [10:19] mikemccand: the change is a great step forward [10:19] mikemccand: yeah forget it [10:19] ThetaPhi_: yes [10:19] ThetaPhi_: you understood the problem [10:19] mikemccand: i had thought it was nearly free [10:19] mikemccand: i don't fully understand the problem [10:19] mikemccand: impossible to keep up with you!! [10:20] ThetaPhi_: when the underlying enum is exhausted (I was exhausted yesterday, too), it stops processing, even if there are more nextSeelkTerms [10:20] mikemccand: yes I saw your exhausted pun ;) [10:20] mikemccand: wait -- if a TermsEnum becomes exhaused, you're unable to seek it again? is that the issue? [10:20] ThetaPhi_: waht you could do is, that after the underlying enum is exhausted, that you call nextSeekTerm and reposition [10:21] ThetaPhi_: yes [10:21] ThetaPhi_: you could seek it [10:21] ThetaPhi_: and this worked at the beginning (first patch without the additional enum constants) [10:21] ThetaPhi_: what it did was: when the underlying enum was exhausted it simply called nextSeekTerm and seeked [10:22] mikemccand: so TermsEnum API allows seeks after next()'ing to exhaustion, right? (I think it should but maybe it' doesn't somewhere?) [10:22] ThetaPhi_: which was ok [10:22] ThetaPhi_: mikemccand: your enums works perfect [10:22] ThetaPhi_: they can be seeked everytime even when exhausted [10:22] mikemccand: ok [10:23] ThetaPhi_: the problem is logic in filteredtermsenum [10:23] mikemccand: ok [10:23] ThetaPhi_: it gets very complicated when you want to support two things: [10:23] ThetaPhi_: a) seeking on request (NO_AND_SEEK, YES_AND_SEEK) [10:24] ThetaPhi_: b) and want to consume all nextSeekTerms() [10:24] ThetaPhi_: if somebody seeks after _AND_SEEK to e.g. the last term of the enum [10:24] ThetaPhi_: in the current impl it will read that term [10:25] ThetaPhi_: call termCompare and so on [10:25] ThetaPhi_: if this accepts the term or not accepts the term [10:25] ThetaPhi_: it will call next() on the underlying enum again [10:25] ThetaPhi_: which then returns null [10:26] ThetaPhi_: with the current patch it will then also end the filteredtermsenum [10:26] mikemccand: ok -- so the termination logic inside FTE.next() got really hairy if on getting null from the actualEnum you had to consider going back for another seek term? [10:26] ThetaPhi_: yes [10:26] ThetaPhi_: i tried it yesterday [10:26] mikemccand: we certainly don't need to support thist oday [10:26] ThetaPhi_: and I had no good test [10:27] ThetaPhi_: yes exactly [10:27] mikemccand: if somehow this ever becomes needed then we add it then [10:27] mikemccand: the new FTE looks great [10:27] ThetaPhi_: the idea was to provide a param to nextSeekTerm which denotes if the underlying enum ended [10:27] mikemccand: NRQ is soooo much simpler :) [10:28] ThetaPhi_: the idea was to call even on exit nextSeekTerm(true), where triue meant exhausted [10:28] mikemccand: i see -- so it involved different API [10:28] ThetaPhi_: TEnums like NRQ or automaton only going forward then know, not to provide seek terms again [10:28] ThetaPhi_: this was important for Automaton not to fall into endless loops [10:29] mikemccand: this is hairy stuff :) [10:29] ThetaPhi_: and good for NRQ to not provide any further terms [10:30] ThetaPhi_: when I tried to implement there was always a problem and you were not able to correctly define waht should happen then [10:30] mikemccand: how come the switch in FTE.next doesn't have "case NO" on the return from accept(term)? [10:31] ThetaPhi_: thats not needed it just falls through and loops to next term [10:31] ThetaPhi_: you could add it as NO: break; [10:31] * mikemccand ahh got it -- maybe just add comment saying so [10:31] ThetaPhi_: good idea, will do [10:32] ThetaPhi_: the NO_AND_seek case is the same it just says doSeek = true [10:32] ThetaPhi_: and for yes it returns the term in both cases, but records doSeek for the next time next() is called [10:32] mikemccand: i wonder if we could simply add a seek term to AcceptStatus? [10:33] mikemccand: vs calling the separate nextSeekTerm method [10:33] ThetaPhi_: enum's contents are final [10:33] mikemccand: not sure it'd be better... just wondering [10:33] ThetaPhi_: because each constant is a singleton, they should be unmodifiable [10:34] mikemccand: ie, fix AcceptStatus to be like TermsEnum.SeekResult [10:34] mikemccand: so it returns status, but also an optional seekTerm which if null means no seeking [10:34] ThetaPhi_: and how to do this, not with an enum constant [10:34] ThetaPhi_: because they are constants [10:35] ThetaPhi_: and createing a new object on each accept call seems not ideal [10:35] mikemccand: sorry, i was confused -- SeekStatus used to be a class that had two attrs -- a TermRef, and the enum; i simplified that a while ago [10:36] ThetaPhi_: aaaah [10:36] ThetaPhi_: now its an enum [10:36] ThetaPhi_: i do not think its a ood idea to return new objects [10:36] mikemccand: right. for a while I had no TermsEnum.term() [10:36] mikemccand: the enums would reuse their status object [10:36] mikemccand: ie API would require that this is fine [10:37] ThetaPhi_: but enums are singleton [10:37] ThetaPhi_: for the whole JVM [10:37] mikemccand: if we did that... then we could go back to YES/NO/END, and, and seekTerm is then orthogonal [10:37] ThetaPhi_: ok [10:37] ThetaPhi_: so a protected setSeekTerm [10:38] mikemccand: we'd make a new "AcceptResult" class. has "AcceptStatus status" member, and also "TermRef seekTerm" [10:38] ThetaPhi_: then we are back at the state before, we should ask robert [10:38] mikemccand: ok [10:38] ThetaPhi_: for him the separation of accept and nextSeekTerm was good [10:38] mikemccand: ahh i see [10:39] ThetaPhi_: but he could do the calculation of the next sring in accept, too [10:39] mikemccand: though, can't he simply emulate? ie, internally call a private nextSeekTerm, stick it on the returned AcceptResult, and return that? [10:39] mikemccand: right [10:39] ThetaPhi_: telephone... [10:39] mikemccand: do we ever call nextSeekTerm, unless AcceptStatus was XXX_AND_SEEK? [10:39] mikemccand: hah real world interrupts [10:40] ThetaPhi_: thats what is done currentl [10:40] ThetaPhi_: brb [10:41] mikemccand: man all of infra seems slow right now... jira giving me intermittent Internal Server Error... svn really slow [10:41] mikemccand: git is tempting [11:04] *** jwtoddii has joined #lucene. [11:15] *** jwtoddii has signed off IRC (). [11:15] ThetaPhi_: re [11:15] mikemccand: hello [11:15] ThetaPhi_: for me infra works normal [11:16] mikemccand: hmmm [11:16] ThetaPhi_: but maybe because i am on the euorpe svn [11:16] mikemccand: you use the eu mirrors? [11:16] mikemccand: ok [11:16] ThetaPhi_: but for jira i do not know [11:16] mikemccand: jira is on/off for me now [11:16] ThetaPhi_: do not think there is a mirror [11:17] ThetaPhi_: [10:39] mikemccand: do we ever call nextSeekTerm, unless AcceptStatus was XXX_AND_SEEK? [11:17] ThetaPhi_: no [11:18] mikemccand: ok [11:18] ThetaPhi_: and doSeek is always reset to false whenever a seek occured [11:19] ThetaPhi_: the global doSeek is just for the YES_AND_SEEK case [11:19] ThetaPhi_: because it exits the loop and returns the term [11:19] ThetaPhi_: so on the next call to next() doSeek is true and we seek [11:20] mikemccand: got it [11:21] ThetaPhi_: an idea about the nextSeekTerm: [11:21] ThetaPhi_: we add a protected setSeekTerm(TermRef) [11:21] ThetaPhi_: this sets a private var [11:22] ThetaPhi_: accept can set this and return YES or no wahtever [11:22] ThetaPhi_: in the case of END it is ignored [11:22] ThetaPhi_: but when next() is then called again, if this internal seek varaible is != null it seeks, else it goes forward with next [11:22] ThetaPhi_: and after seek it is set to null [11:23] ThetaPhi_: and we could add an assert inside this method to check if only seeking forward [11:24] ThetaPhi_: to inform implementors of errors [11:24] ThetaPhi_: assert termComp.compare(seekTerm, term()) > 0; [11:24] mikemccand: why not simply return the optional seek term along w/ AcceptStatus? it'd require no additional methods, and makes it clear that seeking part of accepting. [11:25] mikemccand: ie the existince of new methods in the api (setSeekTerm, nextSeekTerm) enrich the api -- make you wonder when you can call them. eg can ctor call setSeekTerm? [11:25] ThetaPhi_: because accept status is an enum constant which is final, singleton and [11:25] ThetaPhi_: yes ctor can do it at the moment [11:25] ThetaPhi_: its already there its only called setinitialseekterm [11:26] ThetaPhi_: (see docs) [11:26] mikemccand: i mean make a new class (AcceptResult), like i described. it contains an AcceptResult.Status (the enum), and a TermRef. [11:26] ThetaPhi_: the problem is lots of object creation [11:26] mikemccand: there would be no object creation -- we'd reuse? [11:26] ThetaPhi_: in my opinion setSeekTerm is simplier to use [11:27] ThetaPhi_: it would be the same like the reuse case [11:27] ThetaPhi_: and setseekterm can also be used in ctor [11:27] mikemccand: yeah the fact that ctor needs to set initial term does seem to require the extra method [11:28] mikemccand: though i also don't like side-effect methods -- you change the internal state of the class, vs returning the seek term [11:28] mikemccand: ie it makes the api stateful, which except for the initial case, is overkill [11:29] ThetaPhi_: its an iterator it always has an internal state [11:29] ThetaPhi_: i understand your problem [11:29] ThetaPhi_: you call in a method that should only accept something a method that changes state [11:30] mikemccand: the internal state is handled by FTE; my subclass is mostly stateless, if i return the seek term [11:30] ThetaPhi_: so you cannot call it from outside (which would not wrok, because it is protected) [11:30] ThetaPhi_: NRQ also has a state and also automaton [11:31] ThetaPhi_: in NRQ it is the linkedlist with seek terms [11:31] ThetaPhi_: in Automaton the nextString() [11:31] mikemccand: right, they have their own state because of how they iterate. it's just that seek term need not be stateful. [11:32] ThetaPhi_: i think we should ask robert, at the moment i would know how to change his code to support both cases [11:33] mikemccand: in fact, instead of setInitialSeekTerm, could we have getInitialSeekTerm? ie FTE would invoke that once on start [11:33] mikemccand: ok :) [11:33] mikemccand: automaton is scary [11:33] mikemccand: and, powerful [11:33] ThetaPhi_: the getInitialSeek term in the current api is hidden behind the iterator [11:34] ThetaPhi_: setInitialSeekTerm for the ctor is just a convenience [11:34] ThetaPhi_: to prevent subclasses like prefixtermsenum from overrideing and implementing the singleton itertor [11:34] * mikemccand w/ getInitialSeekTerm, there is no sneaky shared state w/ FTE. i mean, FTE keeps track of its state, and each subclass tracks its own state. subclassing wouldn't even be necessary anymore -- one could provdie a standonalone TermsEnumFilter, that just has getInitialSeekTerm and accept [11:34] ThetaPhi_: (see docs) [11:36] mikemccand: in general I don't like subclassing APIs -- it's a bigger surface area (how child works w/ parent, stateful methods, when can i call each method, etc.) [11:36] ThetaPhi_: we could remove that also in the current api [11:37] ThetaPhi_: only abstract classes should be subclassable [11:37] ThetaPhi_: thats asked very often [11:37] mikemccand: gonna be a cold run this AM -- 28F out there [11:37] ThetaPhi_: *calculating* [11:37] mikemccand: :) [11:37] mikemccand: invert 32 + 9/5 * C [11:38] ThetaPhi_: or ask google [11:38] ThetaPhi_: -2.22 °C [11:39] ThetaPhi_: (Link: http://www.google.de/search?hl=de&safe=off&q=28+fahrenheit+in+celsius&meta=&aq=f&oq=)http://www.google.de/search?hl=de&safe=off&q=28+fahrenheit+in+celsius&meta=&aq=f&oq= [11:39] mikemccand: if i jump in and see that, to use FTE i only have to implement to an interface (TermsEnumFilter) that has getInitialTerm/accept... I think that's more approachable than figuring out the relationship (methods, state) to an abstract parent class [11:39] mikemccand: nice [11:40] mikemccand: ugh, zillions of conflicts on backporting thread safe spellchecker [11:40] ThetaPhi_: if the abstract parent class makes all other methods final its the same [11:41] ThetaPhi_: but ok, you could change MTQ to just return the interface [11:41] mikemccand: not really the same -- you have to define when each overridden method is allowed/supposed to invoke the methods from the parent [11:42] mikemccand: once (if) we make the interaction stateless,we suddenly have the freedom to make it a separate interface... [11:43] ThetaPhi_: you mean calling next() inside accept() -> loops forever [11:43] mikemccand: yeah :( [11:43] mikemccand: of course... that freedom of the orig FilteredTermEnum is what made NRQ sneakiness possible in the first place ;) [11:44] ThetaPhi_: (which was a hack, you have seen the comment: something like: this relys on how setEnum() works, if this changes in the superclass this enum will no longer work) [11:45] ThetaPhi_: this was a comment before i removed the recursion in one of this issues [11:45] mikemccand: right -- but it was the right hting to do at the time [11:45] mikemccand: sure was tricky to undertsand :) [11:45] ThetaPhi_: (ore remove the recursion, like it is now in trunk) [11:45] mikemccand: you were bound by FTE [11:45] mikemccand: yes, that's better [11:45] mikemccand: but i like separate interface best -- it removes all shared state -- one thing describes what's filtered, the other implements according to that [11:46] mikemccand: then MTQ could almost simply accept a TermsEnum. the only difference is difference() -- ha [11:46] mikemccand: it's only fuzzy that uses difference() right? [11:46] ThetaPhi_: yes [11:46] ThetaPhi_: i was thinking about that [11:47] ThetaPhi_: because a empty prefix filter could simply return the termsenum [11:47] mikemccand: ahh yes nice opto [11:47] ThetaPhi_: PrefixQuery("") [11:47] ThetaPhi_: would match all documents that have at least any term [11:47] mikemccand: actually how come contrib/queries' TermsFilter isn't a query filter wrapper around an MTQ? [11:47] ThetaPhi_: in automaton we need this for the catch all .* case [11:47] mikemccand: ahh [11:48] ThetaPhi_: at the moment it generates a PrefixTermsEnum("") [11:48] mikemccand: so maybe AcceptResult has a float difference? [11:48] ThetaPhi_: which is slower than just returning the TermsEnum of the reader, just because of difference [11:49] ThetaPhi_: because it calls startsWith for each term (which in fact does nothing because termref.length==0 [11:49] mikemccand: ugh [11:49] ThetaPhi_: so not that bad [11:49] mikemccand: much better to simple return .terms() for field. or, rewrite to MatchAllDocs [11:49] ThetaPhi_: or a termrangequery(null,null) (often used in solr) [11:49] ThetaPhi_: no [11:49] ThetaPhi_: MatchAllDocs is different [11:50] ThetaPhi_: because the Prefix("") case only returns document that have at least one term of that field [11:50] mikemccand: ahhh yes [11:50] ThetaPhi_: this is needed for facetting i think in Solr [11:50] ThetaPhi_: not sure [11:50] mikemccand: got it [11:50] ThetaPhi_: in NRQ its fast, it just enumerates all low-prec terms [11:51] ThetaPhi_: 16 for precStep=4 [11:51] mikemccand: nice [11:51] mikemccand: NRQ is a great step forward for lucene [11:51] mikemccand: would be nice if MTQ could simply accept a TermsEnum [11:52] mikemccand: problem is difference() [11:52] ThetaPhi_: yes [11:52] ThetaPhi_: and is only used by fuzzy [11:52] mikemccand: yes, annoying for just that one case [11:52] mikemccand: contrib's TermsFilter really ought to be an MTQ [11:52] ThetaPhi_: because robert asked for a alltermsenum, which is stupied [11:53] mikemccand: yeah [11:53] ThetaPhi_: i only said, user prefix("") whoich is ok, the overhead is minimal, but in this case it would be zero [11:53] mikemccand: we could allow MTQ to accept either a TersmsEnum, or, a TermsEnumFilter (this new stateless interface, with AcceptResult also having a float difference field) [11:53] ThetaPhi_: aaaah [11:54] ThetaPhi_: but that looks like attributesource [11:54] ThetaPhi_: differenceattribute [11:54] mikemccand: yeah that's true [11:54] ThetaPhi_: termsenums are already with attributes [11:54] mikemccand: curious. so then MTQ could accept only TermsEnum, but, ask for its attrs, and if that's non-null, as for differenceattr, and if htat's non null, use it [11:55] ThetaPhi_: fuzzyenum just add this attribute [11:55] mikemccand: i like that! [11:55] ThetaPhi_: and fuzzyquery requests it [11:55] mikemccand: and MTQ respects it too [11:55] ThetaPhi_: and the default of this attribute is 1.0 [11:55] mikemccand: ye [11:55] mikemccand: yes [11:55] ThetaPhi_: so mtq just asks always for this attribute in scoring boolean rewrite [11:56] mikemccand: yes [11:56] mikemccand: and, MTQ is much simplified to accept a TermsEnum for its terms [11:56] ThetaPhi_: yes [11:56] mikemccand: and we can make a stateless API for filtering a TermsEnum [11:57] ThetaPhi_: by the way, I did not check ever<ything in flex [11:57] mikemccand: that's ok -- it's an immense number of changes! [11:57] ThetaPhi_: but if you wrap an enum, you must override attributes() to return the attributes of the delegate [11:57] mikemccand: i still keep finding bugs in the emulation layers. [11:57] mikemccand: hmm you're right [11:58] ThetaPhi_: so UnionTermsEnum or like so must override attributes() and call delegate.attributes() [11:58] mikemccand: or, fuzzy query could be the only one that does htis [11:58] ThetaPhi_: yes [11:58] ThetaPhi_: no [11:58] ThetaPhi_: let it in MTQ [11:58] ThetaPhi_: we yesterday talked, normally rewrite should be final in MTQ [11:58] mikemccand: that sounds good [11:58] ThetaPhi_: you should only change behaviour in getEnum [11:58] mikemccand: ok [11:59] ThetaPhi_: in trunk we currently rewrite WildCardQuery to PrefixQuery which is itsself an MTQ [11:59] ThetaPhi_: totally useless [11:59] ThetaPhi_: just return PrefixEnum in getEnum [11:59] mikemccand: ahh right [12:00] ThetaPhi_: in flex i changed [12:00] ThetaPhi_: the problem are bw tests in trunk, because they check the rewritten thing, but we could simply remove the tests [12:01] mikemccand: tests shouldn't check internals like that [12:01] ThetaPhi_: yes [12:01] ThetaPhi_: or these checks should be marked [12:01] mikemccand: yes [12:01] ThetaPhi_: in Junit4 with an annotation [12:01] mikemccand: ahh what annotation? [12:01] ThetaPhi_: and the bw tests only run tests without that annotation [12:02] mikemccand: that sounds great [12:02] ThetaPhi_: the problem may then still compilation [12:03] ThetaPhi_: if it checks package protected fields and so on [12:03] ThetaPhi_: brrr [12:03] ThetaPhi_: we have such tests [12:03] mikemccand: sigh. [12:03] mikemccand: yes [12:03] mikemccand: i remember committing them ;) [12:03] *** mikemccand has left #lucene. [12:03] ThetaPhi_: you always have to remove them [12:22] *** mikemccand has joined #lucene. [12:22] mikemccand: sorry, dropped off [12:25] ThetaPhi_: ok i said [12:25] ThetaPhi_: i will try out a little bit [12:25] ThetaPhi_: and post in the automaton [12:25] mikemccand: ok [12:25] ThetaPhi_: its hard to always generate both pacthes [12:26] ThetaPhi_: when we are happy, i will post a patch to 2110 and then we commit [12:26] ThetaPhi_: in automaton there is more special cases to test {quote} > Change FilteredTermsEnum to work like Iterator, so it is not positioned and > next() must be always called first. Remove empty() > ------------------------------------------------------------------------------------------------------------------------------ > > Key: LUCENE-2110 > URL: https://issues.apache.org/jira/browse/LUCENE-2110 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Affects Versions: Flex Branch > Reporter: Uwe Schindler > Assignee: Uwe Schindler > Fix For: Flex Branch > > Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch, > LUCENE-2110.patch > > > FilteredTermsEnum is confusing as it is initially positioned to the first > term. It should instead work like an uninitialized TermsEnum for a field > before the first call to next() or seek(). > Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ > or Automaton are not able to support this. Seeking is also not needed for MTQ > at all, so seek can just throw UOE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org