[
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786610#action_12786610
]
Uwe Schindler commented on LUCENE-2110:
---------------------------------------
>From our IRC chat between Mike and me:
{quote}
[10:19] ThetaPhi_: mike: any further comments for 2110?
[10:19] mikemccand: am just trying to look at it -- waiting on a sloooooow svn
up.
[10:19] ThetaPhi_: this out of order is not trivial and makes code for
automaton and NRQ more complicated
[10:19] mikemccand: the change is a great step forward
[10:19] mikemccand: yeah forget it
[10:19] ThetaPhi_: yes
[10:19] ThetaPhi_: you understood the problem
[10:19] mikemccand: i had thought it was nearly free
[10:19] mikemccand: i don't fully understand the problem
[10:19] mikemccand: impossible to keep up with you!!
[10:20] ThetaPhi_: when the underlying enum is exhausted (I was exhausted
yesterday, too), it stops processing, even if there are more nextSeelkTerms
[10:20] mikemccand: yes I saw your exhausted pun ;)
[10:20] mikemccand: wait -- if a TermsEnum becomes exhaused, you're unable to
seek it again? is that the issue?
[10:20] ThetaPhi_: waht you could do is, that after the underlying enum is
exhausted, that you call nextSeekTerm and reposition
[10:21] ThetaPhi_: yes
[10:21] ThetaPhi_: you could seek it
[10:21] ThetaPhi_: and this worked at the beginning (first patch without the
additional enum constants)
[10:21] ThetaPhi_: what it did was: when the underlying enum was exhausted it
simply called nextSeekTerm and seeked
[10:22] mikemccand: so TermsEnum API allows seeks after next()'ing to
exhaustion, right? (I think it should but maybe it' doesn't somewhere?)
[10:22] ThetaPhi_: which was ok
[10:22] ThetaPhi_: mikemccand: your enums works perfect
[10:22] ThetaPhi_: they can be seeked everytime even when exhausted
[10:22] mikemccand: ok
[10:23] ThetaPhi_: the problem is logic in filteredtermsenum
[10:23] mikemccand: ok
[10:23] ThetaPhi_: it gets very complicated when you want to support two things:
[10:23] ThetaPhi_: a) seeking on request (NO_AND_SEEK, YES_AND_SEEK)
[10:24] ThetaPhi_: b) and want to consume all nextSeekTerms()
[10:24] ThetaPhi_: if somebody seeks after _AND_SEEK to e.g. the last term of
the enum
[10:24] ThetaPhi_: in the current impl it will read that term
[10:25] ThetaPhi_: call termCompare and so on
[10:25] ThetaPhi_: if this accepts the term or not accepts the term
[10:25] ThetaPhi_: it will call next() on the underlying enum again
[10:25] ThetaPhi_: which then returns null
[10:26] ThetaPhi_: with the current patch it will then also end the
filteredtermsenum
[10:26] mikemccand: ok -- so the termination logic inside FTE.next() got really
hairy if on getting null from the actualEnum you had to consider going back for
another seek term?
[10:26] ThetaPhi_: yes
[10:26] ThetaPhi_: i tried it yesterday
[10:26] mikemccand: we certainly don't need to support thist oday
[10:26] ThetaPhi_: and I had no good test
[10:27] ThetaPhi_: yes exactly
[10:27] mikemccand: if somehow this ever becomes needed then we add it then
[10:27] mikemccand: the new FTE looks great
[10:27] ThetaPhi_: the idea was to provide a param to nextSeekTerm which
denotes if the underlying enum ended
[10:27] mikemccand: NRQ is soooo much simpler :)
[10:28] ThetaPhi_: the idea was to call even on exit nextSeekTerm(true), where
triue meant exhausted
[10:28] mikemccand: i see -- so it involved different API
[10:28] ThetaPhi_: TEnums like NRQ or automaton only going forward then know,
not to provide seek terms again
[10:28] ThetaPhi_: this was important for Automaton not to fall into endless
loops
[10:29] mikemccand: this is hairy stuff :)
[10:29] ThetaPhi_: and good for NRQ to not provide any further terms
[10:30] ThetaPhi_: when I tried to implement there was always a problem and you
were not able to correctly define waht should happen then
[10:30] mikemccand: how come the switch in FTE.next doesn't have "case NO" on
the return from accept(term)?
[10:31] ThetaPhi_: thats not needed it just falls through and loops to next term
[10:31] ThetaPhi_: you could add it as NO: break;
[10:31] * mikemccand ahh got it -- maybe just add comment saying so
[10:31] ThetaPhi_: good idea, will do
[10:32] ThetaPhi_: the NO_AND_seek case is the same it just says doSeek = true
[10:32] ThetaPhi_: and for yes it returns the term in both cases, but records
doSeek for the next time next() is called
[10:32] mikemccand: i wonder if we could simply add a seek term to AcceptStatus?
[10:33] mikemccand: vs calling the separate nextSeekTerm method
[10:33] ThetaPhi_: enum's contents are final
[10:33] mikemccand: not sure it'd be better... just wondering
[10:33] ThetaPhi_: because each constant is a singleton, they should be
unmodifiable
[10:34] mikemccand: ie, fix AcceptStatus to be like TermsEnum.SeekResult
[10:34] mikemccand: so it returns status, but also an optional seekTerm which
if null means no seeking
[10:34] ThetaPhi_: and how to do this, not with an enum constant
[10:34] ThetaPhi_: because they are constants
[10:35] ThetaPhi_: and createing a new object on each accept call seems not
ideal
[10:35] mikemccand: sorry, i was confused -- SeekStatus used to be a class that
had two attrs -- a TermRef, and the enum; i simplified that a while ago
[10:36] ThetaPhi_: aaaah
[10:36] ThetaPhi_: now its an enum
[10:36] ThetaPhi_: i do not think its a ood idea to return new objects
[10:36] mikemccand: right. for a while I had no TermsEnum.term()
[10:36] mikemccand: the enums would reuse their status object
[10:36] mikemccand: ie API would require that this is fine
[10:37] ThetaPhi_: but enums are singleton
[10:37] ThetaPhi_: for the whole JVM
[10:37] mikemccand: if we did that... then we could go back to YES/NO/END, and,
and seekTerm is then orthogonal
[10:37] ThetaPhi_: ok
[10:37] ThetaPhi_: so a protected setSeekTerm
[10:38] mikemccand: we'd make a new "AcceptResult" class. has "AcceptStatus
status" member, and also "TermRef seekTerm"
[10:38] ThetaPhi_: then we are back at the state before, we should ask robert
[10:38] mikemccand: ok
[10:38] ThetaPhi_: for him the separation of accept and nextSeekTerm was good
[10:38] mikemccand: ahh i see
[10:39] ThetaPhi_: but he could do the calculation of the next sring in accept,
too
[10:39] mikemccand: though, can't he simply emulate? ie, internally call a
private nextSeekTerm, stick it on the returned AcceptResult, and return that?
[10:39] mikemccand: right
[10:39] ThetaPhi_: telephone...
[10:39] mikemccand: do we ever call nextSeekTerm, unless AcceptStatus was
XXX_AND_SEEK?
[10:39] mikemccand: hah real world interrupts
[10:40] ThetaPhi_: thats what is done currentl
[10:40] ThetaPhi_: brb
[10:41] mikemccand: man all of infra seems slow right now... jira giving me
intermittent Internal Server Error... svn really slow
[10:41] mikemccand: git is tempting
[11:04] *** jwtoddii has joined #lucene.
[11:15] *** jwtoddii has signed off IRC ().
[11:15] ThetaPhi_: re
[11:15] mikemccand: hello
[11:15] ThetaPhi_: for me infra works normal
[11:16] mikemccand: hmmm
[11:16] ThetaPhi_: but maybe because i am on the euorpe svn
[11:16] mikemccand: you use the eu mirrors?
[11:16] mikemccand: ok
[11:16] ThetaPhi_: but for jira i do not know
[11:16] mikemccand: jira is on/off for me now
[11:16] ThetaPhi_: do not think there is a mirror
[11:17] ThetaPhi_: [10:39] mikemccand: do we ever call nextSeekTerm, unless
AcceptStatus was XXX_AND_SEEK?
[11:17] ThetaPhi_: no
[11:18] mikemccand: ok
[11:18] ThetaPhi_: and doSeek is always reset to false whenever a seek occured
[11:19] ThetaPhi_: the global doSeek is just for the YES_AND_SEEK case
[11:19] ThetaPhi_: because it exits the loop and returns the term
[11:19] ThetaPhi_: so on the next call to next() doSeek is true and we seek
[11:20] mikemccand: got it
[11:21] ThetaPhi_: an idea about the nextSeekTerm:
[11:21] ThetaPhi_: we add a protected setSeekTerm(TermRef)
[11:21] ThetaPhi_: this sets a private var
[11:22] ThetaPhi_: accept can set this and return YES or no wahtever
[11:22] ThetaPhi_: in the case of END it is ignored
[11:22] ThetaPhi_: but when next() is then called again, if this internal seek
varaible is != null it seeks, else it goes forward with next
[11:22] ThetaPhi_: and after seek it is set to null
[11:23] ThetaPhi_: and we could add an assert inside this method to check if
only seeking forward
[11:24] ThetaPhi_: to inform implementors of errors
[11:24] ThetaPhi_: assert termComp.compare(seekTerm, term()) > 0;
[11:24] mikemccand: why not simply return the optional seek term along w/
AcceptStatus? it'd require no additional methods, and makes it clear that
seeking part of accepting.
[11:25] mikemccand: ie the existince of new methods in the api (setSeekTerm,
nextSeekTerm) enrich the api -- make you wonder when you can call them. eg can
ctor call setSeekTerm?
[11:25] ThetaPhi_: because accept status is an enum constant which is final,
singleton and
[11:25] ThetaPhi_: yes ctor can do it at the moment
[11:25] ThetaPhi_: its already there its only called setinitialseekterm
[11:26] ThetaPhi_: (see docs)
[11:26] mikemccand: i mean make a new class (AcceptResult), like i described.
it contains an AcceptResult.Status (the enum), and a TermRef.
[11:26] ThetaPhi_: the problem is lots of object creation
[11:26] mikemccand: there would be no object creation -- we'd reuse?
[11:26] ThetaPhi_: in my opinion setSeekTerm is simplier to use
[11:27] ThetaPhi_: it would be the same like the reuse case
[11:27] ThetaPhi_: and setseekterm can also be used in ctor
[11:27] mikemccand: yeah the fact that ctor needs to set initial term does seem
to require the extra method
[11:28] mikemccand: though i also don't like side-effect methods -- you change
the internal state of the class, vs returning the seek term
[11:28] mikemccand: ie it makes the api stateful, which except for the initial
case, is overkill
[11:29] ThetaPhi_: its an iterator it always has an internal state
[11:29] ThetaPhi_: i understand your problem
[11:29] ThetaPhi_: you call in a method that should only accept something a
method that changes state
[11:30] mikemccand: the internal state is handled by FTE; my subclass is mostly
stateless, if i return the seek term
[11:30] ThetaPhi_: so you cannot call it from outside (which would not wrok,
because it is protected)
[11:30] ThetaPhi_: NRQ also has a state and also automaton
[11:31] ThetaPhi_: in NRQ it is the linkedlist with seek terms
[11:31] ThetaPhi_: in Automaton the nextString()
[11:31] mikemccand: right, they have their own state because of how they
iterate. it's just that seek term need not be stateful.
[11:32] ThetaPhi_: i think we should ask robert, at the moment i would know how
to change his code to support both cases
[11:33] mikemccand: in fact, instead of setInitialSeekTerm, could we have
getInitialSeekTerm? ie FTE would invoke that once on start
[11:33] mikemccand: ok :)
[11:33] mikemccand: automaton is scary
[11:33] mikemccand: and, powerful
[11:33] ThetaPhi_: the getInitialSeek term in the current api is hidden behind
the iterator
[11:34] ThetaPhi_: setInitialSeekTerm for the ctor is just a convenience
[11:34] ThetaPhi_: to prevent subclasses like prefixtermsenum from overrideing
and implementing the singleton itertor
[11:34] * mikemccand w/ getInitialSeekTerm, there is no sneaky shared state w/
FTE. i mean, FTE keeps track of its state, and each subclass tracks its own
state. subclassing wouldn't even be necessary anymore -- one could provdie a
standonalone TermsEnumFilter, that just has getInitialSeekTerm and accept
[11:34] ThetaPhi_: (see docs)
[11:36] mikemccand: in general I don't like subclassing APIs -- it's a bigger
surface area (how child works w/ parent, stateful methods, when can i call each
method, etc.)
[11:36] ThetaPhi_: we could remove that also in the current api
[11:37] ThetaPhi_: only abstract classes should be subclassable
[11:37] ThetaPhi_: thats asked very often
[11:37] mikemccand: gonna be a cold run this AM -- 28F out there
[11:37] ThetaPhi_: *calculating*
[11:37] mikemccand: :)
[11:37] mikemccand: invert 32 + 9/5 * C
[11:38] ThetaPhi_: or ask google
[11:38] ThetaPhi_: -2.22 °C
[11:39] ThetaPhi_: (Link:
http://www.google.de/search?hl=de&safe=off&q=28+fahrenheit+in+celsius&meta=&aq=f&oq=)http://www.google.de/search?hl=de&safe=off&q=28+fahrenheit+in+celsius&meta=&aq=f&oq=
[11:39] mikemccand: if i jump in and see that, to use FTE i only have to
implement to an interface (TermsEnumFilter) that has getInitialTerm/accept... I
think that's more approachable than figuring out the relationship (methods,
state) to an abstract parent class
[11:39] mikemccand: nice
[11:40] mikemccand: ugh, zillions of conflicts on backporting thread safe
spellchecker
[11:40] ThetaPhi_: if the abstract parent class makes all other methods final
its the same
[11:41] ThetaPhi_: but ok, you could change MTQ to just return the interface
[11:41] mikemccand: not really the same -- you have to define when each
overridden method is allowed/supposed to invoke the methods from the parent
[11:42] mikemccand: once (if) we make the interaction stateless,we suddenly
have the freedom to make it a separate interface...
[11:43] ThetaPhi_: you mean calling next() inside accept() -> loops forever
[11:43] mikemccand: yeah :(
[11:43] mikemccand: of course... that freedom of the orig FilteredTermEnum is
what made NRQ sneakiness possible in the first place ;)
[11:44] ThetaPhi_: (which was a hack, you have seen the comment: something
like: this relys on how setEnum() works, if this changes in the superclass this
enum will no longer work)
[11:45] ThetaPhi_: this was a comment before i removed the recursion in one of
this issues
[11:45] mikemccand: right -- but it was the right hting to do at the time
[11:45] mikemccand: sure was tricky to undertsand :)
[11:45] ThetaPhi_: (ore remove the recursion, like it is now in trunk)
[11:45] mikemccand: you were bound by FTE
[11:45] mikemccand: yes, that's better
[11:45] mikemccand: but i like separate interface best -- it removes all shared
state -- one thing describes what's filtered, the other implements according to
that
[11:46] mikemccand: then MTQ could almost simply accept a TermsEnum. the only
difference is difference() -- ha
[11:46] mikemccand: it's only fuzzy that uses difference() right?
[11:46] ThetaPhi_: yes
[11:46] ThetaPhi_: i was thinking about that
[11:47] ThetaPhi_: because a empty prefix filter could simply return the
termsenum
[11:47] mikemccand: ahh yes nice opto
[11:47] ThetaPhi_: PrefixQuery("")
[11:47] ThetaPhi_: would match all documents that have at least any term
[11:47] mikemccand: actually how come contrib/queries' TermsFilter isn't a
query filter wrapper around an MTQ?
[11:47] ThetaPhi_: in automaton we need this for the catch all .* case
[11:47] mikemccand: ahh
[11:48] ThetaPhi_: at the moment it generates a PrefixTermsEnum("")
[11:48] mikemccand: so maybe AcceptResult has a float difference?
[11:48] ThetaPhi_: which is slower than just returning the TermsEnum of the
reader, just because of difference
[11:49] ThetaPhi_: because it calls startsWith for each term (which in fact
does nothing because termref.length==0
[11:49] mikemccand: ugh
[11:49] ThetaPhi_: so not that bad
[11:49] mikemccand: much better to simple return .terms() for field. or,
rewrite to MatchAllDocs
[11:49] ThetaPhi_: or a termrangequery(null,null) (often used in solr)
[11:49] ThetaPhi_: no
[11:49] ThetaPhi_: MatchAllDocs is different
[11:50] ThetaPhi_: because the Prefix("") case only returns document that have
at least one term of that field
[11:50] mikemccand: ahhh yes
[11:50] ThetaPhi_: this is needed for facetting i think in Solr
[11:50] ThetaPhi_: not sure
[11:50] mikemccand: got it
[11:50] ThetaPhi_: in NRQ its fast, it just enumerates all low-prec terms
[11:51] ThetaPhi_: 16 for precStep=4
[11:51] mikemccand: nice
[11:51] mikemccand: NRQ is a great step forward for lucene
[11:51] mikemccand: would be nice if MTQ could simply accept a TermsEnum
[11:52] mikemccand: problem is difference()
[11:52] ThetaPhi_: yes
[11:52] ThetaPhi_: and is only used by fuzzy
[11:52] mikemccand: yes, annoying for just that one case
[11:52] mikemccand: contrib's TermsFilter really ought to be an MTQ
[11:52] ThetaPhi_: because robert asked for a alltermsenum, which is stupied
[11:53] mikemccand: yeah
[11:53] ThetaPhi_: i only said, user prefix("") whoich is ok, the overhead is
minimal, but in this case it would be zero
[11:53] mikemccand: we could allow MTQ to accept either a TersmsEnum, or, a
TermsEnumFilter (this new stateless interface, with AcceptResult also having a
float difference field)
[11:53] ThetaPhi_: aaaah
[11:54] ThetaPhi_: but that looks like attributesource
[11:54] ThetaPhi_: differenceattribute
[11:54] mikemccand: yeah that's true
[11:54] ThetaPhi_: termsenums are already with attributes
[11:54] mikemccand: curious. so then MTQ could accept only TermsEnum, but, ask
for its attrs, and if that's non-null, as for differenceattr, and if htat's non
null, use it
[11:55] ThetaPhi_: fuzzyenum just add this attribute
[11:55] mikemccand: i like that!
[11:55] ThetaPhi_: and fuzzyquery requests it
[11:55] mikemccand: and MTQ respects it too
[11:55] ThetaPhi_: and the default of this attribute is 1.0
[11:55] mikemccand: ye
[11:55] mikemccand: yes
[11:55] ThetaPhi_: so mtq just asks always for this attribute in scoring
boolean rewrite
[11:56] mikemccand: yes
[11:56] mikemccand: and, MTQ is much simplified to accept a TermsEnum for its
terms
[11:56] ThetaPhi_: yes
[11:56] mikemccand: and we can make a stateless API for filtering a TermsEnum
[11:57] ThetaPhi_: by the way, I did not check ever<ything in flex
[11:57] mikemccand: that's ok -- it's an immense number of changes!
[11:57] ThetaPhi_: but if you wrap an enum, you must override attributes() to
return the attributes of the delegate
[11:57] mikemccand: i still keep finding bugs in the emulation layers.
[11:57] mikemccand: hmm you're right
[11:58] ThetaPhi_: so UnionTermsEnum or like so must override attributes() and
call delegate.attributes()
[11:58] mikemccand: or, fuzzy query could be the only one that does htis
[11:58] ThetaPhi_: yes
[11:58] ThetaPhi_: no
[11:58] ThetaPhi_: let it in MTQ
[11:58] ThetaPhi_: we yesterday talked, normally rewrite should be final in MTQ
[11:58] mikemccand: that sounds good
[11:58] ThetaPhi_: you should only change behaviour in getEnum
[11:58] mikemccand: ok
[11:59] ThetaPhi_: in trunk we currently rewrite WildCardQuery to PrefixQuery
which is itsself an MTQ
[11:59] ThetaPhi_: totally useless
[11:59] ThetaPhi_: just return PrefixEnum in getEnum
[11:59] mikemccand: ahh right
[12:00] ThetaPhi_: in flex i changed
[12:00] ThetaPhi_: the problem are bw tests in trunk, because they check the
rewritten thing, but we could simply remove the tests
[12:01] mikemccand: tests shouldn't check internals like that
[12:01] ThetaPhi_: yes
[12:01] ThetaPhi_: or these checks should be marked
[12:01] mikemccand: yes
[12:01] ThetaPhi_: in Junit4 with an annotation
[12:01] mikemccand: ahh what annotation?
[12:01] ThetaPhi_: and the bw tests only run tests without that annotation
[12:02] mikemccand: that sounds great
[12:02] ThetaPhi_: the problem may then still compilation
[12:03] ThetaPhi_: if it checks package protected fields and so on
[12:03] ThetaPhi_: brrr
[12:03] ThetaPhi_: we have such tests
[12:03] mikemccand: sigh.
[12:03] mikemccand: yes
[12:03] mikemccand: i remember committing them ;)
[12:03] *** mikemccand has left #lucene.
[12:03] ThetaPhi_: you always have to remove them
[12:22] *** mikemccand has joined #lucene.
[12:22] mikemccand: sorry, dropped off
[12:25] ThetaPhi_: ok i said
[12:25] ThetaPhi_: i will try out a little bit
[12:25] ThetaPhi_: and post in the automaton
[12:25] mikemccand: ok
[12:25] ThetaPhi_: its hard to always generate both pacthes
[12:26] ThetaPhi_: when we are happy, i will post a patch to 2110 and then we
commit
[12:26] ThetaPhi_: in automaton there is more special cases to test
{quote}
> Change FilteredTermsEnum to work like Iterator, so it is not positioned and
> next() must be always called first. Remove empty()
> ------------------------------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: Flex Branch
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2110.patch, LUCENE-2110.patch, LUCENE-2110.patch,
> LUCENE-2110.patch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first
> term. It should instead work like an uninitialized TermsEnum for a field
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek() as eg. NRQ
> or Automaton are not able to support this. Seeking is also not needed for MTQ
> at all, so seek can just throw UOE.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]