gsmiller commented on PR #841: URL: https://github.com/apache/lucene/pull/841#issuecomment-1150439971
Let me try to summarize my understanding of the "future optimization" debate as it pertains to this proposed API/implementation, and see if we're on the same page or if I'm overlooking something.

The current proposal encapsulates/delegates matching logic behind `FacetSetMatcher`, which is responsible for determining whether or not that `FSM` instance matches a provided point. There's `RangeFSM`, which knows how to match based on whether or not the point is contained in the n-dim range, and there's `ExactFSM`, which just does exact point equivalence. The "protocol" for doing facet aggregation/counting is implemented inside `MatchingFacetSetsCounts`, which delegates to the `FSM#matches` method.

The inefficiency is that, because `MatchingFacetSetsCounts` can make no assumptions about the `FSM` instances provided to it, it must iterate _all_ provided `FSM` instances for every indexed point for every provided hit. For the single-point case (`ExactFSM`), this is crazy, right? Even if there are only a small number of provided `ExactFSM` instances we're matching against, doing a linear scan of all of them for every point is pretty dang inefficient, especially when there are many provided hits with many indexed points each. I think the same logic generally holds true for the range case as well, but maybe that's more debatable. The problem is that `MatchingFacetSetsCounts` doesn't know anything about the `FSM` instances it gets, so it can't do anything other than ask each of them, "hey, do you match this point?"

So the debate seems to be how to set up the API to allow for future optimizations, or whether we should even worry about it at all. I personally think we should design with this optimization in mind, but I think we're close, and I don't actually think the current proposal needs to change to allow for future optimizations.

This is where I get a little fuzzy, as I haven't totally caught up on the various proposals and conversations that have taken place. But if we kept the implementation as it currently exists, and in the future we wanted to put these optimizations in place, could we not just add a method to `FSM` that exposes the min/max values for each dimension? Then `MatchingFacetSetsCounts` could inspect the underlying "hyperrectangles" represented by each `FSM` by looking at these min/max values before it counts, and decide how it wants to proceed. The `matches` method is potentially still useful/needed depending on how flexible we want this thing to be: if a user creates an `FSM` implementation that's more complicated than just a "hyperrectangle" (e.g., some complicated geometry), the contract could be that the provided min/max is a bounding box, and `matches` still needs to be called to actually confirm the match.

I point this out not to propose implementing it now, but to say that I think we have options to extend what is currently proposed here if/when we want to optimize. Does this make sense, or am I talking about a completely different problem, or missing key parts of the conversation that's happened? Apologies if I have.
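
To make that idea a bit more concrete, here's a rough, self-contained sketch (emphatically not the actual API; `bounds()`, `DimBounds`, and the class names are all hypothetical stand-ins) of how counting code could use per-dimension min/max exposed by matchers: when every matcher is an exact point (min == max per dimension), the per-point linear scan collapses to a hash lookup, and anything else falls back to a cheap bounding-box pre-filter plus `matches()` to confirm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FacetSetBoundsSketch {

  /** Bounding box of the region a matcher covers, one min/max per dimension. */
  record DimBounds(long[] min, long[] max) {
    boolean isPoint() {
      return Arrays.equals(min, max);
    }
  }

  /** Simplified stand-in for the proposed FacetSetMatcher contract. */
  interface FacetSetMatcher {
    /** Exact membership test for one indexed point. */
    boolean matches(long[] point);

    /**
     * Hypothetical future addition: a bounding box the counting code may use to
     * pre-filter. For shapes more complex than a hyperrectangle this is only an
     * over-approximation, so matches() still has the final say.
     */
    DimBounds bounds();
  }

  /** Counts how many indexed points each matcher accepts. */
  static int[] count(List<FacetSetMatcher> matchers, List<long[]> points) {
    // Specialization: if every matcher is a single exact point, a hash lookup
    // per indexed point replaces the scan over all matchers.
    if (matchers.stream().allMatch(m -> m.bounds().isPoint())) {
      Map<List<Long>, List<Integer>> byPoint = new HashMap<>();
      for (int i = 0; i < matchers.size(); i++) {
        byPoint.computeIfAbsent(key(matchers.get(i).bounds().min()), k -> new ArrayList<>()).add(i);
      }
      int[] counts = new int[matchers.size()];
      for (long[] point : points) {
        for (int i : byPoint.getOrDefault(key(point), List.of())) {
          counts[i]++;
        }
      }
      return counts;
    }

    // General case: bounding-box pre-filter, then confirm with matches().
    int[] counts = new int[matchers.size()];
    for (long[] point : points) {
      for (int i = 0; i < matchers.size(); i++) {
        FacetSetMatcher fsm = matchers.get(i);
        if (inBounds(point, fsm.bounds()) && fsm.matches(point)) {
          counts[i]++;
        }
      }
    }
    return counts;
  }

  private static List<Long> key(long[] point) {
    return Arrays.stream(point).boxed().toList();
  }

  private static boolean inBounds(long[] point, DimBounds b) {
    for (int d = 0; d < point.length; d++) {
      if (point[d] < b.min()[d] || point[d] > b.max()[d]) {
        return false;
      }
    }
    return true;
  }
}
```

Again, just a sketch under the assumption that matchers can expose a bounding box; the real optimization could instead build a range/interval structure for the `RangeFSM` case, but the point is that the counting side can specialize without the matcher API changing beyond one added accessor.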