shaie commented on PR #841: URL: https://github.com/apache/lucene/pull/841#issuecomment-1150940692
> For the single point case (`ExactFSM`), this is crazy right? Even if there are a small number of provided `ExactFSM` instances we're matching against, doing a linear scan of all of them for every point is pretty dang inefficient. Especially so for a case where there are many provided hits with many indexed points for each. That's true, and hence why we discussed having a `fastMatchQuery` optimization too (which will skip over hits that don't eg. have any "Ford" or "2010" dimensions). That's an optimization we should (IMO) add in the (near) future, after this PR. > I personally think we should design with this optimization in mind, but I think we're close and I don't actually think the current proposal needs to really change to allow for future optimizations. I agree! I think it's fine if we'll leave these optimizations for later, and even if that will change the API between `MFSC` and `FSM`, it's not a big deal yet. > if we want to put these optimizations in place, could we not just add a method to `FSM` that exposes the min/max values for each dimension? We certainly can add such API. For "exact" matches it will return `min=max` right? Only for range ones they will be different. Are you proposing to return a `min[]` and `max[]` arrays, one entry per dimension? Just to make sure I understood your proposal (it doesn't have to be two arrays, but you understand the question). > we would let `MatchingFacetSetsCount` be able to take a look at the `FSM`'s passed to it and then determine if it should put the FSM into an R tree, KD tree, or just linearly scan based on the `min` and `max` of each`FSM` Not intending to start a discussion on how to implement that, but just wanted to point out that `fastMatchQuery` is something we'll need anyway (I guess) for drilldown, therefore it might be worth to start with it first? And, I'd rather we have some baseline benchmark before we implement any optimization, preferably for several use cases, so that in the end can propose which impl to use. E.g. maybe if you pass 1-2 `FSMs`, we won't need to create R/KD trees (they might even perform worse)? Anyway let's leave that for later. > With that being said, I do plan on writing the KD and R tree optimizations as soon as this is merged so I am still for this remaining a `long[]` API. I don't mind if we do that, but since it seems like a trivial change to make after (it doesn't affect the end-user API, only the internal protocol between `MFSC` and `FSM`), and since maybe `fastMatchQuery` will be good enough to speed up `FacetSet` aggregations, we may not need to have a `long[]` API eventually? Just saying I feel like it's not a decision we have to take now. We can also add a helper `toLong[]` method to begin with, so that `MFSC` can convert the bytes to longs, do whatever R/KD-Tree-iness it needs and then call `FSM`? I don't know if what I wrote even makes sense, but I feel like changing the API to fit the optimizations when we'll come to implement them will be much clearer since we'll know what we need. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org