shaie commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1150940692

   > For the single point case (`ExactFSM`), this is crazy right? Even if there 
are a small number of provided `ExactFSM` instances we're matching against, 
doing a linear scan of all of them for every point is pretty dang inefficient. 
Especially so for a case where there are many provided hits with many indexed 
points for each.
   
   That's true, and hence why we discussed having a `fastMatchQuery` 
optimization too (which will skip over hits that don't eg. have any "Ford" or 
"2010" dimensions). That's an optimization we should (IMO) add in the (near) 
future, after this PR.
   
   > I personally think we should design with this optimization in mind, but I 
think we're close and I don't actually think the current proposal needs to 
really change to allow for future optimizations.
   
   I agree! I think it's fine if we'll leave these optimizations for later, and 
even if that will change the API between `MFSC` and `FSM`, it's not a big deal 
yet.
   
   > if we want to put these optimizations in place, could we not just add a 
method to `FSM` that exposes the min/max values for each dimension?
   
   We certainly can add such API. For "exact" matches it will return `min=max` 
right? Only for range ones they will be different. Are you proposing to return 
a `min[]` and `max[]` arrays, one entry per dimension? Just to make sure I 
understood your proposal (it doesn't have to be two arrays, but you understand 
the question).
   
   > we would let `MatchingFacetSetsCount` be able to take a look at the 
`FSM`'s passed to it and then determine if it should put the FSM into an R 
tree, KD tree, or just linearly scan based on the `min` and `max` of each`FSM`
   
   Not intending to start a discussion on how to implement that, but just 
wanted to point out that `fastMatchQuery` is something we'll need anyway (I 
guess) for drilldown, therefore it might be worth to start with it first? And, 
I'd rather we have some baseline benchmark before we implement any 
optimization, preferably for several use cases, so that in the end can propose 
which impl to use. E.g. maybe if you pass 1-2 `FSMs`, we won't need to create 
R/KD trees (they might even perform worse)? Anyway let's leave that for later.
   
   > With that being said, I do plan on writing the KD and R tree optimizations 
as soon as this is merged so I am still for this remaining a `long[]` API.
   
   I don't mind if we do that, but since it seems like a trivial change to make 
after (it doesn't affect the end-user API, only the internal protocol between 
`MFSC` and `FSM`), and since maybe `fastMatchQuery` will be good enough to 
speed up `FacetSet` aggregations, we may not need to have a `long[]` API 
eventually? Just saying I feel like it's not a decision we have to take now. We 
can also add a helper `toLong[]` method to begin with, so that `MFSC` can 
convert the bytes to longs, do whatever R/KD-Tree-iness it needs and then call 
`FSM`? I don't know if what I wrote even makes sense, but I feel like changing 
the API to fit the optimizations when we'll come to implement them will be much 
clearer since we'll know what we need.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to