[ 
https://issues.apache.org/jira/browse/LUCENE-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020691#comment-17020691
 ] 

Robert Muir commented on LUCENE-9154:
-------------------------------------

Documents are the only thing that is quantized/encoded. The user's data is 
quantized (once) into the index to fit into 4-byte integers. That's a tradeoff 
to save space. It is transparent too (you can see the values if you add a 
docvalues field and fetch them).
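
To make the one-time quantization concrete, here is a minimal sketch of a 
latitude encoder/decoder. It assumes a uniform 32-bit grid over [-90, 90]; the 
class and method names are made up for illustration, and Lucene's real code 
(in GeoEncodingUtils) may differ in detail.

```java
// Illustrative sketch only: not Lucene's implementation.
final class LatQuantizeSketch {
  // Width of one grid cell in degrees: 180 degrees spread over 2^32 cells.
  static final double LAT_DECODE = 180.0 / 4294967296.0;

  // Quantize a latitude in [-90, 90] down to the grid cell that contains it.
  static int encodeLatitude(double latitude) {
    if (latitude == 90.0) {
      // 90.0 itself has no cell (it would need the value 2^31), so it is
      // folded into the topmost cell.
      latitude = Math.nextDown(latitude);
    }
    return (int) Math.floor(latitude / LAT_DECODE);
  }

  // Map a stored integer back to the lower edge of its grid cell.
  static double decodeLatitude(int encoded) {
    return encoded * LAT_DECODE;
  }

  public static void main(String[] args) {
    int top = encodeLatitude(90.0);
    System.out.println(top);                 // encoded form of a doc at 90.0
    System.out.println(decodeLatitude(top)); // what the index actually holds
  }
}
```

Under these assumptions, a document indexed at 90.0 decodes to a value 
slightly below 90.0, and re-encoding a decoded value gives back the same 
integer, so the data is only ever rounded once.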

Queries are *not* quantized/encoded. So if you look at LatLonPoint's distance 
or polygon queries, they don't encode anything. For example, the distance query 
computes the haversin distance from P1 to P2, where P1 is unadulterated: 
whatever you passed in.
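
As a rough illustration of what "unadulterated" means here, the sketch below 
computes a plain haversine distance directly from the doubles it is given, 
with no rounding of the query point. The class name, method name and the Earth 
radius constant are assumptions for this example, not Lucene's code.

```java
// Illustrative sketch only: a textbook haversine, not Lucene's implementation.
final class HaversineSketch {
  // Assumed mean Earth radius in meters.
  static final double EARTH_RADIUS_METERS = 6_371_008.8;

  // Great-circle distance between (lat1, lon1) and (lat2, lon2), in meters.
  // All inputs are in degrees and are used exactly as passed in.
  static double haversineMeters(double lat1, double lon1, double lat2, double lon2) {
    double phi1 = Math.toRadians(lat1);
    double phi2 = Math.toRadians(lat2);
    double dPhi = Math.toRadians(lat2 - lat1);
    double dLambda = Math.toRadians(lon2 - lon1);

    double sinHalfPhi = Math.sin(dPhi / 2);
    double sinHalfLambda = Math.sin(dLambda / 2);
    double a = sinHalfPhi * sinHalfPhi
        + Math.cos(phi1) * Math.cos(phi2) * sinHalfLambda * sinHalfLambda;

    // Clamp to 1.0 so rounding error cannot push asin out of its domain.
    return 2 * EARTH_RADIUS_METERS * Math.asin(Math.min(1.0, Math.sqrt(a)));
  }

  public static void main(String[] args) {
    // The query point (90.0, 0.0) is used as-is, even though 90.0 has no
    // exact representation among the quantized document values.
    System.out.println(haversineMeters(90.0, 0.0, 89.0, 0.0));
  }
}
```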

As an *implementation detail* the bounding box query quantizes/encodes its 
bounds into an ordinary PointRangeQuery. It does this because it is the fastest 
way to do it. But this is an *implementation detail*. 

The crazy logic here goes to a lot of work to make sure it behaves the same as 
if you were to replace the logic with a "SlowBoundingBoxQuery". If you were to 
write such a query, and a user passed in 90.0 as a minimum latitude, it would 
always match no documents (a value of 90.0 simply cannot exist in the index).
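
A minimal sketch of what such a "slow" query could look like, restricted to 
latitude to keep it short: decode every stored value and compare it against 
the raw, unquantized bounds. The class, its methods and the 32-bit grid 
encoding are assumptions for illustration; this is not code from Lucene.

```java
// Illustrative sketch only: "SlowBoxSketch" is a hypothetical name, and only
// the latitude dimension is handled.
final class SlowBoxSketch {
  // Assumed grid: 180 degrees spread over 2^32 cells.
  static final double LAT_DECODE = 180.0 / 4294967296.0;

  // Index-time quantization; 90.0 is folded into the topmost cell.
  static int encodeLatitude(double latitude) {
    if (latitude == 90.0) {
      latitude = Math.nextDown(latitude);
    }
    return (int) Math.floor(latitude / LAT_DECODE);
  }

  static double decodeLatitude(int encoded) {
    return encoded * LAT_DECODE;
  }

  // Brute force: decode each stored value and test it against the query
  // bounds exactly as the user passed them in.
  static int countMatches(int[] storedLatitudes, double minLat, double maxLat) {
    int count = 0;
    for (int stored : storedLatitudes) {
      double value = decodeLatitude(stored);
      if (value >= minLat && value <= maxLat) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    int[] index = {
      encodeLatitude(-90.0), encodeLatitude(0.0), encodeLatitude(90.0)
    };
    // A minimum latitude of exactly 90.0 can never be satisfied, because
    // every decoded value is strictly below 90.0.
    System.out.println(countMatches(index, 90.0, 90.0));
    System.out.println(countMatches(index, 89.0, 90.0));
  }
}
```

In this sketch the document indexed at 90.0 is still found by a box whose 
minimum is 89.0, but never by one whose minimum is exactly 90.0.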

So I am really opposed to the change, sorry, I think there is a big 
misunderstanding. We should not be "double-quantizing" or adding fuzzy logic or 
inconsistencies. The "hairy" logic is just to make sure that it behaves 
hypercorrect for all corner cases, even though it is doing sneaky stuff so that 
it can be implemented with a fast 2D Range Query.

> Remove encodeCeil()  to encode bounding box queries
> ---------------------------------------------------
>
>                 Key: LUCENE-9154
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9154
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ignacio Vera
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We currently have the following logic in LatLonPoint#newBoxQuery():
> {code:java}
> // exact double values of lat=90.0D and lon=180.0D must be treated special as they are not represented in the encoding
> // and should not drag in extra bogus junk! TODO: should encodeCeil just throw ArithmeticException to be less trappy here?
> if (minLatitude == 90.0) {
>   // range cannot match as 90.0 can never exist
>   return new MatchNoDocsQuery("LatLonPoint.newBoxQuery with minLatitude=90.0");
> }
> if (minLongitude == 180.0) {
>   if (maxLongitude == 180.0) {
>     // range cannot match as 180.0 can never exist
>     return new MatchNoDocsQuery("LatLonPoint.newBoxQuery with minLongitude=maxLongitude=180.0");
>   } else if (maxLongitude < minLongitude) {
>     // encodeCeil() with dateline wrapping!
>     minLongitude = -180.0;
>   }
> }
> byte[] lower = encodeCeil(minLatitude, minLongitude);
> byte[] upper = encode(maxLatitude, maxLongitude);
> {code}
>  
> IMO this is confusing and can lead to strange results. For example, a 
> query with {{minLatitude = maxLatitude = 90}} does not match points with 
> {{latitude = 90}}. On the other hand, a query with {{minLatitude = 
> maxLatitude = 89.99999996}} will match points at latitude = 90.
> I don't really understand the statement that says {{90.0 can never exist}}, 
> as this is also true for values > 89.99999995809048, which is the maximum 
> quantized value. By this argument, it is true for all values between 
> quantized coordinates, as they do not exist in the index, so why is 90D so 
> special? I guess because it cannot be rounded up without overflowing the 
> encoding.
> Another argument for removing this function is that it opens the door to 
> false negatives in the result of the query. If a query has minLon = 
> 89.999999957, it won't match points with longitude = 89.999999957, as it is 
> rounded up to 89.99999995809048.
> The only merit I can see in the current approach is that if you only index 
> points that are already quantized, then all queries would be exact. But does 
> it make sense for someone to only index quantized values and then query by 
> non-quantized bounding boxes?
>  
> I hope I am missing something, but my proposal is to remove encodeCeil 
> altogether and remove all the special handling at the positive pole and 
> positive dateline.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
