Re: [PR] [common] Introduce RTreeFileIndex for query optimization [paimon]

via GitHub Thu, 21 May 2026 19:57:57 -0700


xuzifu666 commented on PR #7919:
URL: https://github.com/apache/paimon/pull/7919#issuecomment-4514611557


   > Thanks for the update. I found two blocking correctness/contract issues in 
the current head (`a4143bf63109`).
   > 
   > 1. `RTreeFileIndexWriter` does not accept the object type passed by the 
real file-index write path for `ARRAY<DOUBLE>` columns. 
`DataFileIndexWriter.FileIndexMaintainer.write()` calls 
`fileIndexWriter.writeRecord(getter.getFieldOrNull(row))`, so an array field is 
passed as Paimon's `InternalArray` implementation, e.g. `GenericArray` / binary 
array, not as `double[]` or `java.util.List`. However, 
`RTreeFileIndexWriter.extractPoint()` only handles `List` and `double[]`. A 
normal write through the file-index path can therefore fail with:
   >    ```
   >    Cannot extract point from: org.apache.paimon.data.GenericArray
   >    ```
   > 2. Production deserialization still loses the leaf flag for non-root 
nodes. `RTreeFileIndexReader.deserializeNode()` only calls 
`node.setLeaf(isLeaf)` for the root. Child nodes are always created with `new 
RTreeNode(..., false)`, and the recursive call does not update their serialized 
`isLeaf` value. As a result, leaf children under an internal node remain 
`isLeaf=false`; after serializing a multi-level tree with 
`RTreeFileIndexWriter` and reading it back with 
`RTreeFileIndex.createReader()`, an equality query for an existing point can 
return `SKIP`.
   > 
   > I reproduced both with a small contract test:
   > 
   > * `writer.writeRecord(new GenericArray(new double[] {1.0, 2.0}))` throws 
from `RTreeFileIndexWriter.extractPoint()`.
   > * Writing 200 points, serializing with `RTreeFileIndexWriter`, 
deserializing with the production `RTreeFileIndexReader`, and querying `[50.0, 
50.0]` returns no match.
   > 
   > The existing tests pass because they mostly call `writer.write(new 
double[] {...})` directly, bypassing `FileIndexWriter.writeRecord()`, and 
`RTreeSerializationTest` uses its own test deserializer that constructs nodes 
with the serialized `isLeaf` flag, bypassing the production reader bug.
   > 
   > Please fix these before merge:
   > 
   > 1. Make the writer handle Paimon's `InternalArray` for `ARRAY<DOUBLE>` 
values, and check whether reader literals need the same treatment.
   > 2. Preserve `isLeaf` for every node during production deserialization, not 
only the root.
   > 3. Add regression coverage through the real 
`FileIndexWriter.writeRecord()` / `RTreeFileIndex.createReader()` path.
   
   Thank you for the suggestion！I made relevant changes, PTAL @leaves12138 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [common] Introduce RTreeFileIndex for query optimization [paimon]

Reply via email to