Re: [PR] [common] Introduce RTreeFileIndex for query optimization [paimon]

via GitHub Thu, 21 May 2026 07:27:21 -0700


leaves12138 commented on PR #7919:
URL: https://github.com/apache/paimon/pull/7919#issuecomment-4509237840


   Thanks for the update. I found two blocking correctness/contract issues in 
the current head (`a4143bf63109`).
   
   1. `RTreeFileIndexWriter` does not accept the object type passed by the real 
file-index write path for `ARRAY<DOUBLE>` columns. 
`DataFileIndexWriter.FileIndexMaintainer.write()` calls 
`fileIndexWriter.writeRecord(getter.getFieldOrNull(row))`, so an array field is 
passed as Paimon's `InternalArray` implementation, e.g. `GenericArray` / binary 
array, not as `double[]` or `java.util.List`. However, 
`RTreeFileIndexWriter.extractPoint()` only handles `List` and `double[]`. A 
normal write through the file-index path can therefore fail with:
   
      ```text
      Cannot extract point from: org.apache.paimon.data.GenericArray
      ```
   
   2. Production deserialization still loses the leaf flag for non-root nodes. 
`RTreeFileIndexReader.deserializeNode()` only calls `node.setLeaf(isLeaf)` for 
the root. Child nodes are always created with `new RTreeNode(..., false)`, and 
the recursive call does not update their serialized `isLeaf` value. As a 
result, leaf children under an internal node remain `isLeaf=false`; after 
serializing a multi-level tree with `RTreeFileIndexWriter` and reading it back 
with `RTreeFileIndex.createReader()`, an equality query for an existing point 
can return `SKIP`.
   
   I reproduced both with a small contract test:
   
   - `writer.writeRecord(new GenericArray(new double[] {1.0, 2.0}))` throws 
from `RTreeFileIndexWriter.extractPoint()`.
   - Writing 200 points, serializing with `RTreeFileIndexWriter`, deserializing 
with the production `RTreeFileIndexReader`, and querying `[50.0, 50.0]` returns 
no match.
   
   The existing tests pass because they mostly call `writer.write(new double[] 
{...})` directly, bypassing `FileIndexWriter.writeRecord()`, and 
`RTreeSerializationTest` uses its own test deserializer that constructs nodes 
with the serialized `isLeaf` flag, bypassing the production reader bug.
   
   Please fix these before merge:
   
   1. Make the writer handle Paimon's `InternalArray` for `ARRAY<DOUBLE>` 
values, and check whether reader literals need the same treatment.
   2. Preserve `isLeaf` for every node during production deserialization, not 
only the root.
   3. Add regression coverage through the real `FileIndexWriter.writeRecord()` 
/ `RTreeFileIndex.createReader()` path.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [common] Introduce RTreeFileIndex for query optimization [paimon]

Reply via email to