leaves12138 commented on PR #7934:
URL: https://github.com/apache/paimon/pull/7934#issuecomment-4515692955

   Thanks for the PR. I found one correctness blocker and a couple of 
contract/resource issues in the current head (`46f7352`).
   
   1. Nested `ROW` projection can return wrong data.
   
      `RowFileFormat.createReaderFactory()` only computes a top-level field-id 
mapping and `RowFileRecordIterator` applies it with `ProjectedRow`, which also 
only projects top-level fields. This does not preserve nested projection 
semantics. For example, if the stored schema is `r ROW<a INT, b INT>` and the 
projected schema is `r ROW<b INT>`, reading the projected row currently returns 
`a` as the first nested field instead of `b`.
   
      I reproduced it with a small contract test: write 
`GenericRow.of(GenericRow.of(1, 100))` with schema `ROW<a,b>`, read with 
projected schema `ROW<b>`, and `row.getRow(0, 1).getInt(0)` returns `1` instead 
of `100`.
   
      This is risky for schema evolution / nested field pruning. Please either 
implement nested projection correctly in the row format reader, or make sure 
this format reads the full nested type and lets the existing schema-evolution 
mapping layer apply the nested projection.
   
   2. `validateDataFields()` currently lets unsupported types pass.
   
      The method is empty and says all Paimon types are supported, but 
`RowBlockWriter` / `RowBlockReader` do not handle at least `BLOB` and `VECTOR`. 
As a result, table/schema validation can succeed and the failure is deferred to 
write/read time. I reproduced this with `VECTOR`: 
`format.validateDataFields(new RowType(... VECTOR ...))` does not throw.
   
      Please either add real support for these type roots or reject unsupported 
types recursively in `validateDataFields()`, and add regression coverage.
   
   3. The prefetcher uses an unbounded static cached thread pool.
   
      `BlockPrefetcher` uses `Executors.newCachedThreadPool()`. Each reader 
only prefetches a few ranges, but with many concurrent splits this can still 
create an unbounded number of threads. Also, when a reader is closed while 
prefetch tasks are in flight, `InputStreamPool.close()` only closes streams 
currently in the available queue; streams borrowed by cancelled/in-flight tasks 
may be returned after close and not closed.
   
      Please consider using a bounded daemon executor (or an executor 
owned/configured by the reader/runtime) and make `InputStreamPool` close-safe 
for streams returned after close.
   
   The existing new tests pass for me:
   
   ```bash
   mvn -pl paimon-format -DskipITs -Dcheckstyle.skip -Drat.skip=true 
-Dspotless.check.skip=true -Dtest='org.apache.paimon.format.row.*Test' test
   ```
   
   But the two temporary contract tests above expose the nested projection and 
validation issues.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to