leaves12138 commented on PR #7934:
URL: https://github.com/apache/paimon/pull/7934#issuecomment-4515692955
Thanks for the PR. I found one correctness blocker and a couple of
contract/resource issues in the current head (`46f7352`).
1. Nested `ROW` projection can return wrong data.
`RowFileFormat.createReaderFactory()` only computes a top-level field-id
mapping and `RowFileRecordIterator` applies it with `ProjectedRow`, which also
only projects top-level fields. This does not preserve nested projection
semantics. For example, if the stored schema is `r ROW<a INT, b INT>` and the
projected schema is `r ROW<b INT>`, reading the projected row currently returns
`a` as the first nested field instead of `b`.
I reproduced it with a small contract test: write
`GenericRow.of(GenericRow.of(1, 100))` with schema `ROW<a,b>`, read with
projected schema `ROW<b>`, and `row.getRow(0, 1).getInt(0)` returns `1` instead
of `100`.
This is risky for schema evolution / nested field pruning. Please either
implement nested projection correctly in the row format reader, or make sure
this format reads the full nested type and lets the existing schema-evolution
mapping layer apply the nested projection.
2. `validateDataFields()` currently lets unsupported types pass.
The method is empty and says all Paimon types are supported, but
`RowBlockWriter` / `RowBlockReader` do not handle at least `BLOB` and `VECTOR`.
As a result, table/schema validation can succeed and the failure is deferred to
write/read time. I reproduced this with `VECTOR`:
`format.validateDataFields(new RowType(... VECTOR ...))` does not throw.
Please either add real support for these type roots or reject unsupported
types recursively in `validateDataFields()`, and add regression coverage.
3. The prefetcher uses an unbounded static cached thread pool.
`BlockPrefetcher` uses `Executors.newCachedThreadPool()`. Each reader
only prefetches a few ranges, but with many concurrent splits this can still
create an unbounded number of threads. Also, when a reader is closed while
prefetch tasks are in flight, `InputStreamPool.close()` only closes streams
currently in the available queue; streams borrowed by cancelled/in-flight tasks
may be returned after close and not closed.
Please consider using a bounded daemon executor (or an executor
owned/configured by the reader/runtime) and make `InputStreamPool` close-safe
for streams returned after close.
The existing new tests pass for me:
```bash
mvn -pl paimon-format -DskipITs -Dcheckstyle.skip -Drat.skip=true
-Dspotless.check.skip=true -Dtest='org.apache.paimon.format.row.*Test' test
```
But the two temporary contract tests above expose the nested projection and
validation issues.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]