rok commented on PR #854:
URL: https://github.com/apache/arrow-go/pull/854#issuecomment-4753051194

   Thanks for the very thorough review, @zeroshade — much appreciated. Pushed 
`e667d465` addressing the actionable items.
   
   ### Format/spec — field-id alignment
   - Changed `SchemaElement.vector_length` from Thrift field id **12 → 11** to 
match the `pitrou:vector-repetition` draft, so files interoperate with a 
draft-conformant reader. Updated both `parquet_vector.thrift` and the 
hand-applied `gen-go/parquet.go` (struct tag, Read/Write field id + dispatch). 
The thrift round-trip test goes through the real serializer, so write-id == 
read-id == 11 is enforced end-to-end. `FieldRepetitionType.VECTOR = 3` already 
matched.
   - On the C++ prototype's 3-level group + `Vector` logical type vs this 
reduced primitive-leaf form: that's the larger representational question for 
the proposal to converge on. I kept the leaf-only encoding since it's what the 
`pitrou:vector-repetition` draft models, but happy to follow whichever the 
format settles on.
   
   ### Worth addressing before graduating
   1. **Panic-as-validation on the public writer** — added 
`file.ErrVectorBatchMisaligned` and an up-front `validateVectorBatch(n)` in 
every public write path (`WriteBatch`, `WriteBatchSpacedWithError`, 
`WriteBitmapBatch`/`...Spaced`, `WriteDictIndices`). A misaligned batch now 
returns the typed error instead of a recovered `"unknown error type"`. The 
internal `vectorLengthForBatch` panic stays as an (now unreachable) invariant 
guard.
   2. **`NewSchemaChecked` can still panic** — `effectiveVectorLength` now 
returns an error, threaded via a new `NewColumnChecked` through `buildTree`, so 
`NewSchemaChecked` returns it. `NewColumn` still panics for direct 
construction. (Still unreachable until nested vectors exist, but the documented 
contract now holds.)
   3. **Cross-version silent misread** — documented explicitly on 
`WithVectorEncoding`: a pre-VECTOR reader may misread repetition-type-3 as a 
flat required column with the wrong row count rather than failing loudly. Agree 
a read-side feature guard is worth adding before this is non-experimental.
   4. **Statistics stay element-level** — kept; the intent is now stated in 
code. Agree it's a format-proposal decision to make explicit.
   
   ### Minor
   - Element named `"element"` and element/list metadata dropped without 
`WithStoreSchema` — added a doc note (values, element type, list size, and 
shape always round-trip).
   - V1 + offset-index multi-page seek and malformed `vector_length` rejection 
are already covered (`TestVectorColumnReaderSeekToRowWithOffsetIndex`, 
`TestVectorValidationErrors`); added explicit typed-error coverage for the 
misaligned `WriteBatch`/`WriteBatchSpaced` paths.
   - DataPageV1-without-offset-index double scan: left as-is (correct result; a 
micro-optimization for later).
   
   Left the statistics semantics and the group-vs-leaf representation as open 
proposal-convergence questions rather than changing them here. Thanks again!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to