rok commented on PR #854: URL: https://github.com/apache/arrow-go/pull/854#issuecomment-4753051194
Thanks for the very thorough review, @zeroshade — much appreciated. Pushed `e667d465` addressing the actionable items. ### Format/spec — field-id alignment - Changed `SchemaElement.vector_length` from Thrift field id **12 → 11** to match the `pitrou:vector-repetition` draft, so files interoperate with a draft-conformant reader. Updated both `parquet_vector.thrift` and the hand-applied `gen-go/parquet.go` (struct tag, Read/Write field id + dispatch). The thrift round-trip test goes through the real serializer, so write-id == read-id == 11 is enforced end-to-end. `FieldRepetitionType.VECTOR = 3` already matched. - On the C++ prototype's 3-level group + `Vector` logical type vs this reduced primitive-leaf form: that's the larger representational question for the proposal to converge on. I kept the leaf-only encoding since it's what the `pitrou:vector-repetition` draft models, but happy to follow whichever the format settles on. ### Worth addressing before graduating 1. **Panic-as-validation on the public writer** — added `file.ErrVectorBatchMisaligned` and an up-front `validateVectorBatch(n)` in every public write path (`WriteBatch`, `WriteBatchSpacedWithError`, `WriteBitmapBatch`/`...Spaced`, `WriteDictIndices`). A misaligned batch now returns the typed error instead of a recovered `"unknown error type"`. The internal `vectorLengthForBatch` panic stays as an (now unreachable) invariant guard. 2. **`NewSchemaChecked` can still panic** — `effectiveVectorLength` now returns an error, threaded via a new `NewColumnChecked` through `buildTree`, so `NewSchemaChecked` returns it. `NewColumn` still panics for direct construction. (Still unreachable until nested vectors exist, but the documented contract now holds.) 3. **Cross-version silent misread** — documented explicitly on `WithVectorEncoding`: a pre-VECTOR reader may misread repetition-type-3 as a flat required column with the wrong row count rather than failing loudly. Agree a read-side feature guard is worth adding before this is non-experimental. 4. **Statistics stay element-level** — kept; the intent is now stated in code. Agree it's a format-proposal decision to make explicit. ### Minor - Element named `"element"` and element/list metadata dropped without `WithStoreSchema` — added a doc note (values, element type, list size, and shape always round-trip). - V1 + offset-index multi-page seek and malformed `vector_length` rejection are already covered (`TestVectorColumnReaderSeekToRowWithOffsetIndex`, `TestVectorValidationErrors`); added explicit typed-error coverage for the misaligned `WriteBatch`/`WriteBatchSpaced` paths. - DataPageV1-without-offset-index double scan: left as-is (correct result; a micro-optimization for later). Left the statistics semantics and the group-vs-leaf representation as open proposal-convergence questions rather than changing them here. Thanks again! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
