qzyu999 opened a new pull request, #50122:
URL: https://github.com/apache/arrow/pull/50122

   ### Rationale for this change
   This is part of the GH-45937 umbrella (Add variant support to C++ Parquet). 
It adds the encoding (writing) side of the Variant binary format, building on 
the decoder from GH-45946. The encoder is required for GH-45948 (variant 
shredding) and for any Parquet writer that needs to produce Variant columns.
   
   As with the decoder, the implementation targets feature parity with the 
[arrow-go 
`parquet/variant.Builder`](https://github.com/apache/arrow-go/tree/main/parquet/variant),
 adapted to idiomatic C++ patterns. Divergences are deliberate and documented.
   
   ### What changes are included in this PR?
   Adds a `VariantBuilder` class in `variant_internal.h` / `variant_builder.cc` 
for encoding Variant binary values per the [Variant Encoding 
Spec](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md).
   
   **Builder API:**
   - All 21 primitive types: `Null()`, `Bool()`, `Int()` (auto-sizes), 
`Int8/16/32/64()`, `Float()`, `Double()`, `Date()`, `TimestampMicros/NTZ()`, 
`TimestampNanos/NTZ()`, `TimeNTZ()`, `Decimal4/8/16()`, `String()` (auto 
short-string for ≤63 bytes), `Binary()`, `UUID()`
   - Container construction: `Offset()` / `NextElement()` / `FinishArray()` for 
arrays, `NextField()` / `FinishObject()` for objects
   - `Finish()` — produces encoded metadata + value buffers with sorted-flag 
detection
   - `Reset()` — clears buffer for builder reuse; dictionary preserved across 
`Finish()` calls
   - Constructor from existing `VariantMetadata` for shared-dictionary workflows
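
The auto-sizing behaviour of `Int()` and the short-string cutoff of `String()` can be sketched in isolation. This is a minimal illustration, not the code in this PR: the helper names (`SmallestIntType`, `PrimitiveHeader`, `ShortStringHeader`) are hypothetical, and the type IDs and bit layout are taken from the Variant Encoding Spec (header byte = type-specific header shifted left by 2, OR'ed with the 2-bit basic type).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Primitive type IDs for the integer types, as listed in the Variant Encoding Spec.
enum class PrimitiveType : uint8_t { kInt8 = 3, kInt16 = 4, kInt32 = 5, kInt64 = 6 };

constexpr uint8_t kBasicPrimitive = 0;       // basic_type for primitives
constexpr uint8_t kBasicShortString = 1;     // basic_type for short strings
constexpr size_t kMaxShortStringSize = 63;   // length must fit in 6 bits

// Hypothetical helper: pick the narrowest signed integer type that holds `v`.
inline PrimitiveType SmallestIntType(int64_t v) {
  if (v >= std::numeric_limits<int8_t>::min() && v <= std::numeric_limits<int8_t>::max()) {
    return PrimitiveType::kInt8;
  }
  if (v >= std::numeric_limits<int16_t>::min() && v <= std::numeric_limits<int16_t>::max()) {
    return PrimitiveType::kInt16;
  }
  if (v >= std::numeric_limits<int32_t>::min() && v <= std::numeric_limits<int32_t>::max()) {
    return PrimitiveType::kInt32;
  }
  return PrimitiveType::kInt64;
}

// Header byte for a primitive value: (type_id << 2) | basic_type.
inline uint8_t PrimitiveHeader(PrimitiveType t) {
  return static_cast<uint8_t>((static_cast<uint8_t>(t) << 2) | kBasicPrimitive);
}

// Strings of at most 63 bytes can use the compact short-string form.
inline bool FitsShortString(size_t num_bytes) { return num_bytes <= kMaxShortStringSize; }

// Header byte for a short string: (length << 2) | basic_type.
inline uint8_t ShortStringHeader(size_t num_bytes) {
  assert(FitsShortString(num_bytes));
  return static_cast<uint8_t>((num_bytes << 2) | kBasicShortString);
}
```

Under these assumptions, a 64-byte string no longer fits the 6-bit length field and would fall back to the long-string primitive, which is the 63/64-byte boundary the tests exercise.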
   
   **Key design points:**
   - Move-only (non-copyable, `noexcept` movable)
   - `FinishObject()` sorts fields in-place by key — spec requires field IDs in 
lexicographic key order
   - Strict duplicate key rejection (`Status::Invalid`): the spec says "An object 
may not contain duplicate keys"; configurable tolerance is deferred to GH-45948 
and marked with a TODO
   - `FinishArray()` validates offsets are non-negative
   - `Finish()` validates total dictionary size fits in 4-byte offsets
   - The encoder validates decimal scale (≤ 38); the decoder is lenient
   - Go enforces a 128MB metadata limit (`metadataMaxSizeLimit`); C++ only 
enforces the spec's ~4GB 4-byte offset maximum
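
The sort-then-reject step described for `FinishObject()` can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: `FieldEntry` and `SortAndCheckFields` are hypothetical names, and a `bool` return stands in for the `Status::Invalid` the real builder reports.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record of one object field collected while building.
struct FieldEntry {
  std::string key;    // field name
  uint32_t field_id;  // index of the key in the metadata dictionary
  int64_t offset;     // offset of the field's value within the object
};

// Sorts fields in place by key (byte-wise lexicographic order) and reports
// whether all keys are unique. Returns false if a duplicate key is found,
// which is where a real builder would return an error status.
inline bool SortAndCheckFields(std::vector<FieldEntry>& fields) {
  std::sort(fields.begin(), fields.end(),
            [](const FieldEntry& a, const FieldEntry& b) { return a.key < b.key; });
  // After sorting, duplicate keys are adjacent.
  auto dup = std::adjacent_find(
      fields.begin(), fields.end(),
      [](const FieldEntry& a, const FieldEntry& b) { return a.key == b.key; });
  return dup == fields.end();
}
```

Sorting first makes the duplicate check a single linear pass, so the whole step stays O(n log n) without an auxiliary set.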
   
   **TODOs for GH-45948 (shredding):**
   ```cpp
   // TODO GH-45948: Add BuildWithoutMeta() — raw value bytes without metadata
   // TODO GH-45948: Add UnsafeAppendEncoded() — append pre-encoded bytes
   // TODO GH-45948: Add SetAllowDuplicates(bool) — last-value-wins semantics
   ```
   
   ### Are these changes tested?
   Yes. 238 total tests pass with `BUILD_WARNING_LEVEL=CHECKIN` (73 encoder + 
165 decoder):
   - Primitive round-trips (14 tests including short/long string boundary at 
63/64 bytes)
   - Int auto-sizing boundaries: Int8→Int16→Int32→Int64 transitions (8 tests)
   - Direct int type methods: `Int8/16/32/64` without auto-sizing (4 tests)
   - Array round-trips: empty, simple, nested (3 tests)
   - Object round-trips: empty, simple, nested, duplicate rejection, field 
sorting (5 tests)
   - Builder features: reset, from-existing-metadata, sorted/unsorted flag (4 
tests)
   - Integration: complex nested object, large metadata (300 keys), offset-size 
computation, invalid start, negative offsets (5 tests)
   - Special floats: NaN, ±Inf for float and double (6 tests)
   - Large containers triggering `is_large` flag: 300-element array + 300-field 
object (2 tests)
   - Decoder utility round-trips through builder output: FindObjectField, 
GetArrayElement, GetObjectFieldAt, ValueSize (4 tests)
   - Builder reuse: dictionary preservation across multiple `Finish()` calls (2 
tests)
   - Pre-existing buffer: FinishObject/FinishArray with start > 0 (2 tests)
   - Decimal scale validation: rejects scale > 38 (1 test)
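
The `is_large` and offset-size cases in the list above rest on simple arithmetic, sketched here for the array header only. The helper names are hypothetical and the bit layout follows the Variant Encoding Spec (array `value_header` = `is_large` in bit 2, `field_offset_size_minus_one` in bits 0-1; basic type 3 for arrays); the PR's actual code may organise this differently.

```cpp
#include <cstdint>

constexpr uint8_t kBasicArray = 3;  // basic_type for arrays

// Smallest number of bytes (1 to 4) able to represent `max_value`.
inline uint8_t MinByteWidth(uint32_t max_value) {
  if (max_value <= 0xFFu) return 1;
  if (max_value <= 0xFFFFu) return 2;
  if (max_value <= 0xFFFFFFu) return 3;
  return 4;
}

// A container is "large" when its element count does not fit in one byte,
// in which case the count is written with 4 bytes instead of 1.
inline bool IsLarge(uint32_t num_elements) { return num_elements > 0xFFu; }

// Header byte for an array: ((is_large << 2 | offset_size_minus_one) << 2) | 3.
inline uint8_t ArrayHeader(uint32_t num_elements, uint32_t max_offset) {
  const uint8_t offset_size_minus_one =
      static_cast<uint8_t>(MinByteWidth(max_offset) - 1);
  const uint8_t value_header = static_cast<uint8_t>(
      ((IsLarge(num_elements) ? 1u : 0u) << 2) | offset_size_minus_one);
  return static_cast<uint8_t>((value_header << 2) | kBasicArray);
}
```

With this layout, a 300-element array crosses the 255-element threshold and sets `is_large`, which is why 300 is a convenient size for the large-container tests.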
   
   ### Are there any user-facing changes?
   No breaking changes. This extends the public API added in GH-45946 with the 
`VariantBuilder` class in the same `arrow::extension::variant` namespace.
   
   **AI Disclosure:** AI coding assistants were used during development for 
scaffolding, test generation, and review iteration. All code has been reviewed, 
debugged, and verified by the author, who owns and understands the changes.
   

