zeroshade opened a new pull request, #802:
URL: https://github.com/apache/arrow-go/pull/802
### Rationale for this change
Fixes #184. `compute.CastArray` currently fails with `unsupported cast to
string_view from utf8` (and symmetrically for `binary_view`, inverse
directions, and FixedSizeBinary) because no cast kernels are registered for the
view variants. PyArrow has supported this since 18.0.0 (apache/arrow#43302);
this PR brings arrow-go to parity.
### What changes are included in this PR?
**New cast kernels** in
`arrow/compute/internal/kernels/binary_view_casts.go`:
| Direction | Kernel | Notes |
|---|---|---|
| `binary`, `large_binary`, `string`, `large_string`, `fixed_size_binary` →
`binary_view`/`string_view` | `CastBinaryToBinaryView` | Uses
`BinaryViewBuilder`/`StringViewBuilder`. UTF-8 validated when destination is
`string_view` and source is non-utf8, unless `CastOptions.AllowInvalidUtf8` is
set. |
| `binary_view`, `string_view` → `binary`, `large_binary`, `string`,
`large_string` | `CastBinaryViewToBinary[OutOffsetT]` | Materializes into a
contiguous data buffer. UTF-8 validated when casting `binary_view` →
`string`/`large_string`, unless `AllowInvalidUtf8` is set. Checks `int32`
overflow when targeting `string`/`binary`. |
| `binary_view` ↔ `string_view` (and identity) |
`CastBinaryViewToBinaryView` | Zero-copy via `ZeroCopyCastExec`; UTF-8
validated on `binary_view` → `string_view` direction. |
These are wired into the dispatch table via `cast_binary_view` and
`cast_string_view` entries in `cast.go`, and `addToBinaryKernels` now also
registers `view → base-binary` kernels so the existing
`cast_binary`/`cast_string`/etc. paths accept view inputs.
**Related executor/array fixes** (required to get view types through the
generic dispatch):
* `arrow/compute/exec.getNumBuffers` now returns 3 for
`binary_view`/`string_view` (bitmap + view headers + one overflow data buffer),
matching `ArraySpan`'s fixed `[3]BufferSpan`. Previously this fell through to
the `default: return 2` case, so the data buffer was dropped when
`ArraySpan.MakeData` sliced to `bufs[:NumBuffers()]`, which crashed any cast
that read non-inline values via `input.MakeArray()`.
* `arrow/array.getMaxBufferLen` and the `nullArrayFactory` now handle
`BinaryViewDataType`, so `MakeArrayOfNull` works for view outputs. Without
this, casting a zero-length input array to a view type panicked with
`arrayofnull not implemented for type string_view`.
### Are these changes tested?
Yes. Added four new `CastSuite` tests in `arrow/compute/cast_test.go`:
* `TestBinaryLikeToBinaryView` - all base-binary → view, FSB → view, short
(inline) + long (out-of-line) values + nulls, UTF-8 validation + opt-out via
`AllowInvalidUtf8`.
* `TestBinaryViewToBinaryLike` - view → all base-binary, UTF-8 validation
for `binary_view` → `string`.
* `TestBinaryViewToBinaryView` - cross-view casts with UTF-8 validation.
* `TestCanCastViewTypes` - `CanCast` coverage for the new paths.
These use a simple array-level comparison (`checkCastArrayOnly`) rather than
the existing `checkCast` because the latter iterates via `scalar.GetScalar`,
which does not yet support view types (separate limitation, outside scope).
Full `arrow/compute/...` and `arrow/array/...` test suites pass;
`arrow/compute` also passes with `-race`.
The exact reproducer from #184 now succeeds:
```
input: ["a" "b" "c" "this is a longer string beyond 12 bytes"]
cast to string_view: ["a" "b" "c" "this is a longer string beyond 12 bytes"]
type: string_view
```
### Are there any user-facing changes?
Yes - previously-unsupported cast paths now work:
* `compute.CastArray(ctx, arr, {ToType: BinaryTypes.StringView})` and
inverse.
* `compute.CastArray(ctx, arr, {ToType: BinaryTypes.BinaryView})` and
inverse.
* `binary_view` ↔ `string_view` casts.
* FSB → `binary_view`/`string_view`.
* `compute.CanCast` advertises all of the above.
No existing behavior changes. The change to `getNumBuffers` could in
principle affect other compute paths that manipulate view arrays, but the
previous value of `2` was incorrect for views (causing data-buffer loss), so
this is a bug fix rather than a behavioral change.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]