Hi Wenbo,

An ArraySpan is like an ArrayData but does not own the data, so the ColumnarFormat doc that Jon shared is relevant for both.

For a variable-size binary type, the output ArraySpan needs, in addition to the validity bitmap, at least 2 buffers: the offsets and the contiguous binary data (values). If the output of your UDF is a fixed-width type, something like an Int32Array with no nulls, then I think you're writing the output correctly.
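
To make the offsets/values split concrete, here is a minimal sketch in plain C++ (no Arrow headers). `BinaryBuffers`, `AppendValue` and `GetValue` are hypothetical names I made up for illustration, not Arrow APIs; the sketch only mirrors the idea that value `i` lives in the byte range `[offsets[i], offsets[i + 1])` of one contiguous data buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the two non-validity buffers of a binary array.
struct BinaryBuffers {
  std::vector<int32_t> offsets{0};  // n + 1 entries, always starting at 0
  std::string values;               // all values stored back to back
};

// Append one value: extend the data buffer, then record where the value ends.
inline void AppendValue(BinaryBuffers& b, std::string_view v) {
  b.values.append(v);
  b.offsets.push_back(static_cast<int32_t>(b.values.size()));
}

// Read value i back as the byte range [offsets[i], offsets[i + 1]).
inline std::string_view GetValue(const BinaryBuffers& b, std::size_t i) {
  const int32_t begin = b.offsets[i];
  const int32_t end = b.offsets[i + 1];
  return std::string_view(b.values).substr(begin, end - begin);
}
```

A fixed-width output such as `fixed_size_binary(32)` needs no offsets buffer at all, because slot `i` always starts at byte `i * 32` of the data buffer.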

But since the buffer is addressed through a `uint8_t` pointer, I think Jin is right: `++` advances the pointer by one element (1 byte for a `uint8_t*`), not by the 32 bytes you intend, so you need to add 32 explicitly after each `memcpy`.
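
Roughly what I have in mind, as a sketch in plain C++ rather than a drop-in Arrow kernel. `FixedWidthWriter` and `FakeHash` are hypothetical names, and `FakeHash` is only a placeholder for a real SHA-256 routine so the example stays self-contained. It assumes the output buffer already holds 32 bytes per row.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

constexpr std::size_t kHashWidth = 32;

// Hypothetical placeholder for a real SHA-256 routine: fills all 32 bytes
// with the low byte of the input length.
inline void FakeHash(std::string_view v, uint8_t* hash) {
  std::memset(hash, static_cast<int>(v.size() & 0xFF), kHashWidth);
}

struct FixedWidthWriter {
  explicit FixedWidthWriter(uint8_t* out) : out(out) {}

  // A null still occupies one 32-byte slot in the data buffer, so skip it.
  void VisitNull() { out += kHashWidth; }

  void VisitValue(std::string_view v) {
    uint8_t hash[kHashWidth];
    FakeHash(v, hash);
    std::memcpy(out, hash, kHashWidth);  // write into the current slot
    out += kHashWidth;                   // then advance by one full slot
  }

  uint8_t* out;  // plain byte pointer into the data buffer, not uint8_t**
};
```

Note that the null case advances the pointer too; otherwise every value after a null would land one slot too early.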



On Mon, Jul 17, 2023 at 05:06, Wenbo Hu <huwenbo1...@gmail.com> wrote:
Hi Jin,

> but why copy to *out_values++ instead of
> *out_values and add 32 to out_values afterwards?
I'm implementing the sha256 function as a scalar function, but its
input is always an array, so in the visitor I write a 32-byte hash
through the pointer and then move on to the next slot for the next visit.
Something like:
```
struct BinarySha256Visitor {
  BinarySha256Visitor(uint8_t **out) {
    this->out = out;
  }

  arrow::Status VisitNull() {
    return arrow::Status::OK();
  }

  arrow::Status VisitValue(std::string_view v) {
    uint8_t hash[32];
    sha256(v, hash);

    memcpy(*out++, hash, 32);

    return arrow::Status::OK();
  }

  uint8_t ** out;
};

arrow::Status Sha256Func(cp::KernelContext *ctx, const cp::ExecSpan &batch, cp::ExecResult *out) {
  arrow::ArraySpanVisitor<arrow::BinaryType> visitor;

  auto *out_values = out->array_span_mutable()->GetValues<uint8_t*>(1);
  BinarySha256Visitor visit(out_values);
  ARROW_RETURN_NOT_OK(visitor.Visit(batch[0].array, &visit));

  return arrow::Status::OK();
}
```
Is it as expected?

On Mon, Jul 17, 2023 at 19:44, Jin Shang <shangjin1...@gmail.com> wrote:
>
> Hi Wenbo,
>
> > I'd like to know what the *three* `buffers` in ArraySpan are. What does
> > `1` mean when `GetValues` is called?
>
> The meaning of buffers in an ArraySpan depends on the layout of its data
> type. FixedSizeBinary is a fixed-size primitive type, so it has two
> buffers, one validity buffer and one data buffer. So GetValues(1) would
> return a pointer to the data buffer.
> Layouts of data types can be found here[1].
>
> > what is the actual type I should get from `GetValues`?
>
> Buffer data is stored as raw bytes (uint8_t) but can be reinterpreted as
> any type to suit your needs. The template parameter for GetValues is simply
> forwarded to reinterpret_cast. There are discussions[2] on the soundness of
> using uint8_t to represent bytes but it is what we use now. Since you are
> only doing a memcpy, uint8_t should be good.
>
> > Maybe, `auto *out_values = out->array_span_mutable()->GetValues<uint8_t
> > *>(1);` and `memcpy(*out_values++, some_ptr, 32);`?
>
> I may be missing something, but why copy to *out_values++ instead of
> *out_values and add 32 to out_values afterwards? Otherwise I agree this is
> the way to go.
>
> [1]
> https://arrow.apache.org/docs/format/Columnar.html#buffer-listing-for-each-layout
> [2] https://github.com/apache/arrow/issues/36123
>
>
> On Mon, Jul 17, 2023 at 4:44 PM Wenbo Hu <huwenbo1...@gmail.com> wrote:
>
> > Hi,
> > I'm using Acero as the stream executor to run large-scale data
> > transformations. The core data structure used in a UDF is the `ArraySpan`
> > in `ExecSpan`, but there is not much documentation on ArraySpan. I'd like
> > to know what the *three* `buffers` in ArraySpan are. What does `1` mean
> > when `GetValues` is called?
> > For input data, I can use an `ArraySpanVisitor` to iterate over
> > different input types. But for output data, I don't know how to write
> > to the `array_span_mutable()` if it is not a simple c_type.
> > For example, I'm implementing a sha256 UDF whose input is
> > `arrow::utf8()` and whose output is `arrow::fixed_size_binary(32)`. How
> > can I write directly to the output buffers, and what is the actual
> > type I should get from `GetValues`?
> > Maybe, `auto *out_values =
> > out->array_span_mutable()->GetValues<uint8_t *>(1);` and
> > `memcpy(*out_values++, some_ptr, 32);`?
> >
> > --
> > ---------------------
> > Best Regards,
> > Wenbo Hu,
> >



--
---------------------
Best Regards,
Wenbo Hu,
