Re: Need help on ArrayaSpan and writing C++ udf

Weston Pace Mon, 17 Jul 2023 07:40:59 -0700

> I may be missing something, but why copy to *out_values++ instead of
> *out_values and add 32 to out_values afterwards? Otherwise I agree this is
> the way to go.


I agree with Jin.  You should probably be incrementing `out` by 32 each
time `VisitValue` is called.

On Mon, Jul 17, 2023 at 6:38 AM Aldrin <[email protected]> wrote:

> Oh wait, I see now that you're incrementing with a uint8_t*. That could be
> fine for your own use, but you might want to make sure it aligns with the
> type of your output (Int64Array vs Int32Array).
>
> Sent from Proton Mail for iOS
>
>
> On Mon, Jul 17, 2023 at 06:20, Aldrin <[email protected]
> <On+Mon,+Jul+17,+2023+at+06:20,+Aldrin+%3C%3Ca+href=>> wrote:
>
> Hi Wenbo,
>
> An ArraySpan is like an ArrayData but does not own the data, so the
> ColumnarFormat doc that Jon shared is relevant for both.
>
> In the case of a binary format, the output ArraySpan must have at least 2
> buffers: the offsets and the contiguous binary data (values). If the output
> of your UDF is something like an Int32Array with no nulls, then I think
> you're writing output correctly.
>
> But, since your pointer is a uint8_t, I think Jin is right and `++` is
> going to move your pointer 1 byte instead of 32 bytes like you intend.
>
> Sent from Proton Mail for iOS
>
>
> On Mon, Jul 17, 2023 at 05:06, Wenbo Hu <[email protected]
> <On+Mon,+Jul+17,+2023+at+05:06,+Wenbo+Hu+%3C%3Ca+href=>> wrote:
>
> Hi Jin,
>
> > but why copy to *out_values++ instead of
> > *out_values and add 32 to out_values afterwards?
> I'm implementing the sha256 function as a scalar function, but it
> always inputs with an array, so on visitor pattern, I'll write a 32
> byte hash into the pointer and move to the next for next visit.
> Something like:
> ```
>
> struct BinarySha256Visitor {
> BinarySha256Visitor(uint8_t **out) {
> this->out = out;
> }
> arrow::Status VisitNull() {
> return arrow::Status::OK();
> }
>
> arrow::Status VisitValue(std::string_view v) {
>
> uint8_t hash[32];
> sha256(v, hash);
>
> memcpy(*out++, hash, 32);
>
> return arrow::Status::OK();
> }
>
> uint8_t ** out;
> };
>
> arrow::Status Sha256Func(cp::KernelContext *ctx, const cp::ExecSpan
> &batch, cp::ExecResult *out) {
> arrow::ArraySpanVisitor<arrow::BinaryType> visitor;
>
> auto *out_values = out->array_span_mutable()->GetValues<uint8_t*>(1);
> BinarySha256Visitor visit(out_values);
> ARROW_RETURN_NOT_OK(visitor.Visit(batch[0].array, &visit));
>
> return arrow::Status::OK();
> }
> ```
> Is it as expected?
>
> Jin Shang <[email protected]> 于2023年7月17日周一 19:44写道：
> >
> > Hi Wenbo,
> >
> > I'd like to known what's the *three* `buffers` are in ArraySpan. What are
> > > `1` means when `GetValues` called?
> >
> > The meaning of buffers in an ArraySpan depends on the layout of its data
> > type. FixedSizeBinary is a fixed-size primitive type, so it has two
> > buffers, one validity buffer and one data buffer. So GetValues(1) would
> > return a pointer to the data buffer.
> > Layouts of data types can be found here[1].
> >
> > what is the actual type should I get from `GetValues`?
> > >
> > Buffer data is stored as raw bytes (uint8_t) but can be reinterpreted as
> > any type to suit your need. The template parameter for GetValue is simply
> > forwarded to reinterpret_cast. There are discussions[2] on the soundness
> of
> > using uint8_t to represent bytes but it is what we use now. Since you are
> > only doing a memcpy, uint8_t should be good.
> >
> > Maybe, `auto *out_values = out->array_span_mutable()->GetValues(uint8_t
> > > *>(1);` and `memcpy(*out_values++, some_ptr, 32);`?
> > >
> > I may be missing something, but why copy to *out_values++ instead of
> > *out_values and add 32 to out_values afterwards? Otherwise I agree this
> is
> > the way to go.
> >
> > [1]
> >
> https://arrow.apache.org/docs/format/Columnar.html#buffer-listing-for-each-layout
> > [2] https://github.com/apache/arrow/issues/36123
> >
> >
> > On Mon, Jul 17, 2023 at 4:44 PM Wenbo Hu <[email protected]> wrote:
> >
> > > Hi,
> > > I'm using Acero as the stream executor to run large scale data
> > > transformation. The core data used in UDF is `ArraySpan` in
> > > `ExecSpan`, but not much document on ArraySpan. I'd like to known
> > > what's the *three* `buffers` are in ArraySpan. What are `1` means when
> > > `GetValues` called?
> > > For input data, I can use a `ArraySpanVisitor` to iterator over
> > > different input types. But for output data, I don't know how to write
> > > to the`array_span_mutable()` if it is not a simple c_type.
> > > For example, I'm implementing a sha256 udf, which input is
> > > `arrow::utf8()` and the output is `arrow::fixed_size_binary(32)`, then
> > > how can I directly write to the out buffers and what is the actual
> > > type should I get from `GetValues`?
> > > Maybe, `auto *out_values =
> > > out->array_span_mutable()->GetValues(uint8_t *>(1);` and
> > > `memcpy(*out_values++, some_ptr, 32);`?
> > >
> > > --
> > > ---------------------
> > > Best Regards,
> > > Wenbo Hu,
> > >
>
>
>
> --
> ---------------------
> Best Regards,
> Wenbo Hu,
>
>

Re: Need help on ArrayaSpan and writing C++ udf

Reply via email to