Hi Wenbo, Sorry I wasn't clear enough. There are two issues with the code snippet.
As Aldrin pointed out, you are working with uint8_t*, which I don't think is right. The buffer stores bytes, not pointers. auto *out_values = out->array_span_mutable()->GetValues<uint8_t*>(1); > It should be GetValues<uint8_t>(1), not GetValues<uint8_t*>(1) memcpy(*out++, hash, 32); > It should be memcpy(out, hash, 32); out += 32; Also, I see that you are not handling nulls at all. I assume you want to propagate nulls from the input. In that case you'll want to set kernel.null_handling=NullHandling::INTERSECTION, so that the executor will handle the validity bitmap for you. Best, Jin On Mon, Jul 17, 2023 at 10:41 PM Weston Pace <[email protected]> wrote: > > I may be missing something, but why copy to *out_values++ instead of > > *out_values and add 32 to out_values afterwards? Otherwise I agree this > is > > the way to go. > > I agree with Jin. You should probably be incrementing `out` by 32 each > time `VisitValue` is called. > > On Mon, Jul 17, 2023 at 6:38 AM Aldrin <[email protected]> wrote: > > > Oh wait, I see now that you're incrementing with a uint8_t*. That could > be > > fine for your own use, but you might want to make sure it aligns with the > > type of your output (Int64Array vs Int32Array). > > > > Sent from Proton Mail for iOS > > > > > > On Mon, Jul 17, 2023 at 06:20, Aldrin <[email protected] > > <On+Mon,+Jul+17,+2023+at+06:20,+Aldrin+%3C%3Ca+href=>> wrote: > > > > Hi Wenbo, > > > > An ArraySpan is like an ArrayData but does not own the data, so the > > ColumnarFormat doc that Jon shared is relevant for both. > > > > In the case of a binary format, the output ArraySpan must have at least 2 > > buffers: the offsets and the contiguous binary data (values). If the > output > > of your UDF is something like an Int32Array with no nulls, then I think > > you're writing output correctly. > > > > But, since your pointer is a uint8_t, I think Jin is right and `++` is > > going to move your pointer 1 byte instead of 32 bytes like you intend. > > > > Sent from Proton Mail for iOS > > > > > > On Mon, Jul 17, 2023 at 05:06, Wenbo Hu <[email protected] > > <On+Mon,+Jul+17,+2023+at+05:06,+Wenbo+Hu+%3C%3Ca+href=>> wrote: > > > > Hi Jin, > > > > > but why copy to *out_values++ instead of > > > *out_values and add 32 to out_values afterwards? > > I'm implementing the sha256 function as a scalar function, but it > > always inputs with an array, so on visitor pattern, I'll write a 32 > > byte hash into the pointer and move to the next for next visit. > > Something like: > > ``` > > > > struct BinarySha256Visitor { > > BinarySha256Visitor(uint8_t **out) { > > this->out = out; > > } > > arrow::Status VisitNull() { > > return arrow::Status::OK(); > > } > > > > arrow::Status VisitValue(std::string_view v) { > > > > uint8_t hash[32]; > > sha256(v, hash); > > > > memcpy(*out++, hash, 32); > > > > return arrow::Status::OK(); > > } > > > > uint8_t ** out; > > }; > > > > arrow::Status Sha256Func(cp::KernelContext *ctx, const cp::ExecSpan > > &batch, cp::ExecResult *out) { > > arrow::ArraySpanVisitor<arrow::BinaryType> visitor; > > > > auto *out_values = out->array_span_mutable()->GetValues<uint8_t*>(1); > > BinarySha256Visitor visit(out_values); > > ARROW_RETURN_NOT_OK(visitor.Visit(batch[0].array, &visit)); > > > > return arrow::Status::OK(); > > } > > ``` > > Is it as expected? > > > > Jin Shang <[email protected]> 于2023年7月17日周一 19:44写道: > > > > > > Hi Wenbo, > > > > > > I'd like to known what's the *three* `buffers` are in ArraySpan. What > are > > > > `1` means when `GetValues` called? > > > > > > The meaning of buffers in an ArraySpan depends on the layout of its > data > > > type. FixedSizeBinary is a fixed-size primitive type, so it has two > > > buffers, one validity buffer and one data buffer. So GetValues(1) would > > > return a pointer to the data buffer. > > > Layouts of data types can be found here[1]. > > > > > > what is the actual type should I get from `GetValues`? > > > > > > > Buffer data is stored as raw bytes (uint8_t) but can be reinterpreted > as > > > any type to suit your need. The template parameter for GetValue is > simply > > > forwarded to reinterpret_cast. There are discussions[2] on the > soundness > > of > > > using uint8_t to represent bytes but it is what we use now. Since you > are > > > only doing a memcpy, uint8_t should be good. > > > > > > Maybe, `auto *out_values = out->array_span_mutable()->GetValues(uint8_t > > > > *>(1);` and `memcpy(*out_values++, some_ptr, 32);`? > > > > > > > I may be missing something, but why copy to *out_values++ instead of > > > *out_values and add 32 to out_values afterwards? Otherwise I agree this > > is > > > the way to go. > > > > > > [1] > > > > > > https://arrow.apache.org/docs/format/Columnar.html#buffer-listing-for-each-layout > > > [2] https://github.com/apache/arrow/issues/36123 > > > > > > > > > On Mon, Jul 17, 2023 at 4:44 PM Wenbo Hu <[email protected]> > wrote: > > > > > > > Hi, > > > > I'm using Acero as the stream executor to run large scale data > > > > transformation. The core data used in UDF is `ArraySpan` in > > > > `ExecSpan`, but not much document on ArraySpan. I'd like to known > > > > what's the *three* `buffers` are in ArraySpan. What are `1` means > when > > > > `GetValues` called? > > > > For input data, I can use a `ArraySpanVisitor` to iterator over > > > > different input types. But for output data, I don't know how to write > > > > to the`array_span_mutable()` if it is not a simple c_type. > > > > For example, I'm implementing a sha256 udf, which input is > > > > `arrow::utf8()` and the output is `arrow::fixed_size_binary(32)`, > then > > > > how can I directly write to the out buffers and what is the actual > > > > type should I get from `GetValues`? > > > > Maybe, `auto *out_values = > > > > out->array_span_mutable()->GetValues(uint8_t *>(1);` and > > > > `memcpy(*out_values++, some_ptr, 32);`? > > > > > > > > -- > > > > --------------------- > > > > Best Regards, > > > > Wenbo Hu, > > > > > > > > > > > > -- > > --------------------- > > Best Regards, > > Wenbo Hu, > > > > >
