> I may be missing something, but why copy to *out_values++ instead of > *out_values and add 32 to out_values afterwards? Otherwise I agree this is > the way to go.
I agree with Jin. You should probably be incrementing `out` by 32 each time `VisitValue` is called. On Mon, Jul 17, 2023 at 6:38 AM Aldrin <octalene....@pm.me.invalid> wrote: > Oh wait, I see now that you're incrementing with a uint8_t*. That could be > fine for your own use, but you might want to make sure it aligns with the > type of your output (Int64Array vs Int32Array). > > Sent from Proton Mail for iOS > > > On Mon, Jul 17, 2023 at 06:20, Aldrin <octalene....@pm.me.INVALID > <On+Mon,+Jul+17,+2023+at+06:20,+Aldrin+%3C%3Ca+href=>> wrote: > > Hi Wenbo, > > An ArraySpan is like an ArrayData but does not own the data, so the > ColumnarFormat doc that Jon shared is relevant for both. > > In the case of a binary format, the output ArraySpan must have at least 2 > buffers: the offsets and the contiguous binary data (values). If the output > of your UDF is something like an Int32Array with no nulls, then I think > you're writing output correctly. > > But, since your pointer is a uint8_t, I think Jin is right and `++` is > going to move your pointer 1 byte instead of 32 bytes like you intend. > > Sent from Proton Mail for iOS > > > On Mon, Jul 17, 2023 at 05:06, Wenbo Hu <huwenbo1...@gmail.com > <On+Mon,+Jul+17,+2023+at+05:06,+Wenbo+Hu+%3C%3Ca+href=>> wrote: > > Hi Jin, > > > but why copy to *out_values++ instead of > > *out_values and add 32 to out_values afterwards? > I'm implementing the sha256 function as a scalar function, but it > always inputs with an array, so on visitor pattern, I'll write a 32 > byte hash into the pointer and move to the next for next visit. > Something like: > ``` > > struct BinarySha256Visitor { > BinarySha256Visitor(uint8_t **out) { > this->out = out; > } > arrow::Status VisitNull() { > return arrow::Status::OK(); > } > > arrow::Status VisitValue(std::string_view v) { > > uint8_t hash[32]; > sha256(v, hash); > > memcpy(*out++, hash, 32); > > return arrow::Status::OK(); > } > > uint8_t ** out; > }; > > arrow::Status Sha256Func(cp::KernelContext *ctx, const cp::ExecSpan > &batch, cp::ExecResult *out) { > arrow::ArraySpanVisitor<arrow::BinaryType> visitor; > > auto *out_values = out->array_span_mutable()->GetValues<uint8_t*>(1); > BinarySha256Visitor visit(out_values); > ARROW_RETURN_NOT_OK(visitor.Visit(batch[0].array, &visit)); > > return arrow::Status::OK(); > } > ``` > Is it as expected? > > Jin Shang <shangjin1...@gmail.com> 于2023年7月17日周一 19:44写道: > > > > Hi Wenbo, > > > > I'd like to known what's the *three* `buffers` are in ArraySpan. What are > > > `1` means when `GetValues` called? > > > > The meaning of buffers in an ArraySpan depends on the layout of its data > > type. FixedSizeBinary is a fixed-size primitive type, so it has two > > buffers, one validity buffer and one data buffer. So GetValues(1) would > > return a pointer to the data buffer. > > Layouts of data types can be found here[1]. > > > > what is the actual type should I get from `GetValues`? > > > > > Buffer data is stored as raw bytes (uint8_t) but can be reinterpreted as > > any type to suit your need. The template parameter for GetValue is simply > > forwarded to reinterpret_cast. There are discussions[2] on the soundness > of > > using uint8_t to represent bytes but it is what we use now. Since you are > > only doing a memcpy, uint8_t should be good. > > > > Maybe, `auto *out_values = out->array_span_mutable()->GetValues(uint8_t > > > *>(1);` and `memcpy(*out_values++, some_ptr, 32);`? > > > > > I may be missing something, but why copy to *out_values++ instead of > > *out_values and add 32 to out_values afterwards? Otherwise I agree this > is > > the way to go. > > > > [1] > > > https://arrow.apache.org/docs/format/Columnar.html#buffer-listing-for-each-layout > > [2] https://github.com/apache/arrow/issues/36123 > > > > > > On Mon, Jul 17, 2023 at 4:44 PM Wenbo Hu <huwenbo1...@gmail.com> wrote: > > > > > Hi, > > > I'm using Acero as the stream executor to run large scale data > > > transformation. The core data used in UDF is `ArraySpan` in > > > `ExecSpan`, but not much document on ArraySpan. I'd like to known > > > what's the *three* `buffers` are in ArraySpan. What are `1` means when > > > `GetValues` called? > > > For input data, I can use a `ArraySpanVisitor` to iterator over > > > different input types. But for output data, I don't know how to write > > > to the`array_span_mutable()` if it is not a simple c_type. > > > For example, I'm implementing a sha256 udf, which input is > > > `arrow::utf8()` and the output is `arrow::fixed_size_binary(32)`, then > > > how can I directly write to the out buffers and what is the actual > > > type should I get from `GetValues`? > > > Maybe, `auto *out_values = > > > out->array_span_mutable()->GetValues(uint8_t *>(1);` and > > > `memcpy(*out_values++, some_ptr, 32);`? > > > > > > -- > > > --------------------- > > > Best Regards, > > > Wenbo Hu, > > > > > > > -- > --------------------- > Best Regards, > Wenbo Hu, > >