arashandishgar commented on issue #44615:
URL: https://github.com/apache/arrow/issues/44615#issuecomment-2664750084
> > 2. My question is: what do you expect `extract_regex_span` to return? My
> > suggestion is a struct with one fixed_size_list field of length two per
> > capture group, holding the offset as the first value and the length as
> > the second. For example, for the regex
> > `"(?P<letter>[ab])(?P<digit>\\d)"` the struct array type would be
> > `auto type = struct_({field("letter_span", fixed_size_list(int64(), 2)),
> > field("digit_span", fixed_size_list(int64(), 2))});`
>
> I would suggest the following if the input is a regular string array (i.e.
> with 32-bit offsets):
>
> auto type = struct_({
>   field("letter", struct_({field("start", int32()), field("length", int32())})),
>   field("digit", struct_({field("start", int32()), field("length", int32())}))
> });
> and the following if the input is a large string array (i.e. with 64-bit
> offsets):
>
> auto type = struct_({
>   field("letter", struct_({field("start", int64()), field("length", int64())})),
>   field("digit", struct_({field("start", int64()), field("length", int64())}))
> });
> But I'm open to other suggestions as well. cc
> [@jorisvandenbossche](https://github.com/jorisvandenbossche)
> [@zanmato1984](https://github.com/zanmato1984) for ideas.
>
I have a question about which offset type should be used to store the span.
According to this
[link](https://github.com/google/re2/issues/24#issuecomment-97653183), the
offset of a match can be obtained by pointer arithmetic. Is it safe to store
that offset in 32 bits for a regular string array?
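
To make the question concrete, here is a minimal sketch of the pointer arithmetic I have in mind. It uses only `std::string_view` so that it is self-contained; the `submatch` argument stands in for the view that the regex engine fills in for a capture group. The names `Span` and `SpanInValue` are hypothetical and not part of Arrow or RE2, and treating an unmatched group as a view with a null data pointer is an assumption of this sketch. The 32-bit cast relies on the fact that a regular string array uses `int32` offsets, so no single value can be longer than `INT32_MAX` bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <string_view>

// Hypothetical result type: one (start, length) pair per capture group.
struct Span {
  int32_t start;
  int32_t length;
};

// Computes the span of `submatch` relative to the start of `value` using
// pointer arithmetic. `submatch` must be a view into the same buffer as
// `value`. Returns std::nullopt when the group did not participate in the
// match (modelled here, as an assumption, by a null data pointer).
inline std::optional<Span> SpanInValue(std::string_view value,
                                       std::string_view submatch) {
  if (submatch.data() == nullptr) return std::nullopt;

  // The capture must lie entirely inside the value it was matched against.
  assert(submatch.data() >= value.data());
  assert(submatch.data() + submatch.size() <= value.data() + value.size());

  // For a regular string array every value is at most INT32_MAX bytes long,
  // so both the start and the length fit in 32 bits.
  assert(value.size() <=
         static_cast<std::size_t>(std::numeric_limits<int32_t>::max()));

  const std::ptrdiff_t start = submatch.data() - value.data();
  return Span{static_cast<int32_t>(start),
              static_cast<int32_t>(submatch.size())};
}
```

Note that the offset here is relative to the start of the individual string value, not to the start of the array's data buffer. If that is the intended meaning of `start`, then `int32` looks sufficient for a regular string array and `int64` would only be needed for a large string array.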