arashandishgar opened a new issue, #46128:
URL: https://github.com/apache/arrow/issues/46128
### Describe the enhancement requested
The current compute function for casting from other String Types to
StringView Types can be optimized in five field. I'm going to elaborate on each
of them.
<details>
<summary>1-Slicing </summary>
Running the following code
```
#define LOG(...) ARROW_LOGGER_INFO("", __VA_ARGS__)
template <typename I, typename O>
void CheckCast(int string_length, int offset, int offset_lengt) {
using Builder = typename TypeTraits<I>::BuilderType;
using ArrayType = typename TypeTraits<O>::ArrayType;
Builder builder;
for (int i = 0; i < string_length; i++) {
if (i % 3 == 0) {
ASSERT_OK(builder.AppendNull());
} else {
ASSERT_OK(builder.Append(std::string(4, 'a')));
}
}
std::shared_ptr<Array> input;
ASSERT_OK(builder.Finish(&input));
CastOptions options;
options.to_type = TypeTraits<O>::type_singleton();
ASSERT_OK_AND_ASSIGN(
auto datum, CallFunction("cast", {input->Slice(offset, offset_lengt)},
&options));
auto output = datum.array_as<ArrayType>();
auto data = output->data();
LOG("Size:", data->buffers[1]->size());
LOG("Offset:", data->offset);
LOG("Length:", data->length);
}
TEST(Test, CastStringToStringView) {
CheckCast<StringType, StringViewType>((1<<10)+1, 1<<10, 1);
}
```
will print the following similar
massage
```
/home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3289:
Size:16400
/home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3290:
Offset:1024
/home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3291:
Length:1
```
In other words, the compute function allocates 1024 elements for the view
buffer for only one element. ( I think the main focus of the author of this
function was to prevent copying of bitmap, However, in the slicing case it's
better the bitmap buffer is copied)
I've already changed the function to prevent extra allocation and I'd like
to send a pull request to solve it. (the same scenario exists for casting from
FixedSizeBinary to String/Binary View too).
</details>
<details>
<summary>2-Checking Overflow</summary>
The current implementation for casting to string view check overflowing
after allocating buffers
https://github.com/apache/arrow/blob/efaef48db06767d40dc0c08998a8b3fd25f2cbeb/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L397-L412
However, it's possible to check for overflow before validating UTF-8, as I
believe there's no need to allocate a buffer if an overflow occurs.
Additionally, there's no overflow check when casting from StringView types to
String types, which could allow a UTF-8 string to exceed the maximum int32_t
length.
https://github.com/apache/arrow/blob/efaef48db06767d40dc0c08998a8b3fd25f2cbeb/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L344-L346
For example
```
#define LOG(...) ARROW_LOGGER_INFO("", __VA_ARGS__)
template <typename I, typename O>
void CheckCast(int string_length, int string_width,int offset, int
offset_lengt) {
using Builder = typename TypeTraits<I>::BuilderType;
using ArrayType = typename TypeTraits<O>::ArrayType;
Builder builder;
for (int i = 0; i < string_length; i++) {
ASSERT_OK(builder.Append(std::string(string_width, 'a')));
}
std::shared_ptr<Array> input;
ASSERT_OK(builder.Finish(&input));
CastOptions options;
options.to_type = TypeTraits<O>::type_singleton();
ASSERT_OK_AND_ASSIGN(
auto datum, CallFunction("cast", {input->Slice(offset, offset_lengt)},
&options));
auto output = datum.array_as<ArrayType>();
auto data = output->data();
LOG("Size:", data->buffers[1]->size());
LOG("Size:", data->buffers[2]->size());
LOG("Offset:", data->offset);
LOG("Length:", data->length);
}
TEST(Test, CastStringToStringView) {
CheckCast<StringViewType, StringType>(8, 1<<30,0,1<<30);
}
```
and the massage is
```
/home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3285:
Size:36
/home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3286:
Size:8589934592
/home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3287:
Offset:0
/home/arashandishgar/Desktop/arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3288:
Length:8
```
</details>
<details>
<summary>3-Large Strings</summary>
I'm not sure whether the following behaviour should be called bug(Although
the code mentioned in its commeent as unpractical case)
https://github.com/apache/arrow/blob/efaef48db06767d40dc0c08998a8b3fd25f2cbeb/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L397-L411
The above logic prevents casting scenarios, such as 8 GB of data in a
LargeString type (where each element's length is less than the maximum
int32_t), to a StringView. I believe a simple implementation could be feasible,
though I recognize it would overlook the key benefit of StringView types, which
is to create a view of a data buffer while avoiding large data copies.
```
template <typename O, typename I>
enable_if_t<is_base_binary_type<I>::value &&
is_binary_view_like_type<O>::value,
Result<std::shared_ptr<ArrayData>>>
AddStringToStringView(const ArraySpan& input) {
using Builder = typename TypeTraits<O>::BuilderType;
Builder builder;
ARROW_RETURN_NOT_OK(VisitArraySpanInline<I>(
input, [&](std::string_view s) { return builder.Append(s); },
[&]() { return builder.AppendNull(); }));
std::shared_ptr<Array> array_out;
RETURN_NOT_OK(builder.Finish(&array_out));
return array_out->data();
}
```
</details>
<details>
<summary>4-Memory Bloat</summary>
https://github.com/apache/arrow/blob/efaef48db06767d40dc0c08998a8b3fd25f2cbeb/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L418-L441
The implementation above may lead to a state where it owns a buffer but
utilizes only a portion of it. Is it worthwhile to create a garbage collector
and transfer the data to a new buffer?
</details>
<details>
<summary>5-dedublicate</summary>
As mentioned in [1] (Section 4), a key feature of the String/Binary View is
preventing duplication. I believe this approach could be useful when converting
from Fixed to String/Binary View types to minimize memory usage.
</details>
1-
https://www.influxdata.com/blog/faster-queries-with-stringview-part-two-influxdb/
### Component(s)
C++
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]