github-actions[bot] commented on code in PR #64060:
URL: https://github.com/apache/doris/pull/64060#discussion_r3346678153


##########
be/src/format/arrow/arrow_block_convertor.cpp:
##########
@@ -80,8 +80,15 @@ Status FromBlockToRecordBatchConverter::convert(std::shared_ptr<arrow::RecordBat
         _cur_type = _block.get_by_position(idx).type;
         auto column = _cur_col->convert_to_full_column_if_const();
         auto arrow_type = _schema->field(idx)->type();
-        if (arrow_type->name() == "utf8" && column->byte_size() >= MAX_ARROW_UTF8) {
-            arrow_type = arrow::large_utf8();
+        if (arrow_type->id() == arrow::Type::STRING) {
+            const auto column_byte_size = column->byte_size();

Review Comment:
   This guard rejects valid UTF8 batches because `column->byte_size()` is not 
Arrow's UTF8 value-buffer size. For a normal `ColumnString`, `byte_size()` is 
`chars.size() + offsets.size() * sizeof(offsets[0])`, and for 
`Nullable(String)` it also includes the null map via 
`ColumnNullable::byte_size()`. Arrow's string limit is enforced on the 32-bit 
value offsets/value data, not Doris' offset/null-map overhead. A batch with 
slightly less than 2 GiB of string payload plus enough rows to add 
offsets/null-map bytes will now return `INVALID_ARGUMENT` even though 
`arrow::StringBuilder` could encode it. The same false rejection can happen 
with the row-range overload: a selected slice may be under the limit while the 
full column `byte_size()` is over it. Please base the check on the selected 
rows' string payload bytes, or let `StringBuilder` fail and convert that Arrow 
status.
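
   One possible shape for the payload-based check is sketched below. It is not Doris code: `EndOffsets`, `selected_payload_bytes`, and `fits_32bit_offsets` are hypothetical names, and a plain `std::vector` stands in for the column's offsets under the assumption that entry `i` holds the end position of row `i` in the character buffer. The `INT32_MAX` bound is also an assumption that follows from Arrow's 32-bit offsets; the limit `StringBuilder` actually enforces may be slightly lower.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for a string column's offsets: end_offsets[i] is the
// byte position just past row i in the shared character buffer.
using EndOffsets = std::vector<std::uint64_t>;

// Returns the string payload bytes of rows [row_begin, row_end).
// Offsets storage and any null map are deliberately not counted.
std::uint64_t selected_payload_bytes(const EndOffsets& end_offsets,
                                     std::size_t row_begin,
                                     std::size_t row_end) {
    assert(row_begin <= row_end && row_end <= end_offsets.size());
    if (row_begin == row_end) {
        return 0;  // empty slice carries no payload
    }
    // The first selected row starts where the previous row ended (or at 0).
    const std::uint64_t start = row_begin == 0 ? 0 : end_offsets[row_begin - 1];
    return end_offsets[row_end - 1] - start;
}

// Assumed bound: a 32-bit signed offset cannot address more than INT32_MAX
// bytes of value data.
bool fits_32bit_offsets(std::uint64_t payload_bytes) {
    return payload_bytes <=
           static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}
```

   With a check like this, a slice is judged by its own payload, so a small row range taken from a very large column would still be accepted.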



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

