Re: [PR] [fix](iceberg) Write binary columns with proper Arrow types [doris]

via GitHub Mon, 29 Jun 2026 03:15:58 -0700


github-actions[bot] commented on code in PR #64949:
URL: https://github.com/apache/doris/pull/64949#discussion_r3490926058



##########
be/src/format/arrow/arrow_block_convertor.cpp:
##########
@@ -90,9 +203,15 @@ Status 
FromBlockToRecordBatchConverter::convert(std::shared_ptr<arrow::RecordBat
         }
         _cur_builder = builder.get();
         try {
-            RETURN_IF_ERROR(_cur_type->get_serde()->write_column_to_arrow(

Review Comment:
   This UUID conversion only handles top-level fields. 
`ArrowSchemaUtil::convert_to()` also tags UUID fields inside structs/lists/maps 
as `fixed_size_binary(16)` with `originalType=uuid`, while FE recursively maps 
those UUIDs to Doris `STRING` when `enable_mapping_varbinary` is false. Nested 
serialization bypasses this hook: the struct/list/map serdes call the child 
string serde directly, and that fixed-size branch still only accepts 16-byte 
payloads. A nested value like `struct<id: uuid>` containing 
`550e8400-e29b-41d4-a716-446655440000` will therefore still fail with `Fixed 
size binary column expects 16 bytes, got 36`. Please make the UUID 
string-to-byte conversion available in the recursive child paths as well, and 
add nested UUID coverage so the default mapping works beyond top-level columns.



##########
be/src/format/table/iceberg/arrow_schema_util.cpp:
##########
@@ -75,13 +76,25 @@ Status ArrowSchemaUtil::convert_to(const 
iceberg::NestedField& field,
         break;
     }
 
-    case iceberg::TypeID::BINARY:
     case iceberg::TypeID::STRING:
-    case iceberg::TypeID::UUID:
-    case iceberg::TypeID::FIXED:
         arrow_type = arrow::utf8();
         break;
 
+    case iceberg::TypeID::BINARY:

Review Comment:
   This maps Iceberg `binary` onto Arrow's 32-bit-offset `binary()` type, but 
the large-column safeguard still only handles `utf8`: 
`convert_to_arrow_batch()` upgrades oversized UTF8 columns to `large_utf8()`, 
while a binary column with `column->byte_size() >= MAX_ARROW_UTF8` stays on 
`BinaryBuilder`. The new string/varbinary writers also have no `LARGE_BINARY` 
branch, even though the read side supports large binary and the existing 
big-string test shows Doris expects to handle >2GB Arrow payloads. Large 
Iceberg binary writes can therefore fail on Arrow binary offset overflow. 
Please add the analogous `large_binary()` fallback and writer support, plus 
coverage mirroring the big-string case for an Iceberg/BINARY schema.



##########
be/src/core/data_type_serde/data_type_string_serde.cpp:
##########
@@ -246,6 +246,29 @@ Status 
DataTypeStringSerDeBase<ColumnType>::write_column_to_arrow(
     } else if (array_builder->type()->id() == arrow::Type::STRING) {
         auto& builder = assert_cast<arrow::StringBuilder&>(*array_builder);
         return write_column_to_arrow_impl(column, null_map, builder, start, 
end);
+    } else if (array_builder->type()->id() == arrow::Type::BINARY) {
+        auto& builder = assert_cast<arrow::BinaryBuilder&>(*array_builder);
+        return write_column_to_arrow_impl(column, null_map, builder, start, 
end);
+    } else if (array_builder->type()->id() == arrow::Type::FIXED_SIZE_BINARY) {
+        auto& builder = 
assert_cast<arrow::FixedSizeBinaryBuilder&>(*array_builder);
+        const int byte_width =
+                static_cast<const 
arrow::FixedSizeBinaryType&>(*array_builder->type()).byte_width();
+        const auto& string_column = assert_cast<const ColumnType&>(column);
+        for (size_t string_i = start; string_i < end; ++string_i) {
+            if (null_map && (*null_map)[string_i]) {
+                RETURN_IF_ERROR(checkArrowStatus(builder.AppendNull(), column, 
builder));
+                continue;
+            }
+            auto string_ref = string_column.get_data_at(string_i);
+            if (string_ref.size != byte_width) {

Review Comment:
   This width check also breaks the default Iceberg `fixed[N]` mapping. FE maps 
`fixed[4]` to Doris `CHAR(4)` when `enable_mapping_varbinary` is false, and 
Iceberg inserts coerce outputs to that target type, but Doris `CHAR` values are 
still held in variable-length `ColumnString` until the OLAP storage path pads 
them. The Iceberg writer bypasses `OlapColumnDataConvertorChar`, builds 
`fixed_size_binary(4)`, and calls this branch directly, so a normal `CHAR(4)` 
value such as `CAST('ab' AS CHAR(4))` reaches here with size 2 and fails with 
`Fixed size binary column expects 4 bytes, got 2`. Please either apply the same 
CHAR padding/alignment before writing Iceberg `fixed[N]`, or keep the default 
mapping/schema conversion on a type that requires exact-width binary payloads, 
and add coverage for a shorter `CHAR(N)` value.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [fix](iceberg) Write binary columns with proper Arrow types [doris]

Reply via email to