XiaoHongbo-Hope commented on code in PR #8162:
URL: https://github.com/apache/paimon/pull/8162#discussion_r3372771929
##########
paimon-python/pypaimon/write/table_update_by_row_id.py:
##########
@@ -317,15 +321,49 @@ def _merge_update_with_original(
for i in range(original_data.num_rows)
]
else:
- # replace_with_mask fills mask=True positions with update values in order
- merged_columns[col_name] = pc.replace_with_mask(
- original_col, mask, update_col.cast(original_col.type)
- )
+ try:
+ merged_columns[col_name] = pc.replace_with_mask(
+ original_col, mask, update_col)
+ except pa.lib.ArrowNotImplementedError:
+ n = original_data.num_rows
+ combined = pa.concat_arrays(
+ [original_col, update_col])
+ offset = len(original_col)
+ indices = np.arange(n, dtype=np.int64)
+ for orig_pos, upd_idx in update_positions.items():
+ indices[orig_pos] = offset + upd_idx
+ merged_columns[col_name] = combined.take(
+ pa.array(indices))
merged_table = pa.table(merged_columns) if merged_columns else None
return merged_table, blob_columns
+ @staticmethod
+ def _coerce_column(col: pa.Array, target_type: pa.DataType) -> pa.Array:
+ try:
+ return col.cast(target_type)
+ except (pa.lib.ArrowNotImplementedError,
+ pa.lib.ArrowInvalid,
+ pa.lib.ArrowTypeError):
+ pass
+ pylist = col.to_pylist()
+ if pa.types.is_map(target_type):
+ converted = []
+ for row in pylist:
+ if row is None:
+ converted.append(None)
+ elif isinstance(row, dict):
Review Comment:
> `_coerce_column` drops every None value when converting inferred dict
input to map by filtering if v is not None. This loses valid map entries like
{'a': None} when callers pass natural dict-shaped PyArrow input without an
explicit schema. I reproduced it end-to-end: updating a map<string,string>
column with {'a': None} reads back as [], not [('a', None)]. The added test
covers null values only via explicit pa.map_ list-of-pairs schema, so it misses
this regression.
Thanks for catching this. I changed it to fail fast for now, because I don't
see a safe way to handle this case.
Once PyArrow has inferred schema-less dict input as a struct, the information
is already ambiguous: a None field can mean either a missing dict key that
Arrow padded with null, or an explicit null map value provided by the user. If
we drop nulls, {'a': None} is corrupted; if we keep nulls, heterogeneous dicts
like {'a': '1'}, {'b': '2'} get false null entries.
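
To make the ambiguity concrete, here is a minimal pure-Python sketch. It is not the code in this PR. It assumes the rows are shaped like what `to_pylist()` returns after PyArrow has inferred dict input as a struct, where a key missing from a row comes back as None. The helper `struct_rows_to_map_entries` and its `on_null` parameter are hypothetical names used only for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Optional[Dict[str, Any]]
MapEntries = Optional[List[Tuple[str, Any]]]


def struct_rows_to_map_entries(rows: List[Row],
                               on_null: str = "raise") -> List[MapEntries]:
    """Hypothetical helper: turn struct-shaped rows into map entries.

    on_null selects what to do with None field values:
      "drop"  - treat them as padding and discard them
      "keep"  - treat them as explicit null map values
      "raise" - refuse to guess (fail fast)
    """
    if on_null not in ("raise", "drop", "keep"):
        raise ValueError(f"unknown on_null policy: {on_null!r}")
    result: List[MapEntries] = []
    for row in rows:
        if row is None:
            # A null row stays a null map.
            result.append(None)
            continue
        if on_null == "raise" and any(v is None for v in row.values()):
            raise ValueError(
                "ambiguous None in struct-shaped input; "
                "pass an explicit map schema instead")
        result.append([(k, v) for k, v in row.items()
                       if on_null == "keep" or v is not None])
    return result


# The user wrote {'a': None}: the None is an explicit map value.
explicit_null = [{"a": None}]
# The user wrote {'a': '1'} and {'b': '2'}: the Nones are struct padding.
padded = [{"a": "1", "b": None}, {"a": None, "b": "2"}]

# "drop" loses the explicit null entry.
dropped = struct_rows_to_map_entries(explicit_null, on_null="drop")
# "keep" invents entries the user never wrote.
kept = struct_rows_to_map_entries(padded, on_null="keep")
```

Neither policy is right for both inputs, because the two cases are indistinguishable once the data is struct-shaped. Failing fast leaves the caller to state the intent with an explicit map schema.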