Re: [PR] [python] optimize schema validation and support binary/large_binary type conversion [paimon]

via GitHub Mon, 06 Apr 2026 00:42:06 -0700


XiaoHongbo-Hope commented on code in PR #7088:
URL: https://github.com/apache/paimon/pull/7088#discussion_r3038421641



##########
paimon-python/pypaimon/tests/ray_data_test.py:
##########
@@ -115,6 +116,66 @@ def test_basic_ray_data_read(self):
         self.assertIsNotNone(ray_dataset, "Ray dataset should not be None")
         self.assertEqual(ray_dataset.count(), 5, "Should have 5 rows")
 
+    def test_ray_data_read_with_blob(self):

Review Comment:
   > This test can be passed in the master.
   
   You're right that the direct read → write roundtrip passes on master. The 
issue only occurs when `map_batches` is used with a function that returns a 
dict.
   
   When a user function passed to `map_batches` returns a Python dict (the most 
common pattern), PyArrow re-infers `bytes` values as `binary`, so the original 
`large_binary` (BLOB) type is lost: 
   `large_binary → to_pylist() → Python bytes → from_pydict() → binary`
   See the simple case `test_dict_return_loses_large_binary_type` for a 
clearer illustration.
   
   So the pipeline `to_ray()` → `map_batches(fn returning dict)` → 
`write_ray()` fails because `_validate_pyarrow_schema` rejects `binary` 
against a `large_binary` table schema. We can either cast the schema when 
writing from Ray, or let users work around it by converting to the right 
schema in their own code. Both should be OK; what is your suggestion?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
