JingsongLi commented on code in PR #7205:
URL: https://github.com/apache/paimon/pull/7205#discussion_r2762607267


##########
paimon-python/pypaimon/read/reader/concat_batch_reader.py:
##########
@@ -162,27 +162,56 @@ def __init__(
         self.field_offsets = field_offsets
         self.readers = readers
         self.schema = schema
+        self._buffers: List[Optional[Tuple[RecordBatch, int]]] = [None] * len(readers)
 
     def read_arrow_batch(self) -> Optional[RecordBatch]:
         batches: List[Optional[RecordBatch]] = [None] * len(self.readers)
         for i, reader in enumerate(self.readers):
             if reader is not None:
-                batch = reader.read_arrow_batch()
-                if batch is None:
-                    # all readers are aligned, as long as one returns null, the others will also have no data
-                    return None
-                batches[i] = batch
-        # Assemble record batches from batches based on row_offsets and field_offsets
+                if self._buffers[i] is not None:
+                    batch, offset = self._buffers[i]
+                    self._buffers[i] = None
+                    remainder = batch.slice(offset, batch.num_rows - offset)
+                    new_batch = reader.read_arrow_batch()

Review Comment:
   Why do we need to read a new batch when there is already a buffered one? Maybe just check a minimum batch size (for example, 1024), and only read a new batch when the buffer is smaller than that?
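
   The strategy the comment suggests can be sketched as follows. This is a hypothetical, simplified illustration (not the PR's actual `ConcatBatchReader`): plain Python lists stand in for Arrow `RecordBatch` objects, and the class name, `min_batch_size` parameter, and `read_batch` method are assumptions made for the example. The idea is to serve the buffered remainder directly when it already meets the minimum size, rather than unconditionally fetching and combining a new batch.

   ```python
   from typing import List, Optional

   class BufferedReader:
       """Illustrative stand-in for a batch reader with a leftover buffer.

       Plain lists play the role of RecordBatches. A new batch is read from
       the underlying source only when the buffered remainder is smaller
       than min_batch_size (e.g. 1024, as suggested in the review).
       """

       def __init__(self, batches: List[List[int]], min_batch_size: int = 1024):
           self._source = iter(batches)
           self._min_batch_size = min_batch_size
           self._buffer: List[int] = []

       def read_batch(self) -> Optional[List[int]]:
           # If the buffered remainder is already large enough, return it
           # as-is and skip reading (and concatenating) a new batch.
           if len(self._buffer) >= self._min_batch_size:
               out, self._buffer = self._buffer, []
               return out
           # Otherwise top up from the underlying reader.
           nxt = next(self._source, None)
           if nxt is None:
               # Source exhausted: flush whatever remains, or signal end.
               out, self._buffer = self._buffer, []
               return out if out else None
           combined = self._buffer + nxt
           self._buffer = []
           return combined
   ```

   With this shape, a large buffered slice is emitted without touching the underlying reader, so small leftover slices are merged into the next batch while big ones avoid an extra read and copy.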



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to