kevinjqliu commented on code in PR #3364:
URL: https://github.com/apache/iceberg-python/pull/3364#discussion_r3409040690


##########
pyiceberg/table/__init__.py:
##########
@@ -2104,115 +2180,357 @@ def plan_files(self) -> Iterable[FileScanTask]:
         return self._plan_files_local()
 
     def to_arrow(self) -> pa.Table:
-        """Read an Arrow table eagerly from this DataScan.
+        return _to_arrow_via_file_scan_tasks(self, self.plan_files())
 
-        All rows will be loaded into memory at once.
+    def to_arrow_batch_reader(self) -> pa.RecordBatchReader:
+        return _to_arrow_batch_reader_via_file_scan_tasks(self, 
self.plan_files())
 
-        Returns:
-            pa.Table: Materialized Arrow Table from the Iceberg table's 
DataScan
-        """
+    def count(self) -> int:
         from pyiceberg.io.pyarrow import ArrowScan
 
-        return ArrowScan(
-            self.table_metadata, self.io, self.projection(), self.row_filter, 
self.case_sensitive, self.limit
-        ).to_table(self.plan_files())
+        # Usage: Calculates the total number of records in a Scan that haven't 
had positional deletes.
+        res = 0
+        # every task is a FileScanTask
+        tasks = self.plan_files()
 
-    def to_arrow_batch_reader(self) -> pa.RecordBatchReader:
-        """Return an Arrow RecordBatchReader from this DataScan.

Review Comment:
   would be good to keep the docstrings for this and other functions
   
   
   Actual dropped or weakened docs in pyiceberg/table/__init__.py:
   1. DataScan.to_arrow
   Old public docstring explained eager Arrow materialization and that all rows 
load into memory. Current method at line 2179 (line 2179) has no docstring. The 
helper at line 2056 (line 2056) has only a terse private/helper docstring.
   
   2. DataScan.to_arrow_batch_reader
   Old public docstring explained RecordBatchReader, lower memory use for large 
results, and return type. Current method at line 2182 (line 2182) has no 
docstring. The helper at line 2065 (line 2065) does not preserve the 
user-facing memory explanation.
   
   3. DataScan._check_sequence_number -> 
ManifestGroupPlanner._check_sequence_number
   Old private docstring explained why older delete manifests are skipped, and 
had the inline comment Not interested in deletes that are older than the data. 
Current method at line 2526 (line 2526) has no docstring/comment.
   
   And add equivalent ones for IncrementalAppendScan.to_arrow / 
to_arrow_batch_reader



##########
pyiceberg/table/__init__.py:
##########
@@ -1707,20 +1771,130 @@ def __init__(
         row_filter: str | BooleanExpression = ALWAYS_TRUE,
         selected_fields: tuple[str, ...] = ("*",),
         case_sensitive: bool = True,
-        snapshot_id: int | None = None,
         options: Properties = EMPTY_DICT,
         limit: int | None = None,
-        catalog: Catalog | None = None,
-        table_identifier: Identifier | None = None,
     ):
         self.table_metadata = table_metadata
         self.io = io
         self.row_filter = _parse_row_filter(row_filter)
         self.selected_fields = selected_fields
         self.case_sensitive = case_sensitive
-        self.snapshot_id = snapshot_id
         self.options = options
         self.limit = limit
+
+    @abstractmethod
+    def projection(self) -> Schema: ...
+
+    @abstractmethod
+    def plan_files(self) -> Iterable[ScanTask]: ...
+
+    @abstractmethod
+    def to_arrow(self) -> pa.Table: ...
+
+    @abstractmethod

Review Comment:
   ```suggestion
   ```
   
   `to_arrow_batch_reader` was previously not a `@abstractmethod`, adding it 
here means external TableScan subclasses that were valid before may no longer 
instantiate.
   
   it makes sense to add `@abstractmethod` but lets do that separately
   



##########
pyiceberg/table/__init__.py:
##########


Review Comment:
   DataScan._plan_files_local now duplicates most of 
ManifestGroupPlanner.plan_files. It is not a correctness issue, but a follow-up 
simplification could route DataScan through the shared planner too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to