Re: [PR] Added test for count() method and documentation for count() [iceberg-python]

via GitHub Wed, 03 Sep 2025 11:03:40 -0700


gabeiglio commented on code in PR #2423:
URL: https://github.com/apache/iceberg-python/pull/2423#discussion_r2319592273



##########
tests/table/test_count.py:
##########
@@ -0,0 +1,58 @@
+import pytest
+from unittest.mock import MagicMock, Mock, patch
+from pyiceberg.table import DataScan
+from pyiceberg.expressions import AlwaysTrue
+
+class DummyFile:
+    def __init__(self, record_count):
+        self.record_count = record_count
+
+class DummyTask:
+    def __init__(self, record_count, residual=None, delete_files=None):
+        self.file = DummyFile(record_count)
+        self.residual = residual if residual is not None else AlwaysTrue()
+        self.delete_files = delete_files or []
+
+def test_count_basic():
+    # Create a mock table with the necessary attributes
+    table = Mock(spec=DataScan)

Review Comment:
   nit: We should call this variable `scan` rather than `table`, since we are 
mocking a `DataScan` object.
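
A minimal sketch of how the test could read after the rename. To keep it runnable without the library, it uses a hypothetical `MiniScan` stand-in instead of `pyiceberg.table.DataScan`; `MiniScan`, `make_task`, and `run_count` are illustrative names and not part of the PR or of PyIceberg. The binding trick (`count.__get__`) is the same one the PR's tests use.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock, Mock


class MiniScan:
    """Hypothetical stand-in for DataScan (assumption, not the PyIceberg class)."""

    def plan_files(self):
        raise NotImplementedError("replaced by a mock in the tests")

    def count(self):
        # Metadata-only path: sum the record counts of the planned files.
        return sum(task.file.record_count for task in self.plan_files())


def make_task(record_count):
    # Mirrors the shape of DummyTask from the PR: a file with a record count.
    return SimpleNamespace(
        file=SimpleNamespace(record_count=record_count),
        delete_files=[],
    )


def run_count(tasks):
    # The mocked object is a scan, so the variable is named `scan`.
    scan = Mock(spec=MiniScan)
    scan.plan_files = MagicMock(return_value=tasks)
    # Bind the real count() implementation to the mock.
    scan.count = MiniScan.count.__get__(scan, MiniScan)
    return scan.count()


def test_count_basic():
    assert run_count([make_task(42)]) == 42


def test_count_empty():
    assert run_count([]) == 0
```

Naming the variable after the mocked type keeps the test honest about what is under test: `count()` lives on the scan, not on the table.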



##########
tests/table/test_count.py:
##########
@@ -0,0 +1,58 @@
+import pytest
+from unittest.mock import MagicMock, Mock, patch
+from pyiceberg.table import DataScan
+from pyiceberg.expressions import AlwaysTrue
+
+class DummyFile:
+    def __init__(self, record_count):
+        self.record_count = record_count
+
+class DummyTask:
+    def __init__(self, record_count, residual=None, delete_files=None):
+        self.file = DummyFile(record_count)
+        self.residual = residual if residual is not None else AlwaysTrue()
+        self.delete_files = delete_files or []
+
+def test_count_basic():
+    # Create a mock table with the necessary attributes
+    table = Mock(spec=DataScan)
+
+    # Mock the plan_files method to return our dummy task
+    task = DummyTask(42, residual=AlwaysTrue(), delete_files=[])
+    table.plan_files = MagicMock(return_value=[task])
+
+    # Import and call the actual count method
+    from pyiceberg.table import DataScan as ActualDataScan
+    table.count = ActualDataScan.count.__get__(table, ActualDataScan)
+
+    assert table.count() == 42
+
+def test_count_empty():
+    # Create a mock table with the necessary attributes
+    table = Mock(spec=DataScan)

Review Comment:
   Same here: rename `table` to `scan`.



##########
tests/table/test_count.py:
##########
@@ -0,0 +1,58 @@
+import pytest
+from unittest.mock import MagicMock, Mock, patch
+from pyiceberg.table import DataScan
+from pyiceberg.expressions import AlwaysTrue
+
+class DummyFile:
+    def __init__(self, record_count):
+        self.record_count = record_count
+
+class DummyTask:
+    def __init__(self, record_count, residual=None, delete_files=None):
+        self.file = DummyFile(record_count)
+        self.residual = residual if residual is not None else AlwaysTrue()
+        self.delete_files = delete_files or []
+
+def test_count_basic():
+    # Create a mock table with the necessary attributes
+    table = Mock(spec=DataScan)
+
+    # Mock the plan_files method to return our dummy task
+    task = DummyTask(42, residual=AlwaysTrue(), delete_files=[])
+    table.plan_files = MagicMock(return_value=[task])
+
+    # Import and call the actual count method
+    from pyiceberg.table import DataScan as ActualDataScan
+    table.count = ActualDataScan.count.__get__(table, ActualDataScan)
+
+    assert table.count() == 42
+
+def test_count_empty():
+    # Create a mock table with the necessary attributes
+    table = Mock(spec=DataScan)
+
+    # Mock the plan_files method to return no tasks
+    table.plan_files = MagicMock(return_value=[])
+
+    # Import and call the actual count method
+    from pyiceberg.table import DataScan as ActualDataScan
+    table.count = ActualDataScan.count.__get__(table, ActualDataScan)
+
+    assert table.count() == 0
+
+def test_count_large():
+    # Create a mock table with the necessary attributes
+    table = Mock(spec=DataScan)

Review Comment:
   And here as well: rename `table` to `scan`.



##########
mkdocs/docs/recipe-count.md:
##########
@@ -0,0 +1,96 @@
+---
+title: Count Recipe - Efficiently Count Rows in Iceberg Tables
+---
+
+# Counting Rows in an Iceberg Table
+
+This recipe demonstrates how to use the `count()` function to efficiently count rows in an Iceberg table using PyIceberg. The count operation is optimized for performance by reading file metadata rather than scanning actual data.
+
+## How Count Works
+
+The `count()` method leverages Iceberg's metadata architecture to provide fast row counts by:
+
+1. **Reading file manifests**: Examines metadata about data files without loading the actual data
+2. **Aggregating record counts**: Sums up record counts stored in Parquet file footers
+3. **Applying filters at metadata level**: Pushes down predicates to skip irrelevant files
+4. **Handling deletes**: Automatically accounts for delete files and tombstones
+
+## Basic Usage
+
+Count all rows in a table:
+
+```python

Review Comment:
   It could be worth adding a note that the total row count of a table is 
also available from the snapshot summary:
   `table.current_snapshot().summary.additional_properties["total-records"]`
   
   That way users can avoid planning a full table scan when they only need 
the unfiltered total.
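
A rough sketch of what such a note could show, assuming the access path quoted above. The helper name `total_records_from_summary` is hypothetical, and the `fake_table` object is a stand-in built with `SimpleNamespace` so the snippet runs without a catalog; with a real table you would pass the loaded `Table` instead. Summary values are stored as strings, and a table may have no snapshot yet, so both cases are handled defensively.

```python
from types import SimpleNamespace
from typing import Optional


def total_records_from_summary(table) -> Optional[int]:
    """Return the snapshot's "total-records" value, or None if unavailable.

    Hypothetical helper: it only relies on the attribute path quoted in the
    review comment (current_snapshot().summary.additional_properties).
    """
    snapshot = table.current_snapshot()
    if snapshot is None or snapshot.summary is None:
        return None  # e.g. a freshly created table with no snapshot
    raw = snapshot.summary.additional_properties.get("total-records")
    return int(raw) if raw is not None else None


# Stand-in table for illustration only (assumption, not a PyIceberg object).
fake_summary = SimpleNamespace(additional_properties={"total-records": "1234"})
fake_snapshot = SimpleNamespace(summary=fake_summary)
fake_table = SimpleNamespace(current_snapshot=lambda: fake_snapshot)

total = total_records_from_summary(fake_table)
```

This only covers the unfiltered total, and the summary value may not reflect rows removed by delete files, so `count()` is still the right call when a row filter or deletes are involved.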



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

