timsaucer commented on code in PR #1241:
URL: 
https://github.com/apache/datafusion-python/pull/1241#discussion_r2355817186


##########
python/datafusion/dataframe.py:
##########
@@ -370,6 +375,39 @@ def schema(self) -> pa.Schema:
             Describing schema of the DataFrame
         """
         return self.df.schema()
+    
+    def to_batches(self) -> list[pa.RecordBatch]:
+        """Convert DataFrame to list of RecordBatches."""
+        return self.collect()  # delegate to existing method
+    
+    def interpolate(self, method: str = "forward_fill", **kwargs) -> DataFrame:
+        """Interpolate missing values per column.
+        
+        Args:
+            method: Interpolation method ('linear', 'forward_fill', 
'backward_fill')

Review Comment:
   At the outset this doesn't look quite right to me.
   
   - The method is called `interpolate` but the one interpolation method is the 
one not supported. The others are filling operations, not interpolations.
   - This looks like it's going to create a very wrong filling - every field in 
the schema gets sorted and then filled? I would expect we would need an 
ordering column to determine the filling method.
   - If we were to do this I think the most pleasant experience would have an 
enum for the possible values with the simple string conversions necessary.



##########
python/tests/test_dataframe.py:
##########
@@ -185,6 +185,47 @@ def get_header_style(self) -> str:
             "padding: 10px; border: 1px solid #3367d6;"
         )
 
+def test_to_batches(df):
+    """Test to_batches method returns list of RecordBatches."""
+    batches = df.to_batches()
+    assert isinstance(batches, list)
+    assert len(batches) > 0
+    assert all(isinstance(batch, pa.RecordBatch) for batch in batches)
+    
+
+    collect_batches = df.collect()
+    assert len(batches) == len(collect_batches)
+    for i, batch in enumerate(batches):
+        assert batch.equals(collect_batches[i])
+
+
+def test_interpolate_forward_fill(ctx):
+    """Test interpolate method with forward_fill."""
+    
+    batch = pa.RecordBatch.from_arrays(
+        [pa.array([1, None, 3, None]), pa.array([4.0, None, 6.0, None])],
+        names=["int_col", "float_col"],
+    )
+    df = ctx.create_dataframe([[batch]])
+    
+    result = df.interpolate("forward_fill")
+    
+    assert isinstance(result, DataFrame)

Review Comment:
   We would probably want to collect the results and verify they fill as 
expected.



##########
python/datafusion/dataframe.py:
##########
@@ -592,6 +630,9 @@ def tail(self, n: int = 5) -> DataFrame:
         """
         return DataFrame(self.df.limit(n, max(0, self.count() - n)))
 
+    @deprecated(
+    "collect() returning RecordBatch list is deprecated. Use to_batches() for 
RecordBatch list or collect() will return DataFrame in future versions"
+    )

Review Comment:
   I disagree with deprecating `collect`. I think we should keep our main 
functions close to those in `datafusion`, but with the appropriate sugar to 
make them pleasant to use in Python. Also, changing `collect` to return a 
dataframe instead of a list of record batches would be a *major* breaking 
change to many people's workflows.



##########
python/datafusion/dataframe.py:
##########
@@ -360,6 +362,9 @@ def describe(self) -> DataFrame:
         """
         return DataFrame(self.df.describe())
 
+    @deprecated(
+        "schema() is deprecated. Use :py:meth:`~DataFrame.get_schema` instead"
+    )

Review Comment:
   I'm confused by this. I don't see a `get_schema` method in this file and I 
don't see why we'd want to deprecate this method anyways.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to