jorisvandenbossche commented on code in PR #39506:
URL: https://github.com/apache/arrow/pull/39506#discussion_r1507576623
##########
python/pyarrow/table.pxi:
##########
@@ -2483,6 +2549,254 @@ cdef class RecordBatch(_Tabular):
def __sizeof__(self):
return super(RecordBatch, self).__sizeof__() + self.nbytes
+ def add_column(self, int i, field_, column):
+ """
+ Add column to RecordBatch at position.
Review Comment:
```suggestion
Add column to RecordBatch at position i.
```
(I know this was copy-pasted from the existing docstring, but just from
reading it now, this sounds better to me.)
##########
python/pyarrow/table.pxi:
##########
@@ -2483,6 +2549,254 @@ cdef class RecordBatch(_Tabular):
def __sizeof__(self):
return super(RecordBatch, self).__sizeof__() + self.nbytes
+ def add_column(self, int i, field_, column):
+ """
+ Add column to RecordBatch at position.
+
+ A new record batch is returned with the column added, the original record batch
+ object is left unchanged.
+
+ Parameters
+ ----------
+ i : int
+ Index to place the column at.
+ field_ : str or Field
+ If a string is passed then the type is deduced from the column
+ data.
+ column : Array or value coercible to array
+ Column data.
+
+ Returns
+ -------
+ RecordBatch
+ New record batch with the passed column added.
+
+ Examples
+ --------
+ >>> import pyarrow as pa
+ >>> import pandas as pd
+ >>> df = pd.DataFrame({'n_legs': [2, 4, 5, 100],
+ ... 'animals': ["Flamingo", "Horse", "Brittle stars", "Centipede"]})
+ >>> batch = pa.RecordBatch.from_pandas(df)
+
+ Add column:
+
+ >>> year = [2021, 2022, 2019, 2021]
+ >>> batch.add_column(0, "year", year)
+ pyarrow.RecordBatch
+ year: int64
+ n_legs: int64
+ animals: string
+ ----
+ year: [2021,2022,2019,2021]
+ n_legs: [2,4,5,100]
+ animals: ["Flamingo","Horse","Brittle stars","Centipede"]
+
+ Original record batch is left unchanged:
+
+ >>> batch
+ pyarrow.RecordBatch
+ n_legs: int64
+ animals: string
+ ----
+ n_legs: [2,4,5,100]
+ animals: ["Flamingo","Horse","Brittle stars","Centipede"]
+ """
+ cdef:
+ shared_ptr[CRecordBatch] c_batch
+ Field c_field
+ Array c_arr
+
+ if isinstance(column, Array):
+ c_arr = column
+ else:
+ c_arr = array(column)
+
+ if isinstance(field_, Field):
+ c_field = field_
+ else:
+ c_field = field(field_, c_arr.type)
+
+ with nogil:
+ c_batch = GetResultValue(self.batch.AddColumn(
+ i, c_field.sp_field, c_arr.sp_array))
+
+ return pyarrow_wrap_batch(c_batch)
+
+ def append_column(self, field_, column):
Review Comment:
This is in theory a method that could be moved to the base class?
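For illustration, a minimal sketch of what that could look like, in plain Python rather than Cython; the `_Tabular` name mirrors the existing base class in `table.pxi`, and the idea is only that `append_column` can be written once in terms of the subclass's `add_column`:
```python
# Hypothetical sketch, not the actual pyarrow implementation: a shared
# append_column on the base class that delegates to add_column.
class _Tabular:
    def append_column(self, field_, column):
        # Appending is just inserting at the last position; both Table and
        # RecordBatch expose num_columns and add_column.
        return self.add_column(self.num_columns, field_, column)
```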
##########
python/pyarrow/table.pxi:
##########
@@ -5217,6 +5533,13 @@ def record_batch(data, names=None, schema=None, metadata=None):
if isinstance(data, (list, tuple)):
return RecordBatch.from_arrays(data, names=names, schema=schema, metadata=metadata)
+
+ elif isinstance(data, dict):
+ if names is not None:
+ raise ValueError(
+ "The 'names' argument is not valid when passing a dictionary")
+ return RecordBatch.from_pydict(data, schema=schema, metadata=metadata)
Review Comment:
Sorry, while going through your PR and seeing the docstring examples, I was
thinking again that we should support a dict in `record_batch`, and quickly did
a PR for that before getting to this point of your PR ... (and without reading
your description properly).
I included a few more docstring changes in
https://github.com/apache/arrow/pull/40292, so I will keep it for that PR.
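To illustrate the behavior this branch enables (whichever of the two PRs ends up carrying it), a small usage sketch, assuming the dict path is in place:
```python
import pyarrow as pa

# Build a RecordBatch directly from a dict of column name -> values;
# passing names= together with a dict raises the ValueError shown above.
batch = pa.record_batch({'n_legs': [2, 4, 5, 100],
                         'animals': ["Flamingo", "Horse", "Brittle stars", "Centipede"]})
print(batch.schema.names)  # ['n_legs', 'animals']
```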
##########
python/pyarrow/tests/test_table.py:
##########
@@ -1286,6 +1290,30 @@ def test_table_add_column():
assert t4.equals(expected)
+def test_record_batch_add_column():
Review Comment:
We could maybe parametrize the existing test for Table instead.
If you search for the `@pytest.mark.parametrize` cases already in this file,
you will see some other examples.
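One possible shape for that parametrization, as a sketch only (the test name and data are illustrative, and it assumes the `RecordBatch.add_column` added in this PR):
```python
import pyarrow as pa
import pytest

@pytest.mark.parametrize("cls", [pa.Table, pa.RecordBatch])
def test_add_column(cls):
    data = cls.from_pydict({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
    result = data.add_column(0, "c", pa.array([0.1, 0.2, 0.3]))
    # The new column is inserted at position 0 ...
    assert result.schema.names == ['c', 'a', 'b']
    # ... and the original object is left unchanged.
    assert data.schema.names == ['a', 'b']
```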
##########
python/pyarrow/table.pxi:
##########
@@ -2688,6 +3002,69 @@ cdef class RecordBatch(_Tabular):
return result
+ def cast(self, Schema target_schema, safe=None, options=None):
Review Comment:
A PR was merged that also added a `cast` method, though a bit different, so you
will have to deduplicate this when updating with the latest main.
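For reference, a hedged usage sketch of what `RecordBatch.cast` is expected to do once the two versions are reconciled (the exact signature should follow whichever implementation survives):
```python
import pyarrow as pa

batch = pa.RecordBatch.from_pydict({'n_legs': [2, 4, 5, 100]})
# Cast to a target schema with the same field name but a narrower type.
target = pa.schema([pa.field('n_legs', pa.int32())])
print(batch.cast(target).schema)  # n_legs: int32
```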
##########
cpp/src/arrow/record_batch.cc:
##########
@@ -303,6 +303,35 @@ Result<std::shared_ptr<RecordBatch>> RecordBatch::ReplaceSchema(
return RecordBatch::Make(std::move(schema), num_rows(), columns());
}
+std::vector<std::string> RecordBatch::ColumnNames() const {
+ std::vector<std::string> names(num_columns());
+ for (int i = 0; i < num_columns(); ++i) {
+ names[i] = schema()->field(i)->name();
+ }
+ return names;
+}
+
+Result<std::shared_ptr<RecordBatch>> RecordBatch::RenameColumns(
+ const std::vector<std::string>& names) const {
+ int n = static_cast<int>(num_columns());
Review Comment:
`num_columns` already returns an `int`; is the cast needed then?